Skip to content

New parser, bug fixes and new functionality.#121

Merged
macbre merged 23 commits into
macbre:masterfrom
collerek:new_parser
Apr 30, 2021
Merged

New parser, bug fixes and new functionality.#121
macbre merged 23 commits into
macbre:masterfrom
collerek:new_parser

Conversation

@collerek
Copy link
Copy Markdown
Collaborator

@collerek collerek commented Apr 27, 2021

Hi, first of all thanks for an awesome job with the library!
The rules you invented to classify given types of metadata (columns, tables etc.) were of great value!

Now I needed something like #78 so I had a go for it.

New Parser

In the process I read your #83 and #98 issues and they contain API that I like much more than the old one, so I decided to try implementing the new parser and iteration described in those issues along the way (more or less - i.e. you don't write how would you like to treat multi - token names (like db.schema.table etc.). So it's not exactly how you described it but it's close and I think this is the step in the right direction to get it how you see it in the future.

Functionalities

Apart from implementing #83 and #98 the new functionalities are:

Bug fixes

Fixed yet unreported bugs:

  • Regex used to remove comments was greedy so if two (or more) comments were in a query it was matching also query text between the comments
  • Fix wrongly excluded columns in update queries after SET (update table set column = 'aa') etc.
  • Fix wrongly classified as columns names of the subqueries

Tests

All tests that were commented out or skipped now run and pass. I divided tests into smaller functions and multiple files to allow easier testing of selected sections of functionality if something fails. Coverage 100% and new tests for new functionalities.

So basically it should close all issues but #35 and at the same time introduces a lot of improvements etc.

Open issues

Now I also want to introduce column aliases but got stuck here a little as some issues require clarification.

Column aliases

As of now when you have a query like:
Select column1 as myname from table both column1 and myname go into the columns, while they are the same column.
I believe aliases like this should be resolved back to the most base column, and aliases like myname should not end up in columns but I need your opinion on this. That also leaves cases like select count(*) as counter, should counter be included in columns?

Motivation

My use case is that I need all real (and only real) database columns used in a parsed query to later reflect used databases and check column types. That means that (in perfect world) I would get:

  • recursively resolve column names so select col1 as myname from table should return only table.col1 in columns
  • recursively resolve names of columns from subqueries so select sub.test1 from (select aa as test1 from table) sub should return columns = ["table.aa"] as sub.test1 is a subquery alias and test1 is alias of aa.
  • try to assign table for each column even if prefix is not set so: select aa.col1, col2 from table aa should return ["table.col1", "table.col2"] and there should be a way of returning column with table in a dict. Of course if table cannot be resolved (like select aa, bb from tab1, tab2` where you cannot tell which column is from which table both column should return a list of ["tab1", "tab2"]

I can already get and extract subqueries but would need column aliases (and later optionally best match assigned tables to columns). But for that I need confirmation from you that I think right and aliases of columns should not end up in columns. Or alternatively there should be a way to get only aliases (as they are in end result of the query i.e. in select) or only the real database columns.

In my opinion the end product is much more concise, cleaner and easier to maintain, but it's basically re-written from scratch so let me know what you think! :) I updated readme so best to start there, rest is in tests since you do not have documentation.


Resolves #103, resolves #83, resolves #98, resolves #78 and resolves #120.

Comment thread pyproject.toml Outdated
@macbre
Copy link
Copy Markdown
Owner

macbre commented Apr 27, 2021

Wow, this is massive! Thanks, @collerek. I'll take a deeper look into your changes in the following days.

@macbre macbre added this to the v2.0 milestone Apr 27, 2021
@macbre
Copy link
Copy Markdown
Owner

macbre commented Apr 29, 2021

My use case is that I need all real (and only real) database columns used in a parsed query to later reflect used databases and check column types. That means that (in perfect world) I would get:

That's my idea behind aliases handling too. I'd like sql-metadat to return real columns that are used in the query, to allow other tools to analyse which columns from the real tables are used.

Hence, select count(*) as counter should not add any column to the list - counter is an alias of an aggregate function.

It's a more complex topic, so let's move the discussion to a different PR / issue and continue with the refactoring / new API here.

In my opinion the end product is much more concise, cleaner and easier to maintain, but it's basically re-written from scratch so let me know what you think! :) I updated readme so best to start there, rest is in tests since you do not have documentation.

I do agree! Thank you very much!

Comment thread test/test_getting_tables.py Outdated
Comment thread sql_metadata/generalizator.py Outdated
Comment thread sql_metadata/parser.py Outdated
Comment thread sql_metadata/token.py
Comment thread sql_metadata/parser.py
collerek and others added 2 commits April 29, 2021 13:07
Co-authored-by: Maciej Brencz <maciej.brencz@gmail.com>
Co-authored-by: Maciej Brencz <maciej.brencz@gmail.com>
Comment thread sql_metadata/keywords_lists.py Outdated
Comment thread test/test.sql Outdated
Comment thread test/test_caching.py Outdated
Comment thread test/test_complex_aliases.py
Comment thread test/test_complex_aliases.py Outdated
collerek and others added 3 commits April 29, 2021 15:32
Co-authored-by: Maciej Brencz <maciej.brencz@gmail.com>
…x aliases test, remove double black from pyproject.toml, fix date function format in complex query in sql file
Comment thread test/test_caching.py
Comment thread .github/workflows/python-ci.yml
@macbre
Copy link
Copy Markdown
Owner

macbre commented Apr 29, 2021

You want me to change the names or you will fix that?

Please do so.

Co-authored-by: Maciej Brencz <maciej.brencz@gmail.com>
@macbre
Copy link
Copy Markdown
Owner

macbre commented Apr 29, 2021

Wow! That was quite a journey with code reviewing your changes, @collerek. Let's address the remaining open comments and we're ready to merge this one.

Once again - dzięki 🙂 🇵🇱

@collerek
Copy link
Copy Markdown
Collaborator Author

Hope that's not due to the crappy quality of my code 🤣

Polecam się na przyszłość i dzięki za super libkę 😉 🇵🇱

@macbre
Copy link
Copy Markdown
Owner

macbre commented Apr 29, 2021

Hope that's not due to the crappy quality of my code

Not at all.

@collerek
Copy link
Copy Markdown
Collaborator Author

Any reason you didn't mark #112, #85 and #97 as resolved?
Need more tests?

Let me know what are the next steps as I would like to help with column aliases if I can as this is my final goal :)

@macbre
Copy link
Copy Markdown
Owner

macbre commented Apr 29, 2021

Any reason you didn't mark #112, #85 and #97 as resolved? Need more tests?

I just added this small comment to mark these additional issues as resolved once this PR is merged (via "resolved #..." GH feature). All is good. Sorry for the confusion here.

@macbre
Copy link
Copy Markdown
Owner

macbre commented Apr 29, 2021

Let me know what are the next steps as I would like to help with column aliases if I can as this is my final goal :)

Let's tackle aliases in the next PR, ok? I'll complete the review and merge this one tomorrow.

@collerek
Copy link
Copy Markdown
Collaborator Author

Let's tackle aliases in the next PR, ok? I'll complete the review and merge this one tomorrow.

Sure, that's what I had in mind, saw that you added more issues to v.2.0 milestone that's why I'm asking if you are merging this one or wait for some other stuff.

We still have two new issues coming from this one (ability to pass file (or file path?) and refactoring token logic into token init, that's why I'm asking what is the plan, no rush or whatever, just curious :)

Comment thread test/test_normalization.py Outdated
Co-authored-by: Maciej Brencz <maciej.brencz@gmail.com>
@collerek
Copy link
Copy Markdown
Collaborator Author

I just realized that removing comments in _preprocess in init is not really necessary and since we use sqlparse now in remove_comments it parses the query two times which hits performance, maybe we should revert and remove the step if it's working without it anyway (more tokens in parsed query with comments but I think it's still quicker than removing comments)?

Can you check performance on your machine? (as it is now and without step 3 in preprocess)

Comment thread sql_metadata/parser.py Outdated
Co-authored-by: Maciej Brencz <maciej.brencz@gmail.com>
@macbre
Copy link
Copy Markdown
Owner

macbre commented Apr 30, 2021

I just realized that removing comments in _preprocess in init is not really necessary and since we use sqlparse now in remove_comments it parses the query two times which hits performance, maybe we should revert and remove the step if it's working without it anyway (more tokens in parsed query with comments but I think it's still quicker than removing comments)?

Sure, please do that and add a test case in test_preprocessing with a query with a comment. This will make sure that we keep the behaviour of pre-processing consistent and don't break our contract.

sql-metadata user can still call parser.without_comments to get the query without comments.

Comment thread test/test_query.py Outdated
Comment on lines +33 to +38
# normalize newlines
assert (
Parser("SELECT /*my random comment*/ foo, id FROM `db`.`test`").query
== "SELECT /*my random comment*/ foo, id FROM db.test"
)

Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👌🏻

@collerek
Copy link
Copy Markdown
Collaborator Author

I just realized that removing comments in _preprocess in init is not really necessary and since we use sqlparse now in remove_comments it parses the query two times which hits performance, maybe we should revert and remove the step if it's working without it anyway (more tokens in parsed query with comments but I think it's still quicker than removing comments)?

Sure, please do that and add a test case in test_preprocessing with a query with a comment. This will make sure that we keep the behaviour of pre-processing consistent and don't break our contract.

sql-metadata user can still call parser.without_comments to get the query without comments.

I reverted the step 3 and added test in preprocess but since we don't remove it there is not much to test 😅
I added this step in one of the recent commits with switching to sqlparse, it was not there in old parser and in new parser just few commits back.

Comment thread test/test_query.py Outdated
@macbre macbre enabled auto-merge April 30, 2021 09:22
@macbre macbre merged commit dce28a7 into macbre:master Apr 30, 2021
@macbre
Copy link
Copy Markdown
Owner

macbre commented Apr 30, 2021

We're ready to merge once CI passes 🙂 Thanks for your great work here, @collerek.

Looking forward to your next PRs and ideas. Have a good one.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment