New parser, bug fixes and new functionality. by collerek · Pull Request #121 · macbre/sql-metadata

collerek · 2021-04-27T17:37:40Z

Hi, first of all thanks for an awesome job with the library!
The rules you invented to classify given types of metadata (columns, tables etc.) were of great value!

Now I needed something like #78 so I had a go for it.

New Parser

In the process I read your #83 and #98 issues and they contain API that I like much more than the old one, so I decided to try implementing the new parser and iteration described in those issues along the way (more or less - i.e. you don't write how would you like to treat multi - token names (like db.schema.table etc.). So it's not exactly how you described it but it's close and I think this is the step in the right direction to get it how you see it in the future.

Functionalities

Apart from implementing #83 and #98 the new functionalities are:

Ability to get a dictionary containing column names divided by part of the query the column is located in (one of select, where, order_by, join, insert and update). Should close Identify which part of the query columns are a part of #78. (tested)
Support for extracting with statement names so they are excluded from tables. (note that I corrected the sql examples as in most backends (all supported?) the with as subquery should be wrapped in parenthesis). Should close missing tables returned by sql_metadata.get_query_tables #112 (tested)
Support for quoted parameters (basically removes them as in all supported backends they are used to wrap db identifiers, single quotes are used for strings). Should close Support Postgres quoted identifiers #85 (tested)
Introduce support for aliases without the as keyword. Should close get_query_table_aliases(query) is far from being final #97 (in much simpler way) (tested)
Introduce getting values list and values_dict (column, value pairs) from insert queries. Should close Best way to get an array of values? #103 (tested)
Introduce a way to extract subqueries with their names in a dict
Introduce a way to get subqueries names as a list
Introduce a way to get columns without the ones coming from subqueries
Introduce a way to extract inline comments as a list
In the process changes and fixes causes also missing tables returned by sql_metadata.get_query_tables (2) #120 to work correctly (tested)
Parser properties are cached (in "protected" vars) for subsequent quicker access

Bug fixes

Fixed yet unreported bugs:

Regex used to remove comments was greedy so if two (or more) comments were in a query it was matching also query text between the comments
Fix wrongly excluded columns in update queries after SET (update table set column = 'aa') etc.
Fix wrongly classified as columns names of the subqueries

Tests

All tests that were commented out or skipped now run and pass. I divided tests into smaller functions and multiple files to allow easier testing of selected sections of functionality if something fails. Coverage 100% and new tests for new functionalities.

So basically it should close all issues but #35 and at the same time introduces a lot of improvements etc.

Open issues

Now I also want to introduce column aliases but got stuck here a little as some issues require clarification.

Column aliases

As of now when you have a query like:
Select column1 as myname from table both column1 and myname go into the columns, while they are the same column.
I believe aliases like this should be resolved back to the most base column, and aliases like myname should not end up in columns but I need your opinion on this. That also leaves cases like select count(*) as counter, should counter be included in columns?

Motivation

My use case is that I need all real (and only real) database columns used in a parsed query to later reflect used databases and check column types. That means that (in perfect world) I would get:

recursively resolve column names so select col1 as myname from table should return only table.col1 in columns
recursively resolve names of columns from subqueries so select sub.test1 from (select aa as test1 from table) sub should return columns = ["table.aa"] as sub.test1 is a subquery alias and test1 is alias of aa.
try to assign table for each column even if prefix is not set so: select aa.col1, col2 from table aa should return ["table.col1", "table.col2"] and there should be a way of returning column with table in a dict. Of course if table cannot be resolved (like select aa, bb from tab1, tab2` where you cannot tell which column is from which table both column should return a list of ["tab1", "tab2"]

I can already get and extract subqueries but would need column aliases (and later optionally best match assigned tables to columns). But for that I need confirmation from you that I think right and aliases of columns should not end up in columns. Or alternatively there should be a way to get only aliases (as they are in end result of the query i.e. in select) or only the real database columns.

In my opinion the end product is much more concise, cleaner and easier to maintain, but it's basically re-written from scratch so let me know what you think! :) I updated readme so best to start there, rest is in tests since you do not have documentation.

Resolves #103, resolves #83, resolves #98, resolves #78 and resolves #120.

macbre · 2021-04-27T18:05:57Z

Wow, this is massive! Thanks, @collerek. I'll take a deeper look into your changes in the following days.

macbre · 2021-04-29T10:33:38Z

My use case is that I need all real (and only real) database columns used in a parsed query to later reflect used databases and check column types. That means that (in perfect world) I would get:

That's my idea behind aliases handling too. I'd like sql-metadat to return real columns that are used in the query, to allow other tools to analyse which columns from the real tables are used.

Hence, select count(*) as counter should not add any column to the list - counter is an alias of an aggregate function.

It's a more complex topic, so let's move the discussion to a different PR / issue and continue with the refactoring / new API here.

In my opinion the end product is much more concise, cleaner and easier to maintain, but it's basically re-written from scratch so let me know what you think! :) I updated readme so best to start there, rest is in tests since you do not have documentation.

I do agree! Thank you very much!

Co-authored-by: Maciej Brencz <maciej.brencz@gmail.com>

…x aliases test, remove double black from pyproject.toml, fix date function format in complex query in sql file

macbre · 2021-04-29T15:26:05Z

You want me to change the names or you will fix that?

Please do so.

Co-authored-by: Maciej Brencz <maciej.brencz@gmail.com>

macbre · 2021-04-29T15:28:03Z

Wow! That was quite a journey with code reviewing your changes, @collerek. Let's address the remaining open comments and we're ready to merge this one.

Once again - dzięki 🙂 🇵🇱

collerek · 2021-04-29T15:30:53Z

Hope that's not due to the crappy quality of my code 🤣

Polecam się na przyszłość i dzięki za super libkę 😉 🇵🇱

…tests and readme

macbre · 2021-04-29T15:36:48Z

Hope that's not due to the crappy quality of my code

Not at all.

collerek · 2021-04-29T15:59:14Z

Any reason you didn't mark #112, #85 and #97 as resolved?
Need more tests?

Let me know what are the next steps as I would like to help with column aliases if I can as this is my final goal :)

macbre · 2021-04-29T16:00:55Z

Any reason you didn't mark #112, #85 and #97 as resolved? Need more tests?

I just added this small comment to mark these additional issues as resolved once this PR is merged (via "resolved #..." GH feature). All is good. Sorry for the confusion here.

macbre · 2021-04-29T16:50:08Z

Let me know what are the next steps as I would like to help with column aliases if I can as this is my final goal :)

Let's tackle aliases in the next PR, ok? I'll complete the review and merge this one tomorrow.

collerek · 2021-04-29T16:54:05Z

Let's tackle aliases in the next PR, ok? I'll complete the review and merge this one tomorrow.

Sure, that's what I had in mind, saw that you added more issues to v.2.0 milestone that's why I'm asking if you are merging this one or wait for some other stuff.

We still have two new issues coming from this one (ability to pass file (or file path?) and refactoring token logic into token init, that's why I'm asking what is the plan, no rush or whatever, just curious :)

Co-authored-by: Maciej Brencz <maciej.brencz@gmail.com>

collerek · 2021-04-30T09:04:11Z

I just realized that removing comments in _preprocess in init is not really necessary and since we use sqlparse now in remove_comments it parses the query two times which hits performance, maybe we should revert and remove the step if it's working without it anyway (more tokens in parsed query with comments but I think it's still quicker than removing comments)?

Can you check performance on your machine? (as it is now and without step 3 in preprocess)

Co-authored-by: Maciej Brencz <maciej.brencz@gmail.com>

macbre · 2021-04-30T09:09:22Z

I just realized that removing comments in _preprocess in init is not really necessary and since we use sqlparse now in remove_comments it parses the query two times which hits performance, maybe we should revert and remove the step if it's working without it anyway (more tokens in parsed query with comments but I think it's still quicker than removing comments)?

Sure, please do that and add a test case in test_preprocessing with a query with a comment. This will make sure that we keep the behaviour of pre-processing consistent and don't break our contract.

sql-metadata user can still call parser.without_comments to get the query without comments.

macbre · 2021-04-30T09:20:15Z

+    # normalize newlines
+    assert (
+        Parser("SELECT /*my random comment*/ foo, id FROM `db`.`test`").query
+        == "SELECT /*my random comment*/ foo, id FROM db.test"
+    )
+


collerek · 2021-04-30T09:20:57Z

I just realized that removing comments in _preprocess in init is not really necessary and since we use sqlparse now in remove_comments it parses the query two times which hits performance, maybe we should revert and remove the step if it's working without it anyway (more tokens in parsed query with comments but I think it's still quicker than removing comments)?

Sure, please do that and add a test case in test_preprocessing with a query with a comment. This will make sure that we keep the behaviour of pre-processing consistent and don't break our contract.

sql-metadata user can still call parser.without_comments to get the query without comments.

I reverted the step 3 and added test in preprocess but since we don't remove it there is not much to test 😅
I added this step in one of the recent commits with switching to sqlparse, it was not there in old parser and in new parser just few commits back.

macbre · 2021-04-30T09:23:09Z

We're ready to merge once CI passes 🙂 Thanks for your great work here, @collerek.

Looking forward to your next PRs and ideas. Have a good one.

collerek added 7 commits April 22, 2021 20:51

first version passing all tests

8fc0aaa

simplified update_table part1

e5f3d9d

fix comments, add values, add values_dict, add columns_dict

79d678d

update readme

93179bb

add subqueries names support

88efae8

add extracting subqueries

84dac31

add test from macbre#120

01bafa2

macbre reviewed Apr 27, 2021

View reviewed changes

Comment thread pyproject.toml Outdated

macbre added enhancement internals labels Apr 27, 2021

macbre added this to the v2.0 milestone Apr 27, 2021