tests: database_types dual-use for benchmarks #67
Conversation
data_diff/database.py
def _query(self, sql_code: str) -> list:
    "Uses the standard SQL cursor interface"
    return _query_conn(self._conn, sql_code)
    c = self._conn.cursor()
If this is Presto-specific, it should only be used for Presto
I'm not sure I understand what you mean 👀 This is on the Presto driver
| f"size: {table2.max_key-table1.min_key}" | ||
| ) | ||
|
|
||
| # The entire segment wasn't below the threshold, but the next set of |
This belongs in a separate PR
Sure. The reason I included it here is that it's what makes the 'download' a fair benchmark comparison: it splits the work into multiple segments and processes them in threads.
| "2022-01-01 15:10:05.009900", | ||
| ], | ||
| "float": [0.0, 0.1, 0.10, 10.0, 100.98], | ||
| "int": IntFaker(N_SAMPLES), |
This feels like a step backwards. How can we be sure that the "faker" creates all the edge cases that we are interested in?
I definitely don't think this is a step backwards. It allows us to generate millions of samples predictably when we set N_SAMPLES higher for more comprehensive testing (and benchmarking), which is far more thorough than anything we can type out manually.
It generates plenty of edge-cases naturally by creating lots of 9s, 8s, etc. in the first bit:
['2000-01-01 00:00:03.000571', '2000-01-01 00:00:06.001142', '2000-01-01 00:00:09.001713', '2000-01-01 00:00:12.002284', '2000-01-01 00:00:15.002855', '2000-01-01 00:00:18.003426', '2000-01-01 00:00:21.003997', '2000-01-01 00:00:24.004568', '2000-01-01 00:00:27.005139', '2000-01-01 00:00:30.005710', '2000-01-01 00:00:33.006281', '2000-01-01 00:00:36.006852', '2000-01-01 00:00:39.007423', '2000-01-01 00:00:42.007994', '2000-01-01 00:00:45.008565', '2000-01-01 00:00:48.009136', '2000-01-01 00:00:51.009707', '2000-01-01 00:00:54.010278', '2000-01-01 00:00:57.010849', '2000-01-01 00:01:00.011420', '2000-01-01 00:01:03.011991', '2000-01-01 00:01:06.012562', '2000-01-01 00:01:09.013133', '2000-01-01 00:01:12.013704', '2000-01-01 00:01:15.014275', '2000-01-01 00:01:18.014846', '2000-01-01 00:01:21.015417', '2000-01-01 00:01:24.015988', '2000-01-01 00:01:27.016559', '2000-01-01 00:01:30.017130', '2000-01-01 00:01:33.017701', '2000-01-01 00:01:36.018272', '2000-01-01 00:01:39.018843', '2000-01-01 00:01:42.019414', '2000-01-01 00:01:45.019985', '2000-01-01 00:01:48.020556', '2000-01-01 00:01:51.021127', '2000-01-01 00:01:54.021698', '2000-01-01 00:01:57.022269', '2000-01-01 00:02:00.022840', '2000-01-01 00:02:03.023411', '2000-01-01 00:02:06.023982', '2000-01-01 00:02:09.024553', '2000-01-01 00:02:12.025124', '2000-01-01 00:02:15.025695', '2000-01-01 00:02:18.026266', '2000-01-01 00:02:21.026837', '2000-01-01 00:02:24.027408', '2000-01-01 00:02:27.027979', '2000-01-01 00:02:30.028550']
But there are edge cases we already know are problematic. It doesn't make sense to throw them away and hope we get them back through an equation.
OK. I can seed the first M elements of the faker with the existing hard-coded values.
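Roughly along these lines (a sketch only; EdgeCaseFaker, INT_EDGE_CASES and the filler values are illustrative placeholders, not the actual test code):

```python
# Illustrative sketch: serve the hand-picked edge cases first, then fall back
# to deterministic generation so millions of rows stay cheap and reproducible.
INT_EDGE_CASES = [0, 1, -1, 2**31 - 1, 2**63 - 1]  # placeholder values

class EdgeCaseFaker:
    def __init__(self, n_samples, edge_cases=INT_EDGE_CASES):
        self.n_samples = n_samples
        self.edge_cases = edge_cases

    def __iter__(self):
        # The first M values are the known-problematic hard-coded ones.
        yield from self.edge_cases[: self.n_samples]
        # The rest are generated predictably, like IntFaker does today.
        for i in range(len(self.edge_cases), self.n_samples):
            yield i * 3  # deterministic filler, not the real faker formula

print(list(EdgeCaseFaker(8)))  # edge cases first, then generated values
```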
tests/test_database_types.py
) in source_type_categories.items():  # int, datetime, ..
    for source_type in source_types:
        for target_type in target_type_categories[type_category]:
        for target_type in target_type_categories.get(type_category, []):
We'll never know if a typo caused some of the tests to not run.
Fair! The reason I had it was to be able to comment something out quickly, but I see now that the risk isn't worth it. Agreed.
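To illustrate the failure mode (a standalone example, not the test-suite code):

```python
# A typo'd category key behaves very differently under .get() vs plain indexing.
target_type_categories = {"int": ["int", "bigint"], "datetime": ["timestamp"]}

# .get() silently returns the default, so the inner loop runs zero times
# and the tests for that category just never execute:
assert target_type_categories.get("datetme", []) == []

# Plain indexing fails loudly on the same typo, which is the behaviour we want:
try:
    target_type_categories["datetme"]
except KeyError:
    print("typo caught immediately")
```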
tests/test_database_types.py
dst_table = dst_conn.quote(".".join(dst_table_path))

# If you want clean house from e.g. changing fakers.
src_conn.query(f"DROP TABLE IF EXISTS {src_table}", None)
Oracle doesn't support "DROP TABLE IF EXISTS"
I'm actually going to remove/comment out these lines. The whole idea is to keep these tables around so we can re-run benchmarks without rebuilding millions of rows.
See my comment about old revisions.
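For reference, if the drop does stay, one portable way around the Oracle limitation is to fall back to a plain DROP and tolerate the failure. A rough sketch, assuming conn.query(sql, None) raises on SQL errors (not the PR's actual code):

```python
# Sketch only: Oracle has no "DROP TABLE IF EXISTS", so try the portable form
# first and fall back to a bare DROP, swallowing the "table does not exist" error.
def drop_table_if_exists(conn, table):
    try:
        conn.query(f"DROP TABLE IF EXISTS {table}", None)
    except Exception:
        try:
            conn.query(f"DROP TABLE {table}", None)
        except Exception:
            pass  # table didn't exist, which is fine
```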
self.table = TableSegment(self.src_conn, src_table_path, "id", None, ("col",), quote_columns=False)
self.table2 = TableSegment(self.dst_conn, dst_table_path, "id", None, ("col",), quote_columns=False)
start = time.time()
already_seeded = self._create_table(dst_conn, dst_table, target_type)
We should be dropping the tables when we're done, no? So this exists only in case the tests fail. And what if it's an old table with different values? Better to drop the tables and re-create everything.
We keep the tables around on purpose for performance with large N_SAMPLES. Only if you change the fakers would you need to drop them.
But that's exactly what's going to happen if I check out some old revision in git. And I won't understand why the tests are behaving strangely.
OK, that's fair. I'm thinking I'll add a benchmarking "mode" via an environment variable that sets the sample count. We need to select a small subset of the tests for this either way. I'll comment it out for now to get the numbers I need.
The biggest benefit of this PR is that running benchmarks through this suite is possible at all.
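Roughly something like this (the variable names are placeholders, not the final implementation):

```python
import os

# Hypothetical benchmarking switch: CI uses the small default sample size,
# while a benchmark run exports N_SAMPLES with a much larger value.
DEFAULT_N_SAMPLES = 50
N_SAMPLES = int(os.environ.get("N_SAMPLES", DEFAULT_N_SAMPLES))
BENCHMARK = N_SAMPLES > DEFAULT_N_SAMPLES  # e.g. only keep tables around here
```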
tests/test_database_types.py
count2_duration = round(time.time() - start, 2)

# Ensure we actually checksum even for few samples!
ch_threshold = 5_000 if N_SAMPLES > DEFAULT_N_SAMPLES else 3
tests/test_database_types.py
self.assertEqual(len(sample_values), differ.stats.get("rows_downloaded", 0))
download_duration = round(time.time() - start, 2)

print(
I thought we agreed tests don't print. Use the logger.
Yep, for sure, just a debugging relic! Will log to debug!
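i.e. roughly:

```python
import logging

logger = logging.getLogger(__name__)

# Placeholder timings; in the test these come from the surrounding code.
count2_duration, download_duration = 0.42, 1.37

# Log at debug level instead of printing, so normal runs stay quiet and a
# benchmark run can enable it with logging.basicConfig(level=logging.DEBUG).
logger.debug("count2: %ss, download: %ss", count2_duration, download_duration)
```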
tests/test_database_types.py
)

def _create_table(self, conn, table, type) -> bool:
    conn.query(f"CREATE TABLE IF NOT EXISTS {table}(id int, col {type})", None)
Not all dialects support IF NOT EXISTS here.
Ah, is it Oracle? I've run this on all but Redshift + Oracle.
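A hedged sketch of how the create could tolerate dialects without IF NOT EXISTS, again assuming conn.query(sql, None) raises on SQL errors (not the PR's actual code):

```python
# Sketch only: try the portable form, then fall back to a plain CREATE and
# treat an "already exists" failure as success.
def create_table(conn, table, col_type):
    try:
        conn.query(f"CREATE TABLE IF NOT EXISTS {table}(id int, col {col_type})", None)
    except Exception:
        try:
            conn.query(f"CREATE TABLE {table}(id int, col {col_type})", None)
        except Exception:
            pass  # assume the table already exists
```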
tests/test_database_types.py
return already_seeded

def _insert_to_table(self, conn, table, type, values):
    conn.query(f"TRUNCATE {table}", None)
Did you make sure TRUNCATE exists in all dialects?
It's worked on everything I've tried so far. It's fairly standard SQL... I mean, until we discover it's not.
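If a dialect does turn out to lack it, DELETE is the obvious fallback; a sketch under the same assumption that conn.query(sql, None) raises on SQL errors:

```python
# Sketch only: prefer TRUNCATE TABLE (the spelling Oracle requires), and fall
# back to DELETE for any dialect where TRUNCATE is unavailable.
def clear_table(conn, table):
    try:
        conn.query(f"TRUNCATE TABLE {table}", None)
    except Exception:
        conn.query(f"DELETE FROM {table}", None)
```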
data_diff/database.py
if sql_code.lower().startswith("select"):
    return c.fetchall()
# Required for the query to actually run 🤯
if re.match(r"\A(insert|create)", sql_code, re.IGNORECASE):
\A is not necessary. Python's match() doesn't scan the string.
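A quick standalone check of that:

```python
import re

# re.match() only tries the pattern at position 0, so the explicit \A anchor
# changes nothing:
assert re.match(r"(insert|create)", "create table t", re.IGNORECASE)
assert re.match(r"\A(insert|create)", "create table t", re.IGNORECASE)
assert re.match(r"(insert|create)", "select 1", re.IGNORECASE) is None
```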
03a4a06 to 86c3866
3a53501 to 99dc0ee
presto: fix test suite
diff_tables: fix not double recursing
tests: fix presto create indexes, etc., haven't tested all dbs
99dc0ee to e973e04
Most is addressed in #125 and
@erezsh The idea is to dual-use this test suite for benchmarking too, by upping N_SAMPLES. However, you'll want to run this for individual tests with a high N_SAMPLES... so it's a bit annoying, but workable. The plan is to use this to create the graphs.