[fix](streaming-job) drop neighbour-table rows leaked by JDBC LIKE wildcards in JdbcPostgreSQLClient #63402
…ldcards in JdbcPostgreSQLClient.getJdbcColumnsInfo
…Client and revert out-of-scope decoy added to mysql case
PR approved by at least one committer and no changes requested.
PR approved by anyone and no changes requested.
run buildall
/review
I found one correctness issue in the PR's main fix. The approach is directionally correct for table-name wildcard leakage, but the same JDBC API also treats the schema argument as a LIKE pattern, so the current filter is incomplete for Postgres schemas containing _ or %.
Critical checkpoint conclusions:
- Goal/test: The PR targets leaked columns from PostgreSQL JDBC LIKE matching and adds a regression for table-name `_`; it does not cover schema-pattern leakage, so the goal is only partially satisfied.
- Scope: The change is small and focused.
- Concurrency/lifecycle/config/compatibility: No new concurrency, special lifecycle, config, storage format, or FE-BE protocol compatibility concerns found.
- Parallel paths: The modified PostgreSQL columns path is the relevant streaming-job path described by the PR; primary-key lookup uses a different JDBC call.
- Tests: The added regression covers the table-name wildcard case but misses the schema wildcard case.
- Observability/performance/transactions: No material concerns found.
User focus: No additional user-provided review focus was specified.
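To make the wildcard concern concrete, here is a minimal sketch of how an unescaped LIKE pattern treats `_` and `%`. `LikePatternDemo` and `likeMatches` are hypothetical names used only for illustration; they emulate LIKE semantics in plain Java and are not part of Doris or of any JDBC driver. The schema names in the demo are also made up.

```java
import java.util.regex.Pattern;

/** Hypothetical helper: emulates SQL LIKE matching with no escape character. */
final class LikePatternDemo {

    // '_' matches exactly one character, '%' matches any run of characters,
    // everything else must match literally.
    static boolean likeMatches(String pattern, String candidate) {
        StringBuilder regex = new StringBuilder();
        for (char c : pattern.toCharArray()) {
            if (c == '_') {
                regex.append('.');
            } else if (c == '%') {
                regex.append(".*");
            } else {
                regex.append(Pattern.quote(String.valueOf(c)));
            }
        }
        return Pattern.compile(regex.toString(), Pattern.DOTALL)
                .matcher(candidate)
                .matches();
    }

    public static void main(String[] args) {
        // Table-name leakage: the requested name also matches the decoy.
        System.out.println(likeMatches("user_info_pg_normal1", "userXinfo_pg_normal1"));
        // Schema-name leakage works the same way (schema names are made up).
        System.out.println(likeMatches("cdc_test", "cdcXtest"));
        // An unrelated table does not match.
        System.out.println(likeMatches("user_info_pg_normal1", "orders"));
    }
}
```

Because the schema argument is matched the same way as the table argument, a filter that checks only the table name would still admit rows from a look-alike schema.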
…st_streaming_postgres_job.groovy decoy block)
…IKE wildcards in JdbcPostgreSQLClient.getJdbcColumnsInfo
/review
run buildall
TPC-H: Total hot run time: 31614 ms
TPC-DS: Total hot run time: 172798 ms
I found one remaining correctness gap in a parallel PostgreSQL JDBC metadata path. The new exact schema/table filtering fixes fe-core's JdbcPostgreSQLClient, and the added regression covers the streaming-job path including the schema-wildcard case that was already raised in the existing thread. However, fe/fe-connector/fe-connector-jdbc/src/main/java/org/apache/doris/connector/jdbc/client/JdbcPostgreSQLConnectorClient.java is an adapted PostgreSQL client and still builds schemas directly from DatabaseMetaData.getColumns(cat, remoteDbName, remoteTableName, null) without exact TABLE_SCHEM/TABLE_NAME filtering. Users of that connector path can still get columns mixed from wildcard-matching schemas/tables.
Critical checkpoint conclusions:
- Goal: prevent JDBC LIKE wildcard leakage in PostgreSQL column discovery. Partially accomplished; legacy FE datasource path is fixed, connector path remains vulnerable.
- Scope/focus: the code change itself is small and clear, but incomplete across functionally parallel code paths.
- Concurrency/lifecycle/config/compatibility: no new concurrency, lifecycle, config, storage-format, or FE-BE protocol concerns found.
- Parallel paths: issue found in `JdbcPostgreSQLConnectorClient.getJdbcColumnsInfo`.
- Special checks: exact filtering is appropriate at the `ResultSet` consumption point.
- Tests: regression covers the streaming-job path and both table/schema wildcard decoys; no coverage for the `fe-connector` PostgreSQL path.
- Observability/transactions/persistence/data writes: not applicable to this metadata-only change.
- Performance: the added per-row string checks are negligible relative to JDBC metadata IO.
- User focus: no additional user-provided review focus was specified.
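The exact filtering discussed in this review could look roughly like the sketch below. It is not the Doris implementation: `ExactMetadataFilter`, `ColumnRow`, and `keepExact` are hypothetical names, and rows are modelled as plain objects so the example runs without a database. With a real `ResultSet` from `DatabaseMetaData.getColumns`, the same comparison would read the standard `TABLE_SCHEM` and `TABLE_NAME` columns. The schema names are invented for the demo.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of consumer-side exact filtering; not the Doris code. */
final class ExactMetadataFilter {

    /** Stand-in for one getColumns row (TABLE_SCHEM, TABLE_NAME, COLUMN_NAME). */
    static final class ColumnRow {
        final String tableSchem;
        final String tableName;
        final String columnName;

        ColumnRow(String tableSchem, String tableName, String columnName) {
            this.tableSchem = tableSchem;
            this.tableName = tableName;
            this.columnName = columnName;
        }
    }

    // Keep only rows whose schema AND table equal the requested names exactly,
    // discarding anything the LIKE pattern pulled in from neighbours.
    static List<ColumnRow> keepExact(List<ColumnRow> rows, String schema, String table) {
        List<ColumnRow> kept = new ArrayList<>();
        for (ColumnRow row : rows) {
            if (schema.equals(row.tableSchem) && table.equals(row.tableName)) {
                kept.add(row);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<ColumnRow> rows = List.of(
                new ColumnRow("cdc_test", "user_info_pg_normal1", "name"),
                new ColumnRow("cdc_test", "user_info_pg_normal1", "age"),
                new ColumnRow("cdc_test", "userXinfo_pg_normal1", "weight"),
                new ColumnRow("cdcXtest", "user_info_pg_normal1", "extra"));
        for (ColumnRow row : keepExact(rows, "cdc_test", "user_info_pg_normal1")) {
            System.out.println(row.columnName);
        }
    }
}
```

Checking both fields in one place is what lets the same few lines be reused in a parallel client such as the connector path named above.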
#63404 #63471 #63480 #63490 #63514 #63618 (#63812)
Cherry-picked from:
- #63079 [improve](streaming-job) async chunk splitting for cdc source job
- #63404 [test](streaming-job) refine cdc data-type and boundary regression cases for mysql/pg
- #63471 [regression-test](streaming-job) add cdc cases (composite/concurrent-dml/id-gap/decimal/datetime pk) and fix split-bound java.time deserialize
- #63480 [fix](streaming-job) misc fixes for typo/log/validation/visibility
- #63402 [fix](streaming-job) drop neighbour-table rows leaked by JDBC LIKE wildcards in JdbcPostgreSQLClient
- #63514 [regression-test](streaming-job) add cdc operational cases for offset modes and pg slot lifecycle
- #63618 [fix](streaming-job) fix postgres historical-date timestamp handling in cdc-client
- #63490 [improve](streaming-job) support user-specified mysql server_id with per-reader assignment
…ldcards in JdbcPostgreSQLClient (apache#63402)

### What problem does this PR solve?

`JdbcPostgreSQLClient.getJdbcColumnsInfo` calls `DatabaseMetaData.getColumns(catalog, schemaPattern, tableNamePattern, columnNamePattern)`. Per the JDBC spec the 3rd argument is a **SQL LIKE pattern**, so literal `_` / `%` characters in the requested table name are interpreted as wildcards by the Postgres driver. When a streaming job is created with `include_tables = "user_info_pg_normal1"` and a neighbour table like `userXinfo_pg_normal1` happens to coexist in the same schema, the metadata query returns columns from **both** tables. The combined result then trips `CREATE TABLE` on the Doris side with errors such as `errCode = 2, detailMessage = Duplicate column name 'name'`, or pollutes the auto-created table schema with stray columns.

The repro is trivial: in the same Postgres schema create

- `user_info_pg_normal1(name varchar, age int2)`: the table we want to capture
- `userXinfo_pg_normal1(name varchar, weight float8)`: a decoy whose name only differs from the target by a single character that `_` matches

then run `CREATE JOB ... include_tables = "user_info_pg_normal1"`. Without the fix the schema fetched for the target leaks `weight` (or fails with `Duplicate column name 'name'`, depending on column order).

Fix: after fetching the `ResultSet`, drop rows whose `TABLE_NAME` does not exactly equal the requested `remoteTableName`. We deliberately do **not** escape `_` / `%` at the source: relying on `DatabaseMetaData.getSearchStringEscape()` is driver-version dependent (older Oracle drivers don't honour escape sequences in `getTables`), while filtering on the consumer side is deterministic and driver-agnostic.

Scope:

- Only `JdbcPostgreSQLClient` is patched. This is the path used by Postgres streaming jobs (the failing case). MySQL streaming jobs were checked against the same decoy pattern and do not reproduce the bug because MySQL Connector/J doesn't pull neighbour rows here in practice, so `JdbcMySQLClient` is left untouched in this PR.
- The JDBC catalog path lives in a separate module (`fe-connector-jdbc/.../JdbcConnectorClient`) and is **not** part of this PR. It already does partial escape but intentionally skips `_` / `%` for driver-compatibility reasons; a follow-up can apply the same after-the-fact filter there.
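For contrast, the source-side escaping that this PR chose not to rely on would look something like the sketch below. `LikeEscapeDemo` and `escapeLikeWildcards` are hypothetical names, and the backslash passed in `main` is only an assumed value for what `DatabaseMetaData.getSearchStringEscape()` might return; the real value depends on the driver, which is the portability concern raised above.

```java
/** Hypothetical sketch of source-side escaping; the PR deliberately avoids this approach. */
final class LikeEscapeDemo {

    // Prefix every LIKE wildcard, and every occurrence of the escape string itself,
    // with the driver-reported escape string.
    static String escapeLikeWildcards(String name, String escape) {
        if (escape == null || escape.isEmpty()) {
            return name; // the driver reports no escape string, so nothing can be escaped
        }
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < name.length()) {
            if (name.startsWith(escape, i)) {
                out.append(escape).append(escape);
                i += escape.length();
            } else {
                char c = name.charAt(i);
                if (c == '_' || c == '%') {
                    out.append(escape);
                }
                out.append(c);
                i++;
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Assumption: the driver's escape string is a single backslash.
        System.out.println(escapeLikeWildcards("user_info_pg_normal1", "\\"));
    }
}
```

Whether the escaped pattern is then honoured is up to each driver and version, whereas an exact comparison on the returned rows behaves the same everywhere.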
What problem does this PR solve?
Issue Number: close #xxx
Related PR: #xxx
Problem Summary:
`JdbcPostgreSQLClient.getJdbcColumnsInfo` calls `DatabaseMetaData.getColumns(catalog, schemaPattern, tableNamePattern, columnNamePattern)`. Per the JDBC spec the 3rd argument is a SQL LIKE pattern, so literal `_` / `%` characters in the requested table name are interpreted as wildcards by the Postgres driver. When a streaming job is created with `include_tables = "user_info_pg_normal1"` and a neighbour table like `userXinfo_pg_normal1` happens to coexist in the same schema, the metadata query returns columns from both tables. The combined result then trips `CREATE TABLE` on the Doris side with errors such as `errCode = 2, detailMessage = Duplicate column name 'name'`, or pollutes the auto-created table schema with stray columns.

The repro is trivial: in the same Postgres schema create

- `user_info_pg_normal1(name varchar, age int2)`: the table we want to capture
- `userXinfo_pg_normal1(name varchar, weight float8)`: a decoy whose name only differs from the target by a single character that `_` matches

then run `CREATE JOB ... include_tables = "user_info_pg_normal1"`. Without the fix the schema fetched for the target leaks `weight` (or fails with `Duplicate column name 'name'`, depending on column order).

Fix: after fetching the `ResultSet`, drop rows whose `TABLE_NAME` does not exactly equal the requested `remoteTableName`. We deliberately do not escape `_` / `%` at the source: relying on `DatabaseMetaData.getSearchStringEscape()` is driver-version dependent (older Oracle drivers don't honour escape sequences in `getTables`), while filtering on the consumer side is deterministic and driver-agnostic.

Scope:

- Only `JdbcPostgreSQLClient` is patched. This is the path used by Postgres streaming jobs (the failing case). MySQL streaming jobs were checked against the same decoy pattern and do not reproduce the bug because MySQL Connector/J doesn't pull neighbour rows here in practice, so `JdbcMySQLClient` is left untouched in this PR.
- The JDBC catalog path lives in a separate module (`fe-connector-jdbc/.../JdbcConnectorClient`) and is not part of this PR. It already does partial escape but intentionally skips `_` / `%` for driver-compatibility reasons; a follow-up can apply the same after-the-fact filter there.

Release note
None
Check List (For Author)
Test: A decoy table `userXinfo_pg_normal1` (with a different column shape: `weight float8`) is added to `test_streaming_postgres_job.groovy`, plus an assert guard that `createTalInfo` does not contain `weight`. Without the fix the case fails (either `Duplicate column name` during `CREATE TABLE`, or the `weight` assert trips). The same decoy is mirrored into `test_streaming_mysql_job.groovy` as a baseline so any future regression in MySQL Connector/J's behaviour is caught immediately.
Behavior changed:
Does this need documentation?
Check List (For Reviewer who merge this PR)