[fix](streaming-job) drop neighbour-table rows leaked by JDBC LIKE wildcards in JdbcPostgreSQLClient #63402

Merged
JNSimba merged 4 commits into apache:master from JNSimba:fix/jdbc-escape-like-wildcards on May 26, 2026

@JNSimba

@JNSimba JNSimba commented May 19, 2026

What problem does this PR solve?

Issue Number: close #xxx

Related PR: #xxx

Problem Summary:

JdbcPostgreSQLClient.getJdbcColumnsInfo calls DatabaseMetaData.getColumns(catalog, schemaPattern, tableNamePattern, columnNamePattern). Per the JDBC spec the 3rd argument is a SQL LIKE pattern, so literal _ / % characters in the requested table name are interpreted as wildcards by the Postgres driver. When a streaming job is created with include_tables = "user_info_pg_normal1" and a neighbour table like userXinfo_pg_normal1 happens to coexist in the same schema, the metadata query returns columns from both tables. The combined result then trips CREATE TABLE on the Doris side with errors such as errCode = 2, detailMessage = Duplicate column name 'name', or pollutes the auto-created table schema with stray columns.
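To make the wildcard behaviour concrete, here is a minimal sketch that emulates SQL LIKE matching with a regex. It is not driver code: `LikeWildcardDemo`, `likeToRegex` and `likeMatches` are hypothetical names for illustration only, and escape-character handling is deliberately left out.

```java
import java.util.regex.Pattern;

final class LikeWildcardDemo {
    /** Translates a SQL LIKE pattern into a regex: '_' is any one char, '%' is any run of chars. */
    static Pattern likeToRegex(String like) {
        StringBuilder sb = new StringBuilder();
        for (char c : like.toCharArray()) {
            if (c == '_') {
                sb.append('.');
            } else if (c == '%') {
                sb.append(".*");
            } else {
                // Quote everything else so regex metacharacters stay literal.
                sb.append(Pattern.quote(String.valueOf(c)));
            }
        }
        return Pattern.compile(sb.toString(), Pattern.DOTALL);
    }

    /** Returns true when the whole candidate name matches the LIKE pattern. */
    static boolean likeMatches(String likePattern, String candidate) {
        return likeToRegex(likePattern).matcher(candidate).matches();
    }

    public static void main(String[] args) {
        String requested = "user_info_pg_normal1";
        // The requested name matches itself, but also the decoy, because '_' accepts 'X'.
        System.out.println(likeMatches(requested, "user_info_pg_normal1"));
        System.out.println(likeMatches(requested, "userXinfo_pg_normal1"));
    }
}
```

Under these assumptions the table name passed as tableNamePattern matches both the target and the decoy, which is why the metadata query returns columns from both tables.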

The repro is trivial: in the same Postgres schema create

  • user_info_pg_normal1(name varchar, age int2) — the table we want to capture
  • userXinfo_pg_normal1(name varchar, weight float8) — a decoy whose name only differs from the target by a single character that _ matches

then run CREATE JOB ... include_tables = "user_info_pg_normal1". Without the fix the schema fetched for the target leaks weight (or CREATE TABLE fails with Duplicate column name 'name', depending on column order).

Fix: after fetching the ResultSet, drop rows whose TABLE_NAME does not exactly equal the requested remoteTableName. We deliberately do not escape _ / % at the source — relying on DatabaseMetaData.getSearchStringEscape() is driver-version dependent (older Oracle drivers don't honour escape sequences in getTables), while filtering on the consumer side is deterministic and driver-agnostic.
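A rough sketch of the consumer-side filter described above, assuming a plain JDBC `Connection`. This is not the actual patch: `ExactTableFilterSketch`, `isRequestedTable` and `columnNames` are hypothetical names, and only the standard `DatabaseMetaData.getColumns` result columns `TABLE_NAME` and `COLUMN_NAME` are used.

```java
import java.sql.Connection;
import java.sql.DatabaseMetaData;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.List;

final class ExactTableFilterSketch {
    /** Exact, case-sensitive comparison; no LIKE semantics involved. */
    static boolean isRequestedTable(String rowTableName, String requestedTable) {
        return requestedTable.equals(rowTableName);
    }

    /** Collects column names for one table, skipping rows pulled in by wildcard matching. */
    static List<String> columnNames(Connection conn, String catalog, String schema, String table)
            throws SQLException {
        DatabaseMetaData metaData = conn.getMetaData();
        List<String> columns = new ArrayList<>();
        try (ResultSet rs = metaData.getColumns(catalog, schema, table, null)) {
            while (rs.next()) {
                // The driver treats `table` as a LIKE pattern, so re-check the real name here.
                if (!isRequestedTable(rs.getString("TABLE_NAME"), table)) {
                    continue;
                }
                columns.add(rs.getString("COLUMN_NAME"));
            }
        }
        return columns;
    }
}
```

The comparison happens after the driver has answered, so it does not depend on how a given driver version treats escape sequences.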

Scope:

  • Only JdbcPostgreSQLClient is patched. This is the path used by Postgres streaming jobs (the failing case). MySQL streaming jobs were checked against the same decoy pattern and do not reproduce the bug because MySQL Connector/J doesn't pull neighbour rows here in practice — so JdbcMySQLClient is left untouched in this PR.
  • The JDBC catalog path lives in a separate module (fe-connector-jdbc/.../JdbcConnectorClient) and is not part of this PR. It already does partial escape but intentionally skips _ / % for driver-compatibility reasons; a follow-up can apply the same after-the-fact filter there.
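For contrast, the escape-at-the-source alternative that this PR decided against could look roughly like the sketch below. `LikeEscapeSketch` and `escapeLikeWildcards` are hypothetical names; the `escape` argument is assumed to be whatever `DatabaseMetaData.getSearchStringEscape()` returns for the driver in use, and the sketch assumes that string is neither `_` nor `%`.

```java
final class LikeEscapeSketch {
    /**
     * Prefixes LIKE wildcards in an identifier with the driver's search-string escape.
     * The escape string itself is doubled first so existing occurrences stay literal.
     */
    static String escapeLikeWildcards(String identifier, String escape) {
        return identifier
                .replace(escape, escape + escape)
                .replace("_", escape + "_")
                .replace("%", escape + "%");
    }

    public static void main(String[] args) {
        // With a backslash escape, each underscore in the name is prefixed with a backslash.
        System.out.println(escapeLikeWildcards("user_info_pg_normal1", "\\"));
    }
}
```

Whether the escaped pattern is honoured is up to the driver, which is the compatibility concern the description cites for preferring the after-the-fact filter.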

Release note

None

Check List (For Author)

  • Test
    • Regression test
    • Unit Test
    • Manual test (add detailed scripts or steps below)
    • No need to test or manual test. Explain why:
      • This is a refactor/code format and no logic has been changed.
      • Previous test can cover this change.
      • No code files have been changed.
      • Other reason

A decoy table userXinfo_pg_normal1 (with a different column shape: weight float8) is added to test_streaming_postgres_job.groovy, plus an assert !createTalInfo.contains(`weight`) guard. Without the fix the case fails (either Duplicate column name during CREATE TABLE, or the weight assert trips). The same decoy is mirrored into test_streaming_mysql_job.groovy as a baseline so any future regression in MySQL Connector/J's behaviour is caught immediately.

  • Behavior changed:

    • No.
    • Yes.
  • Does this need documentation?

    • No.
    • Yes.

Check List (For Reviewer who merge this PR)

  • Confirm the release note
  • Confirm test cases
  • Confirm document
  • Add branch pick label

…ldcards in JdbcPostgreSQLClient.getJdbcColumnsInfo
@hello-stephen

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

…Client and revert out-of-scope decoy added to mysql case
liaoxin01 previously approved these changes May 26, 2026

@liaoxin01 liaoxin01 left a comment

LGTM

@github-actions github-actions Bot added the approved label (Indicates a PR has been approved by one committer.) May 26, 2026
@github-actions

PR approved by at least one committer and no changes requested.

@github-actions

PR approved by anyone and no changes requested.

@JNSimba

JNSimba commented May 26, 2026

run buildall

@JNSimba

JNSimba commented May 26, 2026

/review

@github-actions github-actions Bot left a comment

I found one correctness issue in the PR's main fix. The approach is directionally correct for table-name wildcard leakage, but the same JDBC API also treats the schema argument as a LIKE pattern, so the current filter is incomplete for Postgres schemas containing _ or %.

Critical checkpoint conclusions:

  • Goal/test: The PR targets leaked columns from PostgreSQL JDBC LIKE matching and adds a regression for table-name _; it does not cover schema-pattern leakage, so the goal is only partially satisfied.
  • Scope: The change is small and focused.
  • Concurrency/lifecycle/config/compatibility: No new concurrency, special lifecycle, config, storage format, or FE-BE protocol compatibility concerns found.
  • Parallel paths: The modified PostgreSQL columns path is the relevant streaming-job path described by the PR; primary-key lookup uses a different JDBC call.
  • Tests: The added regression covers the table-name wildcard case but misses the schema wildcard case.
  • Observability/performance/transactions: No material concerns found.

User focus: No additional user-provided review focus was specified.
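The schema concern raised in this review can be sketched on in-memory rows. This is an illustration only, not the merged code: `SchemaAndTableFilterSketch`, `ColumnRow` and `keepExact` are hypothetical names, and the schema names `cdc_test` / `cdcXtest` in the usage are made up for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Objects;

final class SchemaAndTableFilterSketch {
    /** One getColumns row reduced to TABLE_SCHEM, TABLE_NAME and COLUMN_NAME. */
    static final class ColumnRow {
        final String schema;
        final String table;
        final String column;

        ColumnRow(String schema, String table, String column) {
            this.schema = schema;
            this.table = table;
            this.column = column;
        }
    }

    /** Keeps only columns whose schema AND table equal the requested names exactly. */
    static List<String> keepExact(List<ColumnRow> rows, String schema, String table) {
        List<String> columns = new ArrayList<>();
        for (ColumnRow row : rows) {
            if (Objects.equals(row.schema, schema) && Objects.equals(row.table, table)) {
                columns.add(row.column);
            }
        }
        return columns;
    }

    public static void main(String[] args) {
        List<ColumnRow> rows = Arrays.asList(
                new ColumnRow("cdc_test", "user_info", "name"),
                new ColumnRow("cdc_test", "user_info", "age"),
                new ColumnRow("cdcXtest", "user_info", "weight"),   // schema decoy
                new ColumnRow("cdc_test", "userXinfo", "weight"));  // table decoy
        System.out.println(keepExact(rows, "cdc_test", "user_info"));
    }
}
```

Filtering on the table name alone would keep the row from the decoy schema, which is the gap this review points out.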

…st_streaming_postgres_job.groovy decoy block)
@github-actions github-actions Bot removed the approved label (Indicates a PR has been approved by one committer.) May 26, 2026
…IKE wildcards in JdbcPostgreSQLClient.getJdbcColumnsInfo
@JNSimba

JNSimba commented May 26, 2026

/review

@JNSimba

JNSimba commented May 26, 2026

run buildall

@hello-stephen

TPC-H: Total hot run time: 31614 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit b9107947bb0fd67973c72d4792cd91618a3b2a1f, data reload: false

------ Round 1 ----------------------------------
orders	Doris	NULL	NULL	0	0	0	NULL	0	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	17685	4067	4010	4010
q2	
q3	10773	1376	851	851
q4	4690	475	354	354
q5	7627	2271	2096	2096
q6	261	176	144	144
q7	942	821	635	635
q8	9352	1780	1650	1650
q9	6500	4955	4881	4881
q10	6458	2252	1858	1858
q11	440	270	242	242
q12	688	430	295	295
q13	18212	3366	2787	2787
q14	265	254	235	235
q15	
q16	816	779	700	700
q17	883	894	875	875
q18	6947	5785	5556	5556
q19	1182	1292	1212	1212
q20	535	454	271	271
q21	5989	2782	2638	2638
q22	458	386	324	324
Total cold run time: 100703 ms
Total hot run time: 31614 ms

----- Round 2, with runtime_filter_mode=off -----
orders	Doris	NULL	NULL	150000000	42	6422171781	NULL	22778155	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	4776	4753	4976	4753
q2	
q3	4901	5274	4665	4665
q4	2155	2218	1440	1440
q5	4928	4746	4746	4746
q6	240	181	124	124
q7	1799	1978	1579	1579
q8	2453	1978	1961	1961
q9	7404	7504	7375	7375
q10	4807	4699	4253	4253
q11	543	391	358	358
q12	738	753	547	547
q13	2962	3351	2789	2789
q14	274	278	253	253
q15	
q16	684	707	614	614
q17	1302	1277	1263	1263
q18	7347	6734	6847	6734
q19	1151	1109	1109	1109
q20	2229	2219	1969	1969
q21	5343	4656	4528	4528
q22	521	482	405	405
Total cold run time: 56557 ms
Total hot run time: 51465 ms

@hello-stephen

TPC-DS: Total hot run time: 172798 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit b9107947bb0fd67973c72d4792cd91618a3b2a1f, data reload: false

query5	4317	675	518	518
query6	335	232	200	200
query7	4211	546	313	313
query8	330	242	222	222
query9	8849	4126	4129	4126
query10	460	350	306	306
query11	5841	2600	2230	2230
query12	187	130	127	127
query13	1332	612	465	465
query14	6170	5578	5256	5256
query14_1	4578	4571	4555	4555
query15	217	207	184	184
query16	983	452	460	452
query17	1068	757	622	622
query18	2456	503	376	376
query19	218	211	167	167
query20	142	144	133	133
query21	219	143	117	117
query22	13713	13719	13376	13376
query23	17430	16557	16268	16268
query23_1	16462	16398	16432	16398
query24	7459	1796	1324	1324
query24_1	1339	1327	1344	1327
query25	592	527	442	442
query26	1309	317	180	180
query27	2692	571	361	361
query28	4441	1979	2001	1979
query29	1051	668	522	522
query30	313	253	202	202
query31	1135	1092	989	989
query32	95	82	78	78
query33	567	395	309	309
query34	1176	1134	658	658
query35	820	809	689	689
query36	1420	1364	1245	1245
query37	162	117	88	88
query38	3239	3180	3113	3113
query39	936	940	909	909
query39_1	886	864	881	864
query40	244	153	131	131
query41	71	70	70	70
query42	120	113	118	113
query43	351	346	312	312
query44	
query45	218	212	205	205
query46	1119	1199	756	756
query47	2364	2426	2351	2351
query48	406	409	302	302
query49	640	504	383	383
query50	1004	354	261	261
query51	4345	4304	4231	4231
query52	103	105	95	95
query53	261	288	206	206
query54	316	274	257	257
query55	93	92	90	90
query56	310	309	292	292
query57	1442	1432	1357	1357
query58	305	275	270	270
query59	1597	1665	1473	1473
query60	334	327	310	310
query61	164	147	159	147
query62	692	649	599	599
query63	240	205	213	205
query64	2397	816	628	628
query65	
query66	1732	486	360	360
query67	29920	29845	29755	29755
query68	
query69	463	339	306	306
query70	1016	1079	997	997
query71	304	273	267	267
query72	3032	2660	2405	2405
query73	822	755	418	418
query74	5145	5031	4844	4844
query75	2707	2636	2319	2319
query76	2294	1147	781	781
query77	411	417	347	347
query78	12508	12392	11876	11876
query79	1523	1029	770	770
query80	1236	547	461	461
query81	506	292	243	243
query82	1449	155	125	125
query83	383	284	247	247
query84	287	142	112	112
query85	946	545	454	454
query86	435	335	334	334
query87	3496	3440	3247	3247
query88	3616	2735	2730	2730
query89	454	392	348	348
query90	1777	186	184	184
query91	179	167	137	137
query92	80	81	72	72
query93	1438	1516	883	883
query94	662	343	303	303
query95	697	399	356	356
query96	1125	775	339	339
query97	2724	2747	2575	2575
query98	245	229	231	229
query99	1184	1146	1052	1052
Total cold run time: 256058 ms
Total hot run time: 172798 ms

@github-actions github-actions Bot left a comment

I found one remaining correctness gap in a parallel PostgreSQL JDBC metadata path. The new exact schema/table filtering fixes fe-core's JdbcPostgreSQLClient, and the added regression covers the streaming-job path including the schema-wildcard case that was already raised in the existing thread. However, fe/fe-connector/fe-connector-jdbc/src/main/java/org/apache/doris/connector/jdbc/client/JdbcPostgreSQLConnectorClient.java is an adapted PostgreSQL client and still builds schemas directly from DatabaseMetaData.getColumns(cat, remoteDbName, remoteTableName, null) without exact TABLE_SCHEM/TABLE_NAME filtering. Users of that connector path can still get columns mixed from wildcard-matching schemas/tables.

Critical checkpoint conclusions:

  • Goal: prevent JDBC LIKE wildcard leakage in PostgreSQL column discovery. Partially accomplished; legacy FE datasource path is fixed, connector path remains vulnerable.
  • Scope/focus: the code change itself is small and clear, but incomplete across functionally parallel code paths.
  • Concurrency/lifecycle/config/compatibility: no new concurrency, lifecycle, config, storage-format, or FE-BE protocol concerns found.
  • Parallel paths: issue found in JdbcPostgreSQLConnectorClient.getJdbcColumnsInfo.
  • Special checks: exact filtering is appropriate at the ResultSet consumption point.
  • Tests: regression covers the streaming-job path and both table/schema wildcard decoys; no coverage for the fe-connector PostgreSQL path.
  • Observability/transactions/persistence/data writes: not applicable to this metadata-only change.
  • Performance: the added per-row string checks are negligible relative to JDBC metadata IO.
  • User focus: no additional user-provided review focus was specified.

@hello-stephen

FE Regression Coverage Report

Increment line coverage 0.00% (0/87) 🎉
Increment coverage report
Complete coverage report

@JNSimba JNSimba merged commit 41581e5 into apache:master May 26, 2026
31 checks passed
yiguolei pushed a commit that referenced this pull request May 29, 2026
#63404 #63471 #63480 #63490 #63514 #63618 (#63812)

Cherry-picked from:

- #63079 [improve](streaming-job) async chunk splitting for cdc source job
- #63404 [test](streaming-job) refine cdc data-type and boundary regression cases for mysql/pg
- #63471 [regression-test](streaming-job) add cdc cases (composite/concurrent-dml/id-gap/decimal/datetime pk) and fix split-bound java.time deserialize
- #63480 [fix](streaming-job) misc fixes for typo/log/validation/visibility
- #63402 [fix](streaming-job) drop neighbour-table rows leaked by JDBC LIKE wildcards in JdbcPostgreSQLClient
- #63514 [regression-test](streaming-job) add cdc operational cases for offset modes and pg slot lifecycle
- #63618 [fix](streaming-job) fix postgres historical-date timestamp handling in cdc-client
- #63490 [improve](streaming-job) support user-specified mysql server_id with per-reader assignment
zhaorongsheng pushed a commit to zhaorongsheng/doris that referenced this pull request Jun 4, 2026
…ldcards in JdbcPostgreSQLClient (apache#63402)