[fix](be) Avoid local runtime filter merge deadlock #64866

Open
BiteTheDDDDt wants to merge 4 commits into apache:master from BiteTheDDDDt:codex/rf-local-merge-locks

Conversation

@BiteTheDDDDt

Contributor

What problem does this PR solve?

Issue Number: None

Related PR: None

Problem Summary: Local runtime filter merge can deadlock when one join build instance publishes a local-merge runtime filter while another instance sends its runtime filter size. The old local merge context lock protected both the merger and the producer list, so one path could hold a producer runtime filter lock and then wait for the context lock while another path held the context lock and then waited for a producer lock.
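
The lock-order inversion described above can be reproduced in miniature. The sketch below is not Doris code: `DemoLocks`, `DemoBarriers`, `acquire_in_order`, and `run_inverted_order_demo` are hypothetical names, and `std::timed_mutex` with a timeout stands in for the real locks so the demo terminates instead of hanging. One thread models the publish path (producer lock, then context lock) and the other models the send-size path (context lock, then producer lock).

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <mutex>
#include <thread>

// Hypothetical stand-ins for the two locks described in the summary.
struct DemoLocks {
    std::timed_mutex context_lock;   // models the old local merge context lock
    std::timed_mutex producer_lock;  // models one producer runtime filter lock
};

// Counters used as simple two-thread barriers.
struct DemoBarriers {
    std::atomic<int> holding_first{0};
    std::atomic<int> finished_try{0};
};

// Announces arrival, then spins until both threads have arrived.
inline void wait_for_both(std::atomic<int>& counter) {
    counter.fetch_add(1);
    while (counter.load() < 2) {
        std::this_thread::yield();
    }
}

// Holds `first`, waits until the other thread holds its first lock too, then
// tries `second` with a timeout. Returns true if `second` was obtained.
inline bool acquire_in_order(std::timed_mutex& first, std::timed_mutex& second,
                             DemoBarriers& barriers) {
    std::lock_guard<std::timed_mutex> hold_first(first);
    wait_for_both(barriers.holding_first);
    // A plain lock() here would block forever; the timeout lets the sketch finish.
    bool got_second = second.try_lock_for(std::chrono::milliseconds(50));
    if (got_second) {
        second.unlock();
    }
    // Keep `first` held until both attempts are over, as a real deadlock would.
    wait_for_both(barriers.finished_try);
    return got_second;
}

// Returns how many of the two paths managed to take their second lock.
inline int run_inverted_order_demo() {
    DemoLocks locks;
    DemoBarriers barriers;
    bool publish_ok = false;
    bool send_size_ok = false;
    std::thread publish_path([&] {
        publish_ok = acquire_in_order(locks.producer_lock, locks.context_lock, barriers);
    });
    std::thread send_size_path([&] {
        send_size_ok = acquire_in_order(locks.context_lock, locks.producer_lock, barriers);
    });
    publish_path.join();
    send_size_path.join();
    return (publish_ok ? 1 : 0) + (send_size_ok ? 1 : 0);
}
```

Calling `run_inverted_order_demo()` returns 0 in this sketch: each thread still holds the lock the other one needs, so neither attempt succeeds. With untimed locks, the same interleaving would block both threads indefinitely.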

This change gives RuntimeFilterMerger its own internal synchronization and makes LocalMergeContext expose a snapshot of the merger and producers. Publish, send-size, and sync-size paths take the context lock only while copying that snapshot, then merge filters or update producer sizes outside the context lock. RuntimeFilterMerger returns the ready transition from merge_from directly, removing the separate unlocked ready check.
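
A minimal sketch of the snapshot approach, under stated assumptions: the classes below are simplified, hypothetical models named `SketchProducer`, `SketchMerger`, and `SketchLocalMergeContext`, not the real `RuntimeFilterMerger` or `LocalMergeContext`, and the `merge_from(value, &ready)` signature is only an approximation of the shape mentioned in this PR. The point it illustrates is that the context lock is held only while copying shared pointers, so producer and merger locks are never taken under it.

```cpp
#include <cassert>
#include <memory>
#include <mutex>
#include <utility>
#include <vector>

// Hypothetical producer that guards its own state with its own lock.
class SketchProducer {
public:
    void set_synced_size(long size) {
        std::lock_guard<std::mutex> guard(_lock);
        _synced_size = size;
    }
    long synced_size() const {
        std::lock_guard<std::mutex> guard(_lock);
        return _synced_size;
    }

private:
    mutable std::mutex _lock;
    long _synced_size = -1;
};

// Hypothetical merger with internal synchronization.
class SketchMerger {
public:
    explicit SketchMerger(int expected) : _remaining(expected) {}
    // Sets *ready to true only for the call that completes the merge, so the
    // caller needs no separate, unlocked readiness check afterwards.
    void merge_from(long value, bool* ready) {
        std::lock_guard<std::mutex> guard(_lock);
        _merged += value;
        --_remaining;
        *ready = (_remaining == 0);
    }

private:
    std::mutex _lock;
    long _merged = 0;
    int _remaining;
};

struct SketchSnapshot {
    std::shared_ptr<SketchMerger> merger;
    std::vector<std::shared_ptr<SketchProducer>> producers;
};

class SketchLocalMergeContext {
public:
    explicit SketchLocalMergeContext(int expected)
            : _merger(std::make_shared<SketchMerger>(expected)) {}
    void add_producer(std::shared_ptr<SketchProducer> producer) {
        std::lock_guard<std::mutex> guard(_lock);
        _producers.push_back(std::move(producer));
    }
    // The context lock is held only for the duration of this copy.
    SketchSnapshot snapshot() const {
        std::lock_guard<std::mutex> guard(_lock);
        return SketchSnapshot{_merger, _producers};
    }

private:
    mutable std::mutex _lock;
    std::shared_ptr<SketchMerger> _merger;
    std::vector<std::shared_ptr<SketchProducer>> _producers;
};

// Publish-like path: merge outside the context lock, return the ready transition.
inline bool publish(const SketchLocalMergeContext& ctx, long value) {
    SketchSnapshot snap = ctx.snapshot();
    bool ready = false;
    snap.merger->merge_from(value, &ready);
    return ready;
}

// Sync-size-like path: update producers outside the context lock.
inline void sync_size(const SketchLocalMergeContext& ctx, long size) {
    SketchSnapshot snap = ctx.snapshot();
    for (auto& producer : snap.producers) {
        producer->set_synced_size(size);
    }
}
```

With two expected inputs, the first `publish` call returns false and the second returns true, and a later `sync_size(ctx, 30)` updates every registered producer without the context lock and a producer lock ever being held at the same time.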

Release note

None

Check List (For Author)

  • Test: Unit Test
    • build-support/clang-format.sh be/src/exec/runtime_filter/runtime_filter_merger.h be/src/exec/runtime_filter/runtime_filter_mgr.cpp be/src/exec/runtime_filter/runtime_filter_mgr.h be/src/exec/runtime_filter/runtime_filter_producer.cpp be/test/exec/runtime_filter/runtime_filter_merger_test.cpp be/test/exec/runtime_filter/runtime_filter_mgr_test.cpp
    • git diff --cached --check
    • ./run-be-ut.sh --run --filter=RuntimeFilterMgrTest.*
    • ./run-be-ut.sh --run --filter=RuntimeFilterMergerTest.*
  • Behavior changed: No
  • Does this need documentation: No

@hello-stephen

Contributor

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

@BiteTheDDDDt marked this pull request as ready for review June 25, 2026 15:06
@BiteTheDDDDt

Contributor Author

/review

@BiteTheDDDDt

Contributor Author

run buildall

@github-actions (bot) left a comment

Contributor

Automated review summary for PR #64866

I reviewed the runtime-filter merger/producer/manager changes and the adjacent runtime-filter tests. I did not find a substantiated issue that should be raised as an inline review comment.

Critical checkpoints:

  • Goal/test: the change targets the local runtime-filter merge deadlock by moving producer/merger work out of LocalMergeContext locking and synchronizing RuntimeFilterMerger internally. The updated BE UTs cover the adjusted merger readiness and manager API behavior. I could not run BE UTs locally because thirdparty/installed is missing in this checkout.
  • Scope/clarity: the patch is focused on runtime-filter locking, snapshots, and related unit-test updates.
  • Concurrency/lifecycle: I checked producer publish()/send_size(), local snapshots, merger locking, global merge reset, and recursive CTE reset/rebuild sequencing. I did not find a new lock-order cycle or a substantiated stale-stage publication issue.
  • Config/compatibility: no new config was added; the runtime-filter stage fields used on RPCs are existing optional protobuf fields.
  • Parallel paths: local-only, local-merge, remote merge, size-sync, stale recursive-CTE RPC handling, and local consumer signaling paths were reviewed.
  • Tests/style: git diff --check passed, and build-support/check-format.sh passed when clang-format 16 was selected in PATH. BE UT execution was not possible locally because the checkout lacks thirdparty/installed.

Subagent conclusions:

  • optimizer-rewrite proposed OPT-1 about stale recursive-CTE local publication. I dismissed it with lifecycle evidence: WAIT_FOR_DESTROY completes old PFC teardown before REBUILD registers new-stage consumers/producers, while remote stale messages are stage-checked.
  • tests-session-config reported no candidates.
  • Convergence round 1 ended with both live subagents reporting NO_NEW_VALUABLE_FINDINGS for the final no-inline-comment set.

User focus: no additional user-provided review focus was supplied.

@hello-stephen

Contributor
TPC-H: Total hot run time: 28845 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit 931224834fd4f7e16e8e35d42aacf6f6f41f6015, data reload: false

------ Round 1 ----------------------------------
============================================
q1	17689	4189	4003	4003
q2	2007	322	185	185
q3	10376	1391	830	830
q4	4681	469	340	340
q5	7486	866	582	582
q6	186	175	139	139
q7	769	847	625	625
q8	9427	1520	1519	1519
q9	5874	4510	4494	4494
q10	6741	1814	1541	1541
q11	438	270	241	241
q12	626	416	294	294
q13	18125	3447	2778	2778
q14	286	269	251	251
q15	q16	796	774	708	708
q17	1156	925	886	886
q18	6944	5743	5477	5477
q19	1239	1263	985	985
q20	506	397	263	263
q21	5738	2693	2400	2400
q22	439	359	304	304
Total cold run time: 101529 ms
Total hot run time: 28845 ms

----- Round 2, with runtime_filter_mode=off -----
============================================
q1	4339	4267	4266	4266
q2	315	338	224	224
q3	4612	4944	4436	4436
q4	2071	2175	1375	1375
q5	4465	4323	4320	4320
q6	239	183	129	129
q7	1754	1643	1813	1643
q8	2550	2162	2185	2162
q9	8086	8126	8055	8055
q10	4814	4743	4294	4294
q11	599	418	387	387
q12	740	749	542	542
q13	3341	3479	2986	2986
q14	299	319	294	294
q15	q16	709	727	651	651
q17	1338	1326	1328	1326
q18	8060	7429	7156	7156
q19	1139	1108	1122	1108
q20	2291	2250	1948	1948
q21	5312	4593	4455	4455
q22	510	451	400	400
Total cold run time: 57583 ms
Total hot run time: 52157 ms

@hello-stephen

Contributor
TPC-DS: Total hot run time: 173691 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit 931224834fd4f7e16e8e35d42aacf6f6f41f6015, data reload: false

query5	4347	628	497	497
query6	437	186	183	183
query7	4821	559	310	310
query8	321	193	181	181
query9	8764	4137	4109	4109
query10	436	314	265	265
query11	5875	2364	2154	2154
query12	156	101	104	101
query13	1319	599	446	446
query14	6181	5400	5105	5105
query14_1	4429	4391	4361	4361
query15	216	208	175	175
query16	996	460	456	456
query17	1105	706	578	578
query18	2446	474	337	337
query19	200	184	145	145
query20	113	108	105	105
query21	214	147	123	123
query22	13650	13537	13376	13376
query23	17407	16607	16184	16184
query23_1	16298	16240	16213	16213
query24	7481	1779	1302	1302
query24_1	1340	1320	1322	1320
query25	550	443	372	372
query26	1332	320	174	174
query27	2714	556	351	351
query28	4484	2026	1981	1981
query29	1081	624	484	484
query30	322	243	199	199
query31	1112	1094	954	954
query32	101	61	58	58
query33	529	323	252	252
query34	1184	1159	649	649
query35	787	784	688	688
query36	1380	1405	1236	1236
query37	158	109	108	108
query38	1915	1720	1692	1692
query39	937	932	894	894
query39_1	870	878	888	878
query40	222	127	107	107
query41	86	70	70	70
query42	97	90	88	88
query43	331	343	296	296
query44	1428	814	799	799
query45	205	215	192	192
query46	1094	1236	758	758
query47	2364	2307	2231	2231
query48	398	420	288	288
query49	580	426	321	321
query50	986	345	265	265
query51	4589	4423	4384	4384
query52	87	84	72	72
query53	263	277	192	192
query54	283	234	219	219
query55	76	79	69	69
query56	243	236	245	236
query57	1416	1428	1329	1329
query58	256	228	226	226
query59	1657	1748	1509	1509
query60	305	262	246	246
query61	177	179	176	176
query62	696	656	592	592
query63	231	195	214	195
query64	2561	807	647	647
query65	4899	4806	4828	4806
query66	1821	523	331	331
query67	28828	28813	28741	28741
query68	3120	1511	974	974
query69	423	305	277	277
query70	1114	972	957	957
query71	288	238	213	213
query72	2902	2636	2302	2302
query73	862	765	422	422
query74	5098	4952	4754	4754
query75	2566	2553	2209	2209
query76	2318	1237	809	809
query77	354	396	293	293
query78	12489	12577	11756	11756
query79	1395	1254	797	797
query80	586	459	388	388
query81	452	279	242	242
query82	581	168	120	120
query83	360	295	265	265
query84	263	145	115	115
query85	846	552	417	417
query86	359	292	289	289
query87	1856	1863	1773	1773
query88	3728	2786	2790	2786
query89	427	388	348	348
query90	1951	185	188	185
query91	174	166	131	131
query92	60	62	56	56
query93	1603	1546	901	901
query94	557	359	318	318
query95	705	474	349	349
query96	1116	825	340	340
query97	2689	2700	2568	2568
query98	214	211	202	202
query99	1191	1149	1021	1021
Total cold run time: 257253 ms
Total hot run time: 173691 ms

@hello-stephen

Contributor
ClickBench: Total hot run time: 25.24 s
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/clickbench-tools
ClickBench test result on commit 931224834fd4f7e16e8e35d42aacf6f6f41f6015, data reload: false

query1	0.01	0.01	0.00
query2	0.09	0.05	0.06
query3	0.25	0.14	0.14
query4	1.62	0.13	0.13
query5	0.24	0.23	0.23
query6	1.25	1.06	1.02
query7	0.03	0.00	0.01
query8	0.10	0.04	0.04
query9	0.38	0.31	0.32
query10	0.57	0.55	0.54
query11	0.18	0.15	0.15
query12	0.19	0.14	0.14
query13	0.48	0.47	0.48
query14	1.02	1.01	1.00
query15	0.61	0.59	0.60
query16	0.32	0.31	0.30
query17	1.10	1.11	1.12
query18	0.22	0.22	0.20
query19	1.99	1.94	1.98
query20	0.02	0.01	0.01
query21	15.42	0.23	0.13
query22	4.84	0.05	0.05
query23	16.14	0.32	0.12
query24	2.95	0.41	0.34
query25	0.12	0.05	0.05
query26	0.71	0.20	0.14
query27	0.04	0.04	0.03
query28	3.55	0.96	0.54
query29	12.44	4.32	3.46
query30	0.27	0.15	0.15
query31	2.77	0.60	0.32
query32	3.21	0.59	0.49
query33	3.24	3.23	3.32
query34	16.04	4.25	3.52
query35	3.52	3.56	3.53
query36	0.54	0.44	0.43
query37	0.09	0.07	0.06
query38	0.04	0.04	0.03
query39	0.04	0.03	0.03
query40	0.17	0.15	0.16
query41	0.10	0.03	0.02
query42	0.04	0.02	0.03
query43	0.04	0.04	0.03
Total cold run time: 96.99 s
Total hot run time: 25.24 s

@hello-stephen

Contributor

BE Regression && UT Coverage Report

Increment line coverage 95.92% (94/98) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 74.20% (28463/38361)
Line Coverage 58.05% (310076/534137)
Region Coverage 54.71% (258946/473277)
Branch Coverage 56.09% (112577/200695)

@BiteTheDDDDt

Contributor Author

run buildall

@BiteTheDDDDt

Contributor Author

/review

@github-actions (bot) left a comment

Contributor

Automated review summary for PR #64866.

I reviewed the six changed runtime-filter files and the relevant producer/helper, manager, pipeline prepare/submit, recursive CTE rerun, and sync/publish paths. I did not find a substantiated issue requiring an inline review comment.

Critical checkpoint conclusions:

  • Goal: the PR targets a local runtime filter merge deadlock by moving merge/size work out from under the old context lock and giving RuntimeFilterMerger its own synchronization. The changed merge_from(..., &ready), size aggregation, shared context lookup, and producer publish/send-size call sites are consistent with that goal.
  • Scope and focus: the change is confined to BE runtime-filter manager/producer/merger code and matching BE unit-test API updates.
  • Concurrency/lifecycle: I specifically checked producer registration, send-size, publish, global merge, and recursive CTE stage reset ordering. Producer registration happens during task prepare before fragment submit; parallel prepare waits for all instances; recursive CTE runs WAIT_FOR_DESTROY -> reset_global_rf -> REBUILD/prepare -> SUBMIT; stale-stage requests are filtered through stage checks. I did not find an executable same-stage registration/read race in the reviewed paths.
  • Configuration, protocol, storage, persistence: no new config, thrift/protobuf field, storage format, or persistence behavior is introduced.
  • Parallel paths: local merge publish, local size sync, and global merge-controller merge paths were updated to use the new ready transition. I did not find a missed parallel runtime-filter path in the changed surface.
  • Tests: the PR updates BE unit tests for the new merge_from ready result and local merge context API. I could not run BE UT in this checkout because thirdparty/installed is absent. I did run git diff --check on the exact PR range, and it passed.
  • Observability/security: no security-sensitive path is changed; debug output is preserved through the new LocalMergeContext::debug_string() path.
  • User focus: no additional user-provided review focus was supplied.

Subagent conclusions:

  • optimizer-rewrite: no optimizer/rewrite or runtime-filter semantic regression found.
  • tests-session-config: raised TEST-1 about unsynchronized LocalMergeContext::producers reads. I dismissed it with lifecycle evidence above; it remains a useful concurrent-regression test gap but not a substantiated inline issue.
  • Convergence round 1 ended with both live subagents reporting NO_NEW_VALUABLE_FINDINGS for the same ledger and empty proposed inline comment set.
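
One way the concurrent-regression gap noted under TEST-1 could be probed is a test that registers producers while another thread keeps reading the producer list. The sketch below is a hedged illustration only: `SketchProducerRegistry` and `register_while_reading` are hypothetical names, producers are modeled as plain integers, and the sketch assumes both registration and reads go through one lock, which may not match the real `LocalMergeContext`. Such a test is most useful when run under a race detector such as ThreadSanitizer.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical registry: registration and snapshot reads share one lock.
class SketchProducerRegistry {
public:
    void add_producer(int id) {
        std::lock_guard<std::mutex> guard(_lock);
        _producers.push_back(id);
    }
    std::vector<int> snapshot() const {
        std::lock_guard<std::mutex> guard(_lock);
        return _producers;
    }

private:
    mutable std::mutex _lock;
    std::vector<int> _producers;
};

// Registers `num_writers` producers concurrently while a reader thread keeps
// taking snapshots. Returns the final producer count, or 0 if the reader ever
// observed more producers than were registered.
inline std::size_t register_while_reading(int num_writers) {
    SketchProducerRegistry registry;
    const std::size_t expected = static_cast<std::size_t>(num_writers);
    std::atomic<bool> stop{false};
    std::atomic<bool> saw_oversized{false};

    std::thread reader([&] {
        while (!stop.load()) {
            if (registry.snapshot().size() > expected) {
                saw_oversized.store(true);
            }
        }
    });

    std::vector<std::thread> writers;
    writers.reserve(expected);
    for (int i = 0; i < num_writers; ++i) {
        writers.emplace_back([&registry, i] { registry.add_producer(i); });
    }
    for (auto& writer : writers) {
        writer.join();
    }
    stop.store(true);
    reader.join();

    return saw_oversized.load() ? 0 : registry.snapshot().size();
}
```

In this sketch `register_while_reading(8)` returns 8. The assertion on the count is a weak check by itself; the value of such a test comes from the race detector flagging any unsynchronized access between the reader and the writers.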

@hello-stephen

Contributor
TPC-H: Total hot run time: 29309 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit e41df92c573a1d85a6590ea49bff4d83ac4bdd9b, data reload: false

------ Round 1 ----------------------------------
============================================
q1	17783	4154	4058	4058
q2	2007	329	190	190
q3	10306	1510	845	845
q4	4685	472	353	353
q5	7512	879	572	572
q6	183	172	136	136
q7	798	914	624	624
q8	9353	1521	1516	1516
q9	5597	4519	4554	4519
q10	6744	1795	1525	1525
q11	439	277	245	245
q12	628	416	301	301
q13	18099	3292	2829	2829
q14	261	258	245	245
q15	q16	795	772	717	717
q17	1019	1008	999	999
q18	7013	5739	5639	5639
q19	1192	1286	1013	1013
q20	506	413	277	277
q21	5668	2655	2399	2399
q22	435	360	307	307
Total cold run time: 101023 ms
Total hot run time: 29309 ms

----- Round 2, with runtime_filter_mode=off -----
============================================
q1	4420	4330	4314	4314
q2	317	355	229	229
q3	4584	5040	4377	4377
q4	2379	2166	1384	1384
q5	4479	4339	4339	4339
q6	231	174	131	131
q7	1772	1663	2056	1663
q8	2625	2213	2214	2213
q9	8410	8513	8239	8239
q10	4808	4863	4304	4304
q11	568	411	375	375
q12	772	767	543	543
q13	3334	3522	2993	2993
q14	284	297	282	282
q15	q16	719	736	649	649
q17	1350	1339	1577	1339
q18	7952	7339	7302	7302
q19	1191	1086	1135	1086
q20	2300	2236	1981	1981
q21	5325	4658	4571	4571
q22	511	468	391	391
Total cold run time: 58331 ms
Total hot run time: 52705 ms

@hello-stephen

Contributor
TPC-DS: Total hot run time: 171988 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit e41df92c573a1d85a6590ea49bff4d83ac4bdd9b, data reload: false

query5	4321	636	516	516
query6	439	194	166	166
query7	4914	532	307	307
query8	324	183	162	162
query9	8747	4068	4104	4068
query10	417	331	264	264
query11	5860	2359	2166	2166
query12	161	102	100	100
query13	1250	642	440	440
query14	6244	5337	5010	5010
query14_1	4358	4314	4333	4314
query15	207	199	180	180
query16	994	436	417	417
query17	1100	702	550	550
query18	2423	466	336	336
query19	196	180	150	150
query20	107	103	110	103
query21	211	138	117	117
query22	13813	13648	13493	13493
query23	17399	16507	16048	16048
query23_1	16211	16386	16268	16268
query24	7803	1759	1284	1284
query24_1	1309	1296	1293	1293
query25	536	447	371	371
query26	1328	328	176	176
query27	2632	565	352	352
query28	4439	2047	2030	2030
query29	1072	619	507	507
query30	311	229	205	205
query31	1131	1087	972	972
query32	114	65	63	63
query33	539	330	262	262
query34	1163	1202	656	656
query35	769	785	670	670
query36	1394	1460	1285	1285
query37	164	110	98	98
query38	1900	1723	1663	1663
query39	944	926	896	896
query39_1	878	876	868	868
query40	231	127	107	107
query41	70	75	68	68
query42	91	90	89	89
query43	326	332	283	283
query44	1450	806	807	806
query45	209	195	182	182
query46	1071	1226	762	762
query47	2349	2336	2198	2198
query48	412	441	291	291
query49	587	441	320	320
query50	969	352	263	263
query51	4406	4389	4379	4379
query52	84	85	72	72
query53	258	268	197	197
query54	283	232	214	214
query55	76	74	68	68
query56	274	246	238	238
query57	1455	1407	1304	1304
query58	247	231	221	221
query59	1589	1635	1468	1468
query60	287	264	291	264
query61	150	145	155	145
query62	697	641	582	582
query63	226	193	206	193
query64	2570	784	592	592
query65	4867	4786	4754	4754
query66	1825	446	342	342
query67	29029	28887	28749	28749
query68	3132	1590	1007	1007
query69	407	303	264	264
query70	1068	970	967	967
query71	300	232	211	211
query72	2937	2652	2386	2386
query73	840	777	426	426
query74	5120	4970	4797	4797
query75	2580	2548	2173	2173
query76	2305	1164	780	780
query77	365	405	284	284
query78	12463	12376	12067	12067
query79	1440	1207	807	807
query80	736	464	392	392
query81	478	275	237	237
query82	568	153	119	119
query83	345	286	256	256
query84	266	146	115	115
query85	888	513	413	413
query86	409	304	273	273
query87	1855	1833	1779	1779
query88	3768	2839	2799	2799
query89	427	387	344	344
query90	1788	175	179	175
query91	173	162	132	132
query92	62	61	54	54
query93	1586	1435	914	914
query94	618	349	338	338
query95	668	373	430	373
query96	1047	820	341	341
query97	2705	2662	2564	2564
query98	215	241	209	209
query99	1201	1151	1046	1046
Total cold run time: 257138 ms
Total hot run time: 171988 ms

@hello-stephen

Contributor
ClickBench: Total hot run time: 25.22 s
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/clickbench-tools
ClickBench test result on commit e41df92c573a1d85a6590ea49bff4d83ac4bdd9b, data reload: false

query1	0.00	0.00	0.01
query2	0.10	0.05	0.05
query3	0.25	0.13	0.12
query4	1.62	0.15	0.18
query5	0.24	0.22	0.22
query6	1.25	1.08	1.08
query7	0.04	0.01	0.00
query8	0.06	0.04	0.03
query9	0.38	0.36	0.31
query10	0.54	0.53	0.54
query11	0.19	0.15	0.15
query12	0.19	0.16	0.15
query13	0.48	0.47	0.49
query14	1.00	1.00	1.00
query15	0.61	0.59	0.61
query16	0.33	0.32	0.33
query17	1.10	1.11	1.11
query18	0.23	0.21	0.22
query19	1.97	1.88	2.01
query20	0.01	0.02	0.01
query21	15.51	0.24	0.13
query22	4.60	0.05	0.05
query23	16.13	0.30	0.13
query24	3.03	0.41	0.32
query25	0.10	0.05	0.06
query26	0.74	0.22	0.14
query27	0.05	0.03	0.04
query28	3.48	0.93	0.54
query29	12.56	4.24	3.47
query30	0.27	0.14	0.16
query31	2.78	0.62	0.31
query32	3.22	0.61	0.49
query33	3.21	3.21	3.30
query34	15.58	4.24	3.54
query35	3.56	3.52	3.53
query36	0.56	0.45	0.41
query37	0.08	0.06	0.06
query38	0.05	0.04	0.04
query39	0.03	0.03	0.03
query40	0.17	0.16	0.15
query41	0.09	0.03	0.02
query42	0.04	0.03	0.02
query43	0.04	0.04	0.04
Total cold run time: 96.47 s
Total hot run time: 25.22 s

Comment thread be/src/exec/runtime_filter/runtime_filter_mgr.cpp
Comment thread be/src/exec/runtime_filter/runtime_filter_mgr.cpp Outdated
@yiguolei added the usercase (Important user case type label) and p0_b labels Jun 26, 2026
@hello-stephen

Contributor

BE UT Coverage Report

Increment line coverage 59.76% (49/82) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 54.88% (21596/39350)
Line Coverage 38.40% (206506/537725)
Region Coverage 34.48% (162527/471367)
Branch Coverage 35.49% (71168/200537)

@BiteTheDDDDt

Contributor Author

run buildall

@hello-stephen

Contributor
TPC-H: Total hot run time: 29536 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit 7d55640cc36903ac0028744526361fccf067a6ec, data reload: false

------ Round 1 ----------------------------------
============================================
q1	17627	4267	4249	4249
q2	2047	326	189	189
q3	10258	1484	865	865
q4	4683	471	335	335
q5	7514	848	579	579
q6	189	171	142	142
q7	816	866	625	625
q8	9366	1683	1599	1599
q9	5617	4528	4503	4503
q10	6786	1774	1515	1515
q11	438	283	247	247
q12	634	433	300	300
q13	18091	3539	2814	2814
q14	274	260	240	240
q15	q16	790	777	714	714
q17	1037	996	982	982
q18	7241	5857	5582	5582
q19	1320	1349	1134	1134
q20	484	415	266	266
q21	5931	2672	2354	2354
q22	453	375	302	302
Total cold run time: 101596 ms
Total hot run time: 29536 ms

----- Round 2, with runtime_filter_mode=off -----
============================================
q1	4510	4440	4504	4440
q2	324	354	239	239
q3	4632	4980	4457	4457
q4	2091	2195	1394	1394
q5	4495	4401	4383	4383
q6	235	179	132	132
q7	1888	2143	1783	1783
q8	2721	2331	2399	2331
q9	8492	8182	8241	8182
q10	4827	4793	4319	4319
q11	604	418	394	394
q12	758	780	532	532
q13	3259	3765	2995	2995
q14	311	301	278	278
q15	q16	730	747	662	662
q17	1401	1361	1364	1361
q18	8194	7468	7366	7366
q19	1218	1108	1099	1099
q20	2224	2229	1956	1956
q21	5339	4726	4643	4643
q22	542	465	406	406
Total cold run time: 58795 ms
Total hot run time: 53352 ms

@hello-stephen

Contributor
TPC-DS: Total hot run time: 171619 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit 7d55640cc36903ac0028744526361fccf067a6ec, data reload: false

query5	4338	628	469	469
query6	446	191	170	170
query7	4817	554	318	318
query8	333	186	170	170
query9	8763	4015	4006	4006
query10	451	313	258	258
query11	5966	2356	2152	2152
query12	157	103	106	103
query13	1312	641	428	428
query14	6291	5218	4947	4947
query14_1	4265	4252	4273	4252
query15	211	197	179	179
query16	991	455	429	429
query17	946	677	566	566
query18	2427	464	330	330
query19	202	179	141	141
query20	111	110	103	103
query21	234	134	114	114
query22	13605	13598	13461	13461
query23	17351	16699	16148	16148
query23_1	16369	16270	16195	16195
query24	7581	1780	1300	1300
query24_1	1330	1339	1314	1314
query25	572	478	398	398
query26	1301	334	167	167
query27	2672	559	350	350
query28	4531	2050	2048	2048
query29	1109	625	500	500
query30	313	240	203	203
query31	1113	1080	958	958
query32	113	62	63	62
query33	536	336	274	274
query34	1188	1166	651	651
query35	795	786	670	670
query36	1375	1458	1254	1254
query37	159	111	95	95
query38	1900	1719	1680	1680
query39	936	920	890	890
query39_1	863	876	870	870
query40	230	127	104	104
query41	75	69	70	69
query42	92	88	87	87
query43	322	323	282	282
query44	1467	801	792	792
query45	215	198	179	179
query46	1130	1213	736	736
query47	2362	2355	2256	2256
query48	394	392	308	308
query49	633	439	334	334
query50	1000	367	268	268
query51	4402	4340	4391	4340
query52	85	84	73	73
query53	258	264	193	193
query54	287	237	214	214
query55	80	73	71	71
query56	265	237	238	237
query57	1465	1399	1331	1331
query58	257	229	221	221
query59	1590	1618	1428	1428
query60	337	254	239	239
query61	179	172	210	172
query62	685	651	589	589
query63	228	191	195	191
query64	2546	769	598	598
query65	4871	4797	4749	4749
query66	1787	468	344	344
query67	28978	28774	28646	28646
query68	3087	1592	973	973
query69	407	303	269	269
query70	1041	956	945	945
query71	295	238	206	206
query72	2946	2599	2380	2380
query73	849	778	420	420
query74	5130	4938	4719	4719
query75	2583	2570	2165	2165
query76	2319	1206	799	799
query77	350	368	276	276
query78	12409	12266	11793	11793
query79	1366	1142	780	780
query80	1302	466	415	415
query81	508	278	245	245
query82	569	153	122	122
query83	359	271	245	245
query84	305	148	115	115
query85	920	553	417	417
query86	416	301	301	301
query87	1864	1840	1785	1785
query88	3671	2771	2772	2771
query89	424	386	352	352
query90	1897	178	178	178
query91	174	157	132	132
query92	63	62	57	57
query93	1491	1454	896	896
query94	755	354	311	311
query95	695	480	367	367
query96	1075	798	360	360
query97	2686	2689	2555	2555
query98	215	227	196	196
query99	1207	1165	1054	1054
Total cold run time: 257899 ms
Total hot run time: 171619 ms

@hello-stephen

Contributor
ClickBench: Total hot run time: 25.3 s
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/clickbench-tools
ClickBench test result on commit 7d55640cc36903ac0028744526361fccf067a6ec, data reload: false

query1	0.01	0.01	0.00
query2	0.10	0.05	0.06
query3	0.25	0.14	0.14
query4	1.61	0.14	0.17
query5	0.24	0.23	0.23
query6	1.28	1.08	1.01
query7	0.03	0.00	0.00
query8	0.06	0.04	0.04
query9	0.38	0.31	0.33
query10	0.56	0.56	0.56
query11	0.19	0.15	0.15
query12	0.18	0.15	0.14
query13	0.49	0.48	0.49
query14	1.02	1.03	1.02
query15	0.62	0.60	0.60
query16	0.34	0.34	0.34
query17	1.17	1.12	1.16
query18	0.22	0.21	0.20
query19	2.03	1.94	2.00
query20	0.02	0.01	0.01
query21	15.49	0.25	0.15
query22	4.74	0.04	0.05
query23	16.13	0.31	0.13
query24	3.05	0.40	0.32
query25	0.12	0.05	0.04
query26	0.75	0.20	0.14
query27	0.04	0.04	0.04
query28	3.51	0.89	0.55
query29	12.54	4.36	3.47
query30	0.26	0.15	0.16
query31	2.78	0.61	0.31
query32	3.21	0.62	0.49
query33	3.16	3.27	3.19
query34	15.47	4.28	3.52
query35	3.60	3.49	3.56
query36	0.56	0.43	0.43
query37	0.09	0.07	0.06
query38	0.06	0.04	0.04
query39	0.04	0.03	0.03
query40	0.18	0.17	0.15
query41	0.09	0.04	0.03
query42	0.04	0.03	0.02
query43	0.04	0.03	0.03
Total cold run time: 96.75 s
Total hot run time: 25.3 s

@hello-stephen

Contributor

BE UT Coverage Report

Increment line coverage 58.14% (50/86) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 54.88% (21594/39350)
Line Coverage 38.37% (206329/537729)
Region Coverage 34.46% (162454/471380)
Branch Coverage 35.47% (71135/200543)

@hello-stephen

Contributor

BE Regression && UT Coverage Report

Increment line coverage 94.19% (81/86) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 74.18% (28492/38407)
Line Coverage 58.04% (310228/534530)
Region Coverage 54.89% (259888/473480)
Branch Coverage 56.18% (112811/200786)

@BiteTheDDDDt

Contributor Author

run buildall

@hello-stephen

Contributor
TPC-H: Total hot run time: 29130 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit 7d55640cc36903ac0028744526361fccf067a6ec, data reload: false

------ Round 1 ----------------------------------
============================================
q1	17927	4109	4002	4002
q2	2040	313	189	189
q3	10273	1433	815	815
q4	4689	470	343	343
q5	7576	837	577	577
q6	190	167	137	137
q7	769	839	609	609
q8	9676	1622	1579	1579
q9	5898	4445	4500	4445
q10	6774	1827	1485	1485
q11	445	274	244	244
q12	634	415	312	312
q13	18066	3378	2718	2718
q14	267	273	257	257
q15	q16	789	770	708	708
q17	1089	987	1015	987
q18	7094	5700	5605	5605
q19	1248	1174	1089	1089
q20	478	406	272	272
q21	5792	2518	2456	2456
q22	448	358	301	301
Total cold run time: 102162 ms
Total hot run time: 29130 ms

----- Round 2, with runtime_filter_mode=off -----
============================================
q1	4362	4316	4308	4308
q2	325	345	227	227
q3	4508	4912	4426	4426
q4	2100	2160	1386	1386
q5	4475	4351	4325	4325
q6	239	180	135	135
q7	1751	1809	1739	1739
q8	2604	2282	2148	2148
q9	8136	8121	8188	8121
q10	4830	4769	4300	4300
q11	567	445	386	386
q12	743	754	631	631
q13	3484	3668	3008	3008
q14	300	313	283	283
q15	q16	716	754	633	633
q17	1368	1361	1324	1324
q18	8070	7338	6837	6837
q19	1137	1102	1100	1100
q20	2235	2234	1968	1968
q21	5334	4675	4539	4539
q22	524	458	408	408
Total cold run time: 57808 ms
Total hot run time: 52232 ms

@hello-stephen

Contributor
TPC-DS: Total hot run time: 172406 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit 7d55640cc36903ac0028744526361fccf067a6ec, data reload: false

query5	4333	670	484	484
query6	441	199	166	166
query7	4984	600	311	311
query8	327	179	167	167
query9	8775	4077	4096	4077
query10	455	313	264	264
query11	5938	2341	2148	2148
query12	160	106	106	106
query13	1274	600	447	447
query14	6229	5358	4992	4992
query14_1	4292	4306	4296	4296
query15	209	207	181	181
query16	1056	463	447	447
query17	1149	755	587	587
query18	2516	487	343	343
query19	212	189	147	147
query20	118	115	111	111
query21	219	139	118	118
query22	13749	13569	13525	13525
query23	17473	16479	16148	16148
query23_1	16372	16276	16271	16271
query24	7487	1757	1266	1266
query24_1	1328	1297	1296	1296
query25	562	476	396	396
query26	1298	321	166	166
query27	2696	567	326	326
query28	4463	2033	2016	2016
query29	1093	631	503	503
query30	310	242	206	206
query31	1124	1066	945	945
query32	106	62	61	61
query33	527	327	259	259
query34	1165	1135	641	641
query35	786	813	688	688
query36	1403	1414	1232	1232
query37	158	108	108	108
query38	1889	1725	1678	1678
query39	952	926	894	894
query39_1	874	886	898	886
query40	227	164	97	97
query41	65	66	62	62
query42	94	87	86	86
query43	321	319	279	279
query44	1478	826	820	820
query45	205	200	181	181
query46	1148	1249	757	757
query47	2520	2428	2265	2265
query48	419	419	279	279
query49	581	421	305	305
query50	995	351	259	259
query51	4365	4391	4426	4391
query52	80	80	75	75
query53	254	261	196	196
query54	266	220	196	196
query55	73	73	66	66
query56	269	224	229	224
query57	1445	1429	1308	1308
query58	240	219	205	205
query59	1590	1635	1423	1423
query60	291	249	231	231
query61	148	156	154	154
query62	706	646	581	581
query63	233	193	194	193
query64	2519	769	622	622
query65	4851	4761	4799	4761
query66	1790	472	358	358
query67	28921	28787	28604	28604
query68	3131	1457	844	844
query69	404	308	264	264
query70	1152	929	987	929
query71	286	229	226	226
query72	2918	2639	2425	2425
query73	806	764	447	447
query74	5111	4971	4829	4829
query75	2570	2527	2199	2199
query76	2317	1186	769	769
query77	371	386	290	290
query78	12395	12580	11897	11897
query79	1413	1141	759	759
query80	604	466	390	390
query81	451	281	237	237
query82	568	154	121	121
query83	331	271	257	257
query84	271	145	114	114
query85	871	528	433	433
query86	366	299	282	282
query87	1851	1841	1764	1764
query88	3717	2811	2794	2794
query89	435	384	336	336
query90	1945	180	176	176
query91	170	159	133	133
query92	66	61	56	56
query93	1469	1633	936	936
query94	561	342	309	309
query95	681	466	353	353
query96	1063	835	362	362
query97	2713	2687	2547	2547
query98	217	213	208	208
query99	1174	1168	1064	1064
Total cold run time: 257392 ms
Total hot run time: 172406 ms

@hello-stephen

Contributor
ClickBench: Total hot run time: 25.26 s
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/clickbench-tools
ClickBench test result on commit 7d55640cc36903ac0028744526361fccf067a6ec, data reload: false

query1	0.00	0.00	0.00
query2	0.10	0.06	0.05
query3	0.25	0.14	0.14
query4	1.62	0.14	0.14
query5	0.25	0.28	0.22
query6	1.25	1.12	1.09
query7	0.03	0.01	0.00
query8	0.06	0.04	0.04
query9	0.37	0.31	0.32
query10	0.58	0.55	0.53
query11	0.19	0.14	0.14
query12	0.17	0.14	0.14
query13	0.46	0.47	0.48
query14	1.02	1.00	1.00
query15	0.62	0.59	0.60
query16	0.32	0.33	0.32
query17	1.14	1.12	1.10
query18	0.23	0.22	0.21
query19	1.99	1.96	1.97
query20	0.01	0.01	0.01
query21	15.43	0.23	0.13
query22	4.73	0.06	0.05
query23	16.15	0.32	0.12
query24	2.97	0.45	0.36
query25	0.12	0.05	0.06
query26	0.74	0.20	0.15
query27	0.05	0.04	0.04
query28	3.52	0.89	0.56
query29	12.50	4.24	3.43
query30	0.28	0.15	0.15
query31	2.77	0.60	0.31
query32	3.23	0.59	0.48
query33	3.21	3.20	3.15
query34	15.68	4.20	3.51
query35	3.51	3.49	3.56
query36	0.56	0.44	0.44
query37	0.08	0.06	0.06
query38	0.05	0.04	0.04
query39	0.03	0.03	0.03
query40	0.17	0.16	0.15
query41	0.09	0.03	0.03
query42	0.04	0.03	0.03
query43	0.04	0.04	0.04
Total cold run time: 96.61 s
Total hot run time: 25.26 s

@hello-stephen

Contributor

BE Regression && UT Coverage Report

Increment line coverage 94.19% (81/86) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 74.19% (28496/38407)
Line Coverage 58.07% (310415/534530)
Region Coverage 54.91% (260008/473480)
Branch Coverage 56.20% (112843/200786)

@hello-stephen

Contributor

BE Regression && UT Coverage Report

Increment line coverage 94.19% (81/86) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 74.17% (28487/38407)
Line Coverage 58.03% (310201/534530)
Region Coverage 54.88% (259833/473480)
Branch Coverage 56.15% (112734/200786)

@hello-stephen

Contributor

BE Regression && UT Coverage Report

Increment line coverage 94.19% (81/86) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 74.17% (28488/38407)
Line Coverage 58.03% (310182/534530)
Region Coverage 54.82% (259581/473480)
Branch Coverage 56.13% (112701/200786)

@hello-stephen

Contributor

BE Regression && UT Coverage Report

Increment line coverage 94.19% (81/86) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 74.17% (28488/38407)
Line Coverage 58.04% (310215/534530)
Region Coverage 54.82% (259541/473480)
Branch Coverage 56.14% (112720/200786)

@BiteTheDDDDt

Contributor Author

/review

@github-actions github-actions Bot left a comment

Contributor

Automated review summary for PR #64866.

I found one issue: the changed unit tests do not cover the nonzero recursive-stage local merge path that the production change now relies on for context replacement and stale-stage filtering.

Critical checkpoint conclusions:

  • Goal/test: the production change addresses local runtime-filter merge synchronization and recursive-stage stale filtering. Merger readiness, manager API renaming, and default stage-0 behavior are covered, but nonzero-stage local merge replacement/stale filtering is not covered.
  • Scope/clarity: the patch is focused on BE runtime-filter manager/producer/merger code and matching BE unit-test API updates.
  • Concurrency/lifecycle: I checked local merge registration, publish, send-size, sync-size, global merge reset, PipelineFragmentContext prepare/submit ordering, and recursive CTE WAIT_FOR_DESTROY -> reset_global_rf -> REBUILD -> SUBMIT ordering. The producer-vector synchronization notes overlap the existing inline thread and are not submitted again.
  • Config/compatibility/persistence/storage: no new config, storage format, persistence, or FE-BE protocol field is introduced by this PR.
  • Parallel paths: local merge publish, local size sync, global merge, stale-stage RPC checks, hash join and set sink runtime-filter paths were reviewed.
  • Tests/style: exact-range git diff --check passed locally. I did not run BE UTs because this checkout lacks thirdparty/installed; local clang-format validation was also not run because only clang-format 18 is available and Doris requires v16. Current GitHub status contexts show BE UT and compile success on the head; the failed macOS BE UT check exits early because the runner has Java 25 instead of JDK 17.
  • Observability/security: no security-sensitive behavior is changed; debug output is preserved through LocalMergeContext::debug_string().
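The snapshot approach reviewed under the concurrency checkpoint can be illustrated with a minimal sketch. This is not the Doris implementation: `ToyProducer`, `ToyLocalMergeContext`, and `sync_sizes` are hypothetical names invented for illustration. The sketch only assumes what the PR description states, namely that the context lock is held while copying the producer list and that producers are updated after it is released.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <mutex>
#include <utility>
#include <vector>

// Hypothetical stand-in for a runtime filter producer; not the Doris class.
struct ToyProducer {
    std::mutex mtx;
    uint64_t synced_size = 0;

    void set_synced_size(uint64_t size) {
        std::lock_guard<std::mutex> guard(mtx);
        synced_size = size;
    }

    uint64_t get_synced_size() {
        std::lock_guard<std::mutex> guard(mtx);
        return synced_size;
    }
};

// Hypothetical stand-in for the local merge context.
class ToyLocalMergeContext {
public:
    void add_producer(std::shared_ptr<ToyProducer> producer) {
        std::lock_guard<std::mutex> guard(_mtx);
        _producers.push_back(std::move(producer));
    }

    // Hold the context lock only long enough to copy the producer list.
    std::vector<std::shared_ptr<ToyProducer>> snapshot() {
        std::lock_guard<std::mutex> guard(_mtx);
        return _producers;
    }

private:
    std::mutex _mtx;
    std::vector<std::shared_ptr<ToyProducer>> _producers;
};

// Update every producer after the context lock has been released, so this
// path never holds the context lock and a producer lock at the same time.
inline std::size_t sync_sizes(ToyLocalMergeContext& context, uint64_t size) {
    auto producers = context.snapshot();
    for (auto& producer : producers) {
        producer->set_synced_size(size);
    }
    return producers.size();
}
```

Because `snapshot()` returns shared pointers by value, the producers stay alive for the duration of the loop even if the context is modified concurrently. Whether the real code relies on shared ownership in the same way is an assumption of this sketch.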

Subagent conclusions:

  • optimizer-rewrite: OPT-1 was treated as a duplicate of the existing producer-vector synchronization thread; convergence round 1 returned NO_NEW_VALUABLE_FINDINGS.
  • tests-session-config: TSC-1 was merged as the same duplicate synchronization point; TSC-2 became the single inline test-coverage comment; convergence round 1 returned NO_NEW_VALUABLE_FINDINGS.

User focus: no additional user-provided review focus was supplied.

  // (graceful skip for recursive CTE stage reset).
  EXPECT_TRUE(global_runtime_filter_mgr
- ->get_local_merge_producer_filters(filter_id, &local_merge_filters)
+ ->get_local_merge_context(filter_id, producer_filter->stage(), &context)

Contributor

This test still only covers the default stage-0 path: producer_filter->stage() is never changed, and the PSyncFilterSizeRequest below also keeps protobuf stage 0. The production change now relies on nonzero recursive stages to replace the local merge context and ignore stale-stage lookups/size syncs, while the recursive CTE regression has runtime_filter_mode=off, so it will not exercise this path. Please add a focused BE UT that registers a stage-0 producer, then a stage-1 producer for the same filter, verifies stage-0 get_local_merge_context/sync_filter_size are ignored, and verifies the current stage still syncs normally.
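The scenario requested here can be sketched against a simplified model. Everything in this block is hypothetical: `ToyFilterMgr`, `ToyContext`, and their methods are illustrative stand-ins, not the Doris `RuntimeFilterMgr` API. The replacement rule (a newer stage replaces the context, an older stage is ignored) is an assumption based on the behavior described in this comment. A real BE UT would use the actual manager, producers, and `PSyncFilterSizeRequest`.

```cpp
#include <cassert>
#include <cstdint>
#include <map>

// Hypothetical, simplified model; the real Doris classes and signatures differ.
struct ToyContext {
    uint32_t stage = 0;
    int producer_count = 0;
    uint64_t synced_size = 0;
};

class ToyFilterMgr {
public:
    // A producer from a newer stage replaces the context; an older one is ignored.
    void register_producer(int filter_id, uint32_t stage) {
        auto it = _contexts.find(filter_id);
        if (it == _contexts.end() || stage > it->second.stage) {
            ToyContext fresh;
            fresh.stage = stage;
            fresh.producer_count = 1;
            _contexts[filter_id] = fresh;
        } else if (stage == it->second.stage) {
            ++it->second.producer_count;
        }
    }

    // Returns nullptr for an unknown filter or a stage that is not current.
    const ToyContext* get_context(int filter_id, uint32_t stage) const {
        auto it = _contexts.find(filter_id);
        if (it == _contexts.end() || it->second.stage != stage) {
            return nullptr;
        }
        return &it->second;
    }

    // Applies the size only when the request targets the current stage.
    bool sync_filter_size(int filter_id, uint32_t stage, uint64_t size) {
        auto it = _contexts.find(filter_id);
        if (it == _contexts.end() || it->second.stage != stage) {
            return false;
        }
        it->second.synced_size = size;
        return true;
    }

private:
    std::map<int, ToyContext> _contexts;
};
```

A test following the steps above would register a stage-0 producer and then a stage-1 producer for the same filter id, check that `get_context` and `sync_filter_size` reject stage 0, and check that a stage-1 size sync is still applied.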

Labels

dev/4.0.x dev/4.1.x p0_b usercase Important user case type label
