[fix](be) Avoid local runtime filter merge deadlock#64866
Conversation
### What problem does this PR solve?
Issue Number: None
Related PR: None
Problem Summary: Local runtime filter merge can deadlock when one join build instance publishes a local-merge runtime filter while another instance sends its runtime filter size. The old local merge context lock protected both the merger and the producer list, so one path could hold a producer runtime filter lock and then wait for the context lock while another path held the context lock and then waited for a producer lock.
This change gives RuntimeFilterMerger its own internal synchronization and makes LocalMergeContext expose a snapshot of the merger and producers. Publish, send-size, and sync-size paths take the context lock only while copying that snapshot, then merge filters or update producer sizes outside the context lock. RuntimeFilterMerger returns the ready transition from merge_from directly, removing the separate unlocked ready check.
### Release note
None
### Check List (For Author)
- Test: Unit Test
- build-support/clang-format.sh be/src/exec/runtime_filter/runtime_filter_merger.h be/src/exec/runtime_filter/runtime_filter_mgr.cpp be/src/exec/runtime_filter/runtime_filter_mgr.h be/src/exec/runtime_filter/runtime_filter_producer.cpp be/test/exec/runtime_filter/runtime_filter_merger_test.cpp be/test/exec/runtime_filter/runtime_filter_mgr_test.cpp
- git diff --cached --check
- ./run-be-ut.sh --run --filter=RuntimeFilterMgrTest.*
- ./run-be-ut.sh --run --filter=RuntimeFilterMergerTest.*
- Behavior changed: No
- Does this need documentation: No
|
Thank you for your contribution to Apache Doris. Please clearly describe your PR:
|
|
/review |
|
run buildall |
There was a problem hiding this comment.
Automated review summary for PR #64866
I reviewed the runtime-filter merger/producer/manager changes and the adjacent runtime-filter tests. I did not find a substantiated issue that should be raised as an inline review comment.
Critical checkpoints:
- Goal/test: the change targets the local runtime-filter merge deadlock by moving producer/merger work out of
LocalMergeContextlocking and synchronizingRuntimeFilterMergerinternally. The updated BE UTs cover the adjusted merger readiness and manager API behavior. I could not run BE UTs locally becausethirdparty/installedis missing in this checkout. - Scope/clarity: the patch is focused on runtime-filter locking, snapshots, and related unit-test updates.
- Concurrency/lifecycle: I checked producer
publish()/send_size(), local snapshots, merger locking, global merge reset, and recursive CTE reset/rebuild sequencing. I did not find a new lock-order cycle or a substantiated stale-stage publication issue. - Config/compatibility: no new config was added; the runtime-filter stage fields used on RPCs are existing optional protobuf fields.
- Parallel paths: local-only, local-merge, remote merge, size-sync, stale recursive-CTE RPC handling, and local consumer signaling paths were reviewed.
- Tests/style:
git diff --checkpassed, andbuild-support/check-format.shpassed whenclang-format16 was selected inPATH. BE UT execution was not possible locally because the checkout lacksthirdparty/installed.
Subagent conclusions:
optimizer-rewriteproposed OPT-1 about stale recursive-CTE local publication. I dismissed it with lifecycle evidence:WAIT_FOR_DESTROYcompletes old PFC teardown beforeREBUILDregisters new-stage consumers/producers, while remote stale messages are stage-checked.tests-session-configreported no candidates.- Convergence round 1 ended with both live subagents reporting
NO_NEW_VALUABLE_FINDINGSfor the final no-inline-comment set.
User focus: no additional user-provided review focus was supplied.
TPC-H: Total hot run time: 28845 ms |
TPC-DS: Total hot run time: 173691 ms |
ClickBench: Total hot run time: 25.24 s |
BE Regression && UT Coverage ReportIncrement line coverage Increment coverage report
|
|
run buildall |
|
/review |
There was a problem hiding this comment.
Automated review summary for PR #64866.
I reviewed the six changed runtime-filter files and the relevant producer/helper, manager, pipeline prepare/submit, recursive CTE rerun, and sync/publish paths. I did not find a substantiated issue requiring an inline review comment.
Critical checkpoint conclusions:
- Goal: the PR targets a local runtime filter merge deadlock by moving merge/size work out from the old context lock and giving RuntimeFilterMerger its own synchronization. The changed
merge_from(..., &ready), size aggregation, shared context lookup, and producer publish/send-size call sites are consistent with that goal. - Scope and focus: the change is confined to BE runtime-filter manager/producer/merger code and matching BE unit-test API updates.
- Concurrency/lifecycle: I specifically checked producer registration, send-size, publish, global merge, and recursive CTE stage reset ordering. Producer registration happens during task prepare before fragment submit; parallel prepare waits for all instances; recursive CTE runs WAIT_FOR_DESTROY -> reset_global_rf -> REBUILD/prepare -> SUBMIT; stale-stage requests are filtered through stage checks. I did not find an executable same-stage registration/read race in the reviewed paths.
- Configuration, protocol, storage, persistence: no new config, thrift/protobuf field, storage format, or persistence behavior is introduced.
- Parallel paths: local merge publish, local size sync, and global merge-controller merge paths were updated to use the new ready transition. I did not find a missed parallel runtime-filter path in the changed surface.
- Tests: the PR updates BE unit tests for the new
merge_fromready result and local merge context API. I could not run BE UT in this checkout becausethirdparty/installedis absent. I did rungit diff --checkon the exact PR range, and it passed. - Observability/security: no security-sensitive path is changed; debug output is preserved through the new
LocalMergeContext::debug_string()path. - User focus: no additional user-provided review focus was supplied.
Subagent conclusions:
optimizer-rewrite: no optimizer/rewrite or runtime-filter semantic regression found.tests-session-config: raised TEST-1 about unsynchronizedLocalMergeContext::producersreads. I dismissed it with lifecycle evidence above; it remains a useful concurrent-regression test gap but not a substantiated inline issue.- Convergence round 1 ended with both live subagents reporting
NO_NEW_VALUABLE_FINDINGSfor the same ledger and empty proposed inline comment set.
TPC-H: Total hot run time: 29309 ms |
TPC-DS: Total hot run time: 171988 ms |
ClickBench: Total hot run time: 25.22 s |
BE UT Coverage ReportIncrement line coverage Increment coverage report
|
|
run buildall |
TPC-H: Total hot run time: 29536 ms |
TPC-DS: Total hot run time: 171619 ms |
ClickBench: Total hot run time: 25.3 s |
BE UT Coverage ReportIncrement line coverage Increment coverage report
|
BE Regression && UT Coverage ReportIncrement line coverage Increment coverage report
|
|
run buildall |
TPC-H: Total hot run time: 29130 ms |
TPC-DS: Total hot run time: 172406 ms |
ClickBench: Total hot run time: 25.26 s |
BE Regression && UT Coverage ReportIncrement line coverage Increment coverage report
|
BE Regression && UT Coverage ReportIncrement line coverage Increment coverage report
|
1 similar comment
BE Regression && UT Coverage ReportIncrement line coverage Increment coverage report
|
BE Regression && UT Coverage ReportIncrement line coverage Increment coverage report
|
BE Regression && UT Coverage ReportIncrement line coverage Increment coverage report
|
What problem does this PR solve?
Issue Number: None
Related PR: None
Problem Summary: Local runtime filter merge can deadlock when one join build instance publishes a local-merge runtime filter while another instance sends its runtime filter size. The old local merge context lock protected both the merger and the producer list, so one path could hold a producer runtime filter lock and then wait for the context lock while another path held the context lock and then waited for a producer lock.
This change gives RuntimeFilterMerger its own internal synchronization and makes LocalMergeContext expose a snapshot of the merger and producers. Publish, send-size, and sync-size paths take the context lock only while copying that snapshot, then merge filters or update producer sizes outside the context lock. RuntimeFilterMerger returns the ready transition from merge_from directly, removing the separate unlocked ready check.
Release note
None
Check List (For Author)