Enhancement
Under high concurrency we observe the runtime filter worker queue backing up considerably — we've seen runtime_filter_event_queue_len exceed 1.1M events and runtime_filter_bytes_in_queue exceed 60 GiB on a single CN, growing unboundedly until we intervene.
The drain is single-threaded (RuntimeFilterWorker::execute, one std::thread) and the hot event type RECEIVE_TOTAL_RF is dominated by synchronous RPC fan-out. _receive_total_runtime_filter (be/src/runtime/runtime_filter_worker.cpp:957) issues forward RPCs for the broadcast tree, then BatchClosuresJoinAndClean's destructor (line 533) joins every closure via bthread_id_join before the function returns — so the drain thread cannot pull the next event until every forward completes. Stack sample from a stalled worker:
```
#1 bthread::wait_pthread
#2 bthread::butex_wait
#3 bthread_id_join
#4 BatchClosuresJoinAndClean::~BatchClosuresJoinAndClean
#5 RuntimeFilterWorker::_receive_total_runtime_filter
#6 RuntimeFilterWorker::execute()
```
The thread was 85% in S state with 23.9M voluntary context switches while 60+ GiB of events sat queued. transmit_runtime_filter_concurrency was 0 during the stall — outbound RPCs complete quickly (p99 = 2.6 ms) but the worker spends its time parked in bthread_id_join between events rather than draining the queue.
There is no lever to parallelize the drain or to make the forward step asynchronous. runtime_filter_queue_limit caps queue size but does not address throughput. Our only effective mitigation has been `SET GLOBAL enable_global_runtime_filter = false`, which we'd like to avoid.