Describe the bug (描述bug)
Hi, we are using brpc and bthread as our RPC framework and runtime. Our tasks are lightweight: a typical request reads some data from memory and usually finishes in ~10ms.
On a 96-core CPU, we configure 106 bthread workers. When the workload consists of many lightweight requests (about 300K requests per second), pprof shows that do_futex takes about 25% of CPU time. Meanwhile, bthread_worker_usage is only 15-20, bthread_signal_second is high, and bthread_count is about 1600. Some of the pprof output is listed below:
bthread::TaskGroup::end_sched 18.09%
- steal_task 5.61%
- sched_to 12.07%
- ready_to_run 11.85%
- do_futex 11.16% (call futex_wake)
- _raw_spin_unlock_irqrestore 10.42%
and:
bthread::TaskGroup::run_main_task 14.4%
- TaskGroup::wait_task 4.87%
- steal_task 4.63%
- futex_wait 8.42%
According to link, time is attributed to _raw_spin_unlock_irqrestore because interrupts are off. Even so, scheduling takes much more time than we expected.
Our guess is that we create too many lightweight bthreads, and scheduling them signals many TaskGroup workers. After changing bthread_worker to 60 and restarting the server, the scheduling cost dropped a lot. But restarting all machines is troublesome for us. We also think that setting the worker count equal to hardware_concurrency is suitable for all kinds of workloads.
How can we handle this problem? As far as I can tell, bthread can only add workers dynamically (add_worker) and cannot remove spare workers, which would solve this problem easily. Using a bthread pool may help reduce signaling and bthread scheduling, but writing a thread pool on top of fibers is really dirty work.
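To make the pool idea concrete, here is a rough sketch of what we have in mind. It is not bthread code: it uses plain std::thread, and the names QuietPool and run_demo are made up for illustration. The only point is the signaling rule: a producer wakes a worker only when at least one worker is actually parked, so a busy pool takes tasks from the queue without any futex wake.

```cpp
#include <atomic>
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <deque>
#include <functional>
#include <mutex>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical fixed-size pool (not part of brpc/bthread).
// submit() signals only when some worker is parked on the condition variable.
class QuietPool {
public:
    explicit QuietPool(std::size_t workers) {
        assert(workers > 0);
        for (std::size_t i = 0; i < workers; ++i)
            threads_.emplace_back([this] { run(); });
    }

    // Lets workers drain the remaining queue, then joins them.
    ~QuietPool() {
        {
            std::lock_guard<std::mutex> lk(mu_);
            stop_ = true;
        }
        cv_.notify_all();
        for (auto& t : threads_) t.join();
    }

    void submit(std::function<void()> task) {
        bool wake = false;
        {
            std::lock_guard<std::mutex> lk(mu_);
            queue_.push_back(std::move(task));
            // Busy workers re-check the queue before parking, so a wake-up
            // is needed only if somebody is already parked.
            if (parked_ > 0) {
                wake = true;
                ++signals_;
            }
        }
        if (wake) cv_.notify_one();
    }

    // Number of notify_one calls issued by submit() so far.
    std::size_t signals() const {
        std::lock_guard<std::mutex> lk(mu_);
        return signals_;
    }

private:
    void run() {
        std::unique_lock<std::mutex> lk(mu_);
        for (;;) {
            while (!stop_ && queue_.empty()) {
                ++parked_;
                cv_.wait(lk);
                --parked_;
            }
            if (queue_.empty()) return;  // stopping and nothing left to run
            std::function<void()> task = std::move(queue_.front());
            queue_.pop_front();
            lk.unlock();
            task();
            lk.lock();
        }
    }

    mutable std::mutex mu_;
    std::condition_variable cv_;
    std::deque<std::function<void()>> queue_;
    std::size_t parked_ = 0;
    std::size_t signals_ = 0;
    bool stop_ = false;
    std::vector<std::thread> threads_;  // declared last: workers use the members above
};

struct DemoResult {
    std::size_t executed;
    std::size_t submitted;
    std::size_t signals;
};

// Submits `tasks` trivial jobs and reports how many wake-ups were issued.
DemoResult run_demo(std::size_t workers, std::size_t tasks) {
    std::atomic<std::size_t> executed{0};
    std::size_t signals = 0;
    {
        QuietPool pool(workers);
        for (std::size_t i = 0; i < tasks; ++i)
            pool.submit([&executed] { executed.fetch_add(1, std::memory_order_relaxed); });
        signals = pool.signals();
    }  // pool destructor drains the queue and joins the workers
    return {executed.load(), tasks, signals};
}
```

Calling run_demo(4, 10000) runs every task, and the number of signals can never exceed the number of submitted tasks. How many wake-ups are actually saved depends on timing and load, so we make no claim about the ratio. Doing the same on top of bthread would need its own synchronization primitives, which is the part we would rather not write ourselves.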
To Reproduce (复现方法)
Expected behavior (期望行为)
bthread should be able to reduce the number of workers at runtime, or spend less time in do_futex when there are many lightweight tasks.
Versions (各种版本)
OS: Linux 5.4
Compiler: g++ 830
brpc: 0.9.6
protobuf: We use thrift 0.9
Additional context/screenshots (更多上下文/截图)