Suggest for performance fix: KAFKA-9693 Kafka latency spikes caused by log segment flush on roll #13768

novosibman wants to merge 1 commit into apache:3.4
Conversation
We typically make changes to master first. Would you be willing to submit a PR for that instead?

Thanks for the PR! This looks promising. As Ismael said, let's share in trunk first.

@novosibman What is the file system used in your test? OMB default should be

Yes, the reference test configuration was used:

Prepared and tested trunk version: #13782

This PR is being marked as stale since it has not had any activity in 90 days. If you would like to keep this PR alive, please ask a committer for review. If the PR has merge conflicts, please update it with the latest from trunk (or the appropriate release branch). If this PR is no longer valid or desired, please feel free to close it. If no activity occurs in the next 30 days, it will be automatically closed.

This fix is in trunk (scheduled for release in 3.7.0). Currently there are no plans to backport it to earlier versions, since this is a performance optimization and not a critical bug fix. I am closing this PR; please feel free to reopen it if you think we still need to backport it.

Related issue https://issues.apache.org/jira/browse/KAFKA-9693
The issue of repeating latency spikes during Kafka log segment rolling is still reproducible on recent versions, including kafka_2.13-3.4.0.
It was found that flushing the Kafka producer snapshot file during segment rolling blocks the producer request handling thread for some time:
https://github.com/apache/kafka/blob/3.4/core/src/main/scala/kafka/log/ProducerStateManager.scala#L452
The more partitions, the greater the cumulative latency effect observed.
The suggested fix offloads the flush (fileChannel.force) operation to a background thread, similar to (but not exactly matching) how it is done in UnifiedLog.scala:
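A minimal sketch of the idea (not the actual patch; the class and method names `AsyncFlushSketch` and `flushAsync` are illustrative): the blocking `fileChannel.force`-style call is handed off to a single background thread, so the request handler thread returns without waiting on disk I/O.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class AsyncFlushSketch {
    // Single daemon thread used to run blocking flushes off the request path.
    private static final ExecutorService flushExecutor =
            Executors.newSingleThreadExecutor(r -> {
                Thread t = new Thread(r, "producer-snapshot-flush");
                t.setDaemon(true);
                return t;
            });

    // Instead of calling fileChannel.force(true) inline (which blocks the
    // producer request handler), submit the flush to the background executor.
    static Future<?> flushAsync(Runnable blockingFlush) {
        return flushExecutor.submit(blockingFlush);
    }

    public static void main(String[] args) throws Exception {
        Future<?> pending = flushAsync(() ->
                System.out.println("snapshot flushed in background"));
        pending.get(); // only for the demo; the hot path would not wait here
        flushExecutor.shutdown();
    }
}
```

Using a single-threaded executor preserves the ordering of flushes while keeping the expensive fsync off the latency-sensitive thread; error handling and shutdown semantics of the real patch are omitted here.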
Benchmarking with this fix shows a significant reduction in the repeating latency spikes:
test config:
AWS
3 node cluster (i3en.2xlarge)
zulu11.62.17-ca-jdk11.0.18-linux_x64, heap 6G per broker
1 loadgen (m5n.8xlarge) - OpenMessaging benchmark (OMB)
1 zookeeper (t2.small)
acks=all batchSize=1048510 consumers=4 insyncReplicas=2 lingerMs=1 mlen=1024 producers=4 rf=3 subscriptions=1 targetRate=200k time=12m topics=1 warmup=1m
variation 1:
partitions=10
latency improved up to 10x in high percentiles; spikes almost invisible
variation 2:
partitions=100
latency improved up to 25x in high percentiles
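For reference, the parameters above roughly correspond to an OMB workload file along these lines (a sketch only; field names follow OMB's workload YAML format as I understand it, values are taken from the config line above with `partitionsPerTopic` from variation 1, and broker-side settings such as acks, linger, and min.insync.replicas would live in the OMB driver configuration instead):

```yaml
name: kafka-flush-on-roll
topics: 1
partitionsPerTopic: 10
messageSize: 1024
subscriptionsPerTopic: 1
consumerPerSubscription: 4
producersPerTopic: 4
producerRate: 200000
testDurationMinutes: 12
warmupDurationMinutes: 1
```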
The fix was made on the 3.4 branch (the Scala version of ProducerStateManager); trunk needs a corresponding fix in ProducerStateManager.java.