KAFKA-9693: Kafka latency spikes caused by log segment flush on roll #13782
novosibman wants to merge 11 commits into apache:trunk
Conversation
…y log segment flush on roll - trunk version
Hangleton
left a comment
Many thanks for the patch and the collected data! Really interesting to see the impact of this change. A few questions:
- What storage device and file system are used in the test?
- Would you have a real-life workload where the impact of this change can be quantified? The workload generated by producer-perf-test.sh exhibits the problem the most because the segments of all replicas on the brokers start rolling at the same time, which is why it is also interesting to assess the impact using topic-partitions which have different ingress rates and/or use segments of different sizes.
The AWS config used was i3en.2xlarge with 2 x 2500 NVMe SSDs. The FS format had a huge impact on results. Initially we used ext4 in our lab for regular testing:
We have no real-life workload scenarios available for Kafka perf testing. The alternative workload https://github.com/AzulSystems/kafka-benchmark has slightly different rolling behavior compared to OMB. OMB results example on the released kafka_2.13-3.4.0 version (using xfs), with the same params used: acks=1 batchSize=1048510 consumers=4 lingerMs=1 mlen=1024 partitions=100 producers=4 rf=1 targetRate=200k time=30m topics=1
  // we manually override the state offset here prior to taking the snapshot.
  producerStateManager.updateMapEndOffset(newSegment.baseOffset)
- producerStateManager.takeSnapshot()
+ producerStateManager.takeSnapshot(scheduler)
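The idea behind passing the scheduler into takeSnapshot can be sketched as follows. This is a hypothetical illustration, not Kafka's actual classes: the request-handler thread writes the snapshot bytes (fast, page cache only) and hands the still-open channel to a background executor, which performs the blocking force() and close() off the hot path.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch of offloading a snapshot flush to a background thread.
public class AsyncSnapshotFlush {
    private final ExecutorService scheduler = Executors.newSingleThreadExecutor();

    public void takeSnapshot(Path snapshotFile, byte[] snapshotBytes) throws IOException {
        FileChannel channel = FileChannel.open(snapshotFile,
                StandardOpenOption.CREATE, StandardOpenOption.WRITE);
        channel.write(ByteBuffer.wrap(snapshotBytes)); // fast: hits the page cache only
        scheduler.submit(() -> {
            try (FileChannel ch = channel) {
                ch.force(true); // blocking fsync now runs on the background thread
            } catch (IOException e) {
                e.printStackTrace(); // a real broker would trigger log-dir failure handling
            }
        });
    }

    public void close() throws InterruptedException {
        scheduler.shutdown();
        scheduler.awaitTermination(10, TimeUnit.SECONDS);
    }
}
```

Note that, as the PR description says, the channel must stay open when handed to the other thread, so no try-with-resources on the writing side.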
I am wondering if we need to increase the default number of background threads since we are adding more responsibility to them. Thoughts?
Thanks for the reply. In the tests we conducted in KAFKA-9693, an NVMe SSD (one log directory) and ext4 were used along with jbd2, which likely penalized performance. Are all the graphs shared for OMB and Kafka Tussle generated for Kafka with the fix in this PR?
Graphs with the fix are noted in the first description comment. Other graphs in the latter comment are examples of how rolling affects results on different configurations and benchmarks using a regular Kafka release.
… updated change according to feedback
Provided updated change:
divijvaidya
left a comment
Thank you for making the changes. The changes look good and I left some minor comments.
Since we are adding new responsibility to the scheduler threads, I think we should probably advise users to reconfigure the number of background threads [2]. Could you please add a note to the upgrade section [1] about considering an increase in background threads?
[1]
Line 24 in 6678f1b
[2] https://kafka.apache.org/documentation.html#brokerconfigs_background.threads
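A minimal server.properties fragment illustrating the suggestion (the value 12 is purely illustrative; the shipped default for background.threads is 10):

```properties
# server.properties -- illustrative only
# background.threads defaults to 10; with snapshot flushes offloaded to the
# scheduler, a modest increase may be worth evaluating under load.
background.threads=12
```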
showuon
left a comment
Thanks for the PR. LGTM, left a minor comment.
Also, there are compile errors; please fix them. Thanks.
… style check error corrected, open/close operations reduced for scheduler == null case
Open/close changes provided.
Hey @novosibman, could you please respond to the rest of the comments at #13782 (review) and #13782 (comment)?
showuon
left a comment
LGTM! Thanks for the patch!
Some minor comments:
- The PR title should start with KAFKA-9693, e.g. KAFKA-9693: Kafka latency spikes caused by log segment flush on roll
- Notable change for v3.6: please remember to update it. If you have any problem please let us know. Of course it can be a follow-up PR if you want. You can refer to this PR change.
- @divijvaidya, do you think we should backport this patch to the 3.5 branch? This has existed for a long time and should not be a regression bug. Maybe no? Thoughts?
@showuon - with this change we don't have consistent data in the different flushed files on disk (since earlier they were flushed together but now it's done async). I want to ensure that this inconsistency is OK and that recovery will not get hampered by it. Please wait for my review before merging this.
This PR is being marked as stale since it has not had any activity in 90 days. If you would like to keep this PR alive, please ask a committer for review. If the PR has merge conflicts, please update it with the latest from trunk (or the appropriate release branch). If this PR is no longer valid or desired, please feel free to close it. If no activity occurs in the next 30 days, it will be automatically closed.
@novosibman @ocadaruma do we still need this change after https://github.com/apache/kafka/pull/14242/files? Asking because with the latter PR merged in, we are not blocking the request handler thread while flushing the producer snapshot. This is the same as what this PR is trying to achieve. Hence, I think this could be closed.
@divijvaidya Yeah, my understanding is the same.
Thanks for checking @ocadaruma. I am going to close this PR; please feel free to re-open if you think this is still not fixed.




Trunk version of initial change: #13768 in branch "3.4"
Key difference from the branched change:
Passed and used the existing scheduler, which is already being used for flushing large segment logs and indices. In all cases the snapshot's fileChannel is kept open when passed to other threads for flushing and closing (so the try-with-resources is removed in this change).
Related issue https://issues.apache.org/jira/browse/KAFKA-9693
The issue with repeating latency spikes during Kafka log segment rolling is still reproduced on the latest versions, including kafka_2.13-3.4.0.
It was found that flushing the Kafka snapshot file during segment rolling blocks the producer request handling thread for some time. The latency improvement was reproduced in kafka_2.13-3.6.0-snapshot by offloading the flush operation. The single-node test configuration available on my side was used:
kafka_2.13-3.6.0-snapshot - trunk version
kafka_2.13-3.6.0-snapshot-fix - trunk version with provided change
Test time increased to 1hr.
partitions=10 # rolling every ~52 seconds

partitions=100 # rolling events about every 8.5 minutes:
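The roll cadence follows from the workload parameters. A back-of-envelope check, assuming the default 1 GiB log.segment.bytes and the targetRate=200k, mlen=1024 figures from the test above (the assumed segment size is not stated in the thread):

```java
// Back-of-envelope check of how often each partition's active segment rolls.
// SEGMENT_BYTES is an assumption (Kafka's default log.segment.bytes, 1 GiB);
// TARGET_RATE and MSG_LEN come from the benchmark parameters above.
public class RollInterval {
    static final long SEGMENT_BYTES = 1L << 30; // 1 GiB, assumed default
    static final long TARGET_RATE = 200_000;    // messages/s across the topic
    static final long MSG_LEN = 1024;           // bytes per message

    static double rollIntervalSeconds(int partitions) {
        double perPartitionBytesPerSec = (double) TARGET_RATE * MSG_LEN / partitions;
        return SEGMENT_BYTES / perPartitionBytesPerSec;
    }

    public static void main(String[] args) {
        System.out.printf("10 partitions:  %.0f s%n", rollIntervalSeconds(10));  // ~52 s
        System.out.printf("100 partitions: %.0f s%n", rollIntervalSeconds(100)); // ~524 s (~8.7 min)
    }
}
```

Under these assumptions the arithmetic matches the observed cadence: ~52 s at 10 partitions and roughly 8.5-8.7 minutes at 100 partitions.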
