Suggest for performance fix: KAFKA-9693 Kafka latency spikes caused by log segment flush on roll #13768

Closed
novosibman wants to merge 1 commit into apache:3.4 from novosibman:3.4
Conversation

@novosibman

Related issue https://issues.apache.org/jira/browse/KAFKA-9693

The issue with repeating latency spikes during Kafka log segment rolling is still reproducible on the latest versions, including kafka_2.13-3.4.0.

It was found that flushing the Kafka producer state snapshot file during segment rolling blocks the producer request handling thread for some time:
https://github.com/apache/kafka/blob/3.4/core/src/main/scala/kafka/log/ProducerStateManager.scala#L452

```scala
  private def writeSnapshot(file: File, entries: mutable.Map[Long, ProducerStateEntry]): Unit = {
    ...
    val fileChannel = FileChannel.open(file.toPath, StandardOpenOption.CREATE, StandardOpenOption.WRITE)
    try {
      fileChannel.write(buffer)
      fileChannel.force(true)   // <- blocking fsync happens here
    } finally {
      fileChannel.close()
    }
    ...
  }
```
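To see why this call hurts tail latency, note that `force(true)` does not return until both the data and the file metadata have reached the storage device, so the calling request thread stalls for the full fsync duration. A minimal standalone illustration (not Kafka code; the class and method names are hypothetical) that separates the cost of the buffered write from the cost of the fsync:

```java
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical sketch: time how long a synchronous force(true) blocks the
// calling thread compared to the buffered write itself. On a busy disk the
// force() call typically dominates, which is the stall the PR describes.
public class FsyncCost {
    /** Returns {writeMicros, forceMicros} for one write + fsync of `size` bytes. */
    static long[] timedWriteAndForce(Path file, int size) throws Exception {
        ByteBuffer buffer = ByteBuffer.wrap(new byte[size]);
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.WRITE)) {
            long t0 = System.nanoTime();
            while (buffer.hasRemaining()) ch.write(buffer); // lands in page cache
            long t1 = System.nanoTime();
            ch.force(true); // blocks until data and metadata reach the device
            long t2 = System.nanoTime();
            return new long[] {(t1 - t0) / 1000, (t2 - t1) / 1000};
        }
    }

    public static void main(String[] args) throws Exception {
        Path file = Files.createTempFile("snapshot", ".tmp");
        long[] micros = timedWriteAndForce(file, 64 * 1024);
        System.out.printf("write: %d us, force: %d us%n", micros[0], micros[1]);
        Files.deleteIfExists(file);
    }
}
```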

The more partitions there are, the greater the cumulative latency effect observed.

The suggested fix offloads the flush (fileChannel.force) operation to a background thread, similar to (but not exactly the same as) how it is done in UnifiedLog.scala:

  def roll(
   ...
    // Schedule an asynchronous flush of the old segment
    scheduler.schedule("flush-log", () => flushUptoOffsetExclusive(newSegment.baseOffset))
  }
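Since trunk would need the same change in ProducerStateManager.java, the offload pattern can be sketched in Java as follows. This is a hypothetical illustration, not the actual patch: the class name, the single-thread executor, and the error handling are assumptions standing in for Kafka's own scheduler and log-dir failure handling.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch of the suggested pattern: the snapshot bytes are still
// written on the request thread (cheap, hits the page cache), but the blocking
// force() is handed to a background thread, mirroring how UnifiedLog.roll
// schedules its "flush-log" task.
public class AsyncSnapshotWriter {
    private final ExecutorService flushScheduler =
        Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "snapshot-flush");
            t.setDaemon(true);
            return t;
        });

    public void writeSnapshot(Path file, ByteBuffer buffer) throws IOException {
        FileChannel channel = FileChannel.open(file,
            StandardOpenOption.CREATE, StandardOpenOption.WRITE);
        boolean scheduled = false;
        try {
            while (buffer.hasRemaining()) {
                channel.write(buffer);
            }
            // Only the durability barrier leaves the request thread; the
            // background task closes the channel after force() completes.
            flushScheduler.submit(() -> {
                try (FileChannel c = channel) {
                    c.force(true);
                } catch (IOException e) {
                    // A real broker must treat this as a log dir failure.
                    throw new UncheckedIOException(e);
                }
            });
            scheduled = true;
        } finally {
            if (!scheduled) channel.close();
        }
    }

    /** Drains pending flushes; call on shutdown. */
    public void close() throws InterruptedException {
        flushScheduler.shutdown();
        flushScheduler.awaitTermination(10, TimeUnit.SECONDS);
    }
}
```

One design consequence worth noting: with this pattern the snapshot is no longer guaranteed durable at the moment writeSnapshot returns, so crash-recovery code must tolerate a missing or torn snapshot for the newest segment, which Kafka's recovery path already does by rebuilding producer state from the log.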

Benchmarking with this fix shows a significant reduction in the repeating latency spikes:

test config:
- AWS
- 3 node cluster (i3en.2xlarge)
- zulu11.62.17-ca-jdk11.0.18-linux_x64, heap 6G per broker
- 1 loadgen (m5n.8xlarge) - OpenMessaging benchmark (OMB)
- 1 zookeeper (t2.small)
- acks=all batchSize=1048510 consumers=4 insyncReplicas=2 lingerMs=1 mlen=1024 producers=4 rf=3 subscriptions=1 targetRate=200k time=12m topics=1 warmup=1m

variation 1:

partitions=10

| metric | kafka_2.13-3.4.0 | kafka_2.13-3.4.0 patched |
| --- | --- | --- |
| endToEnd service_time (ms) p50 max | 2.00 | 2.00 |
| endToEnd service_time (ms) p75 max | 3.00 | 2.00 |
| endToEnd service_time (ms) p95 max | 94.0 | 3.00 |
| endToEnd service_time (ms) p99 max | 290 | 6.00 |
| endToEnd service_time (ms) p99.9 max | 355 | 21.0 |
| endToEnd service_time (ms) p99.99 max | 372 | 34.0 |
| endToEnd service_time (ms) p100 max | 374 | 36.0 |
| publish service_time (ms) p50 max | 1.70 | 1.67 |
| publish service_time (ms) p75 max | 2.23 | 2.09 |
| publish service_time (ms) p95 max | 90.7 | 2.82 |
| publish service_time (ms) p99 max | 287 | 4.69 |
| publish service_time (ms) p99.9 max | 353 | 19.6 |
| publish service_time (ms) p99.99 max | 369 | 31.3 |
| publish service_time (ms) p100 max | 371 | 33.5 |
[charts: kafka endToEnd latency, kafka_2.13-3.4.0 vs kafka_2.13-3.4.0 patched]

Latency improved by up to 10x in the high percentiles ^^^; the spikes are almost invisible.

variation 2:

partitions=100

| metric | kafka_2.13-3.4.0 | kafka_2.13-3.4.0 patched |
| --- | --- | --- |
| endToEnd service_time (ms) p50 max | 91.0 | 2.00 |
| endToEnd service_time (ms) p75 max | 358 | 3.00 |
| endToEnd service_time (ms) p95 max | 1814 | 4.00 |
| endToEnd service_time (ms) p99 max | 2777 | 21.0 |
| endToEnd service_time (ms) p99.9 max | 3643 | 119 |
| endToEnd service_time (ms) p99.99 max | 3724 | 141 |
| endToEnd service_time (ms) p100 max | 3726 | 143 |
| publish service_time (ms) p50 max | 77.4 | 1.92 |
| publish service_time (ms) p75 max | 352 | 2.35 |
| publish service_time (ms) p95 max | 1748 | 3.80 |
| publish service_time (ms) p99 max | 2740 | 18.9 |
| publish service_time (ms) p99.9 max | 3619 | 116 |
| publish service_time (ms) p99.99 max | 3720 | 139 |
| publish service_time (ms) p100 max | 3722 | 141 |
[charts: endToEnd service_time, kafka_2.13-3.4.0 vs kafka_2.13-3.4.0 patched]

Latency improved by up to 25x in the high percentiles ^^^

The fix was done for the 3.4 branch (the Scala version of ProducerStateManager). Trunk needs a corresponding fix in ProducerStateManager.java.

@ijuma
Member

ijuma commented May 26, 2023

We typically make changes to master first. Would you be willing to submit a PR for that instead?

@jolshan
Member

jolshan commented May 27, 2023

Thanks for the PR! This looks promising. As Ismael said, let's share in trunk first.

@zhangsm6

@novosibman What is the file system used in your test? OMB default should be xfs but wondering if it was changed

@novosibman
Author

novosibman commented May 30, 2023

> @novosibman What is the file system used in your test? OMB default should be xfs but wondering if it was changed

Yes, xfs was used; this fs type was set in the OMB settings.
Older experiments showed results about 10x worse in the high percentiles when using ext4:

[chart: x-axis is the timeline in minutes; the test is quite short and does not include Kafka log segment rolling]

Reference test configuration used:
https://developer.confluent.io/learn/kafka-performance/

@novosibman
Author

> We typically make changes to master first. Would you be willing to submit a PR for that instead?

Prepared and tested the trunk version: #13782

@github-actions

This PR is being marked as stale since it has not had any activity in 90 days. If you would like to keep this PR alive, please ask a committer for review. If the PR has merge conflicts, please update it with the latest from trunk (or the appropriate release branch).

If this PR is no longer valid or desired, please feel free to close it. If no activity occurs in the next 30 days, it will be automatically closed.

@github-actions github-actions bot added the stale Stale PRs label Aug 29, 2023
@divijvaidya
Member

This PR is fixed in trunk (scheduled for release in 3.7.0). Currently there are no plans of backporting this to earlier versions since this is a performance optimization and not a critical bug fix. I am closing this PR, please feel free to reopen if you think we still need to backport this.

@divijvaidya divijvaidya closed this Feb 2, 2024