KAFKA-19571: Race condition between log segment flush and file deletion causing log dir to go offline (#20289)

Merged: mimaison merged 4 commits into apache:trunk from criteo-forks:KAFKA-19571 on Jan 29, 2026
Conversation

@itoumlilt (Contributor) commented Aug 1, 2025

Following JIRA Ticket: https://issues.apache.org/jira/browse/KAFKA-19571

A race condition can occur during replica rebalancing where a log
segment's file is deleted after an asynchronous flush has been scheduled
but before it executes.

This would previously cause an unhandled ClosedChannelException,
leading the ReplicaManager to mark the entire log directory as
offline.

The fix involves catching the ClosedChannelException within the
LogSegment.flush() method and suppressing it only if the underlying
log file no longer exists, which is the specific symptom of this race
condition. Legitimate I/O errors on existing files will still be thrown.

A unit test has been added to LogSegmentTest to verify both the fix and
the case where the exception should still be thrown.
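The catch-and-suppress pattern described above can be sketched as a small standalone program. This is not the actual Kafka code; the class and method names (`FlushSketch`, `flushQuietly`, `demo`) are illustrative, and the demo simply reproduces the race by closing the channel and deleting the file before flushing:

```java
import java.io.File;
import java.io.IOException;
import java.nio.channels.ClosedChannelException;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class FlushSketch {
    // Flush the channel; suppress ClosedChannelException only when the
    // backing file no longer exists, i.e. the race with deletion occurred.
    static boolean flushQuietly(FileChannel channel, File logFile) throws IOException {
        try {
            channel.force(true);
            return true;
        } catch (ClosedChannelException e) {
            if (logFile.exists()) {
                throw e; // channel closed but file still present: a real I/O error
            }
            return false; // file already deleted by the rebalance: swallow
        }
    }

    // Reproduce the race: close the channel and delete the file, then flush.
    static boolean demo() throws IOException {
        Path file = Files.createTempFile("segment", ".log");
        FileChannel ch = FileChannel.open(file, StandardOpenOption.WRITE);
        ch.close();
        Files.delete(file);
        return flushQuietly(ch, file.toFile());
    }

    public static void main(String[] args) throws IOException {
        System.out.println("flush suppressed: " + !demo());
    }
}
```

Note this was the first iteration of the fix; the review below led to a different approach in LogManager.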

@github-actions bot added labels: triage (PRs from the community), storage (Pull requests that target the storage module), small (Small PRs) — Aug 1, 2025
@itoumlilt (Contributor, Author):

#11438 was fixed to swallow the first NoSuchFileException WARN in the stacktrace above, but not the underlying exception.
#14280 is similar but distinct: it swallows NoSuchFileException for the race condition on a log directory move/delete, but not at the segment file level.

@github-actions bot commented Aug 9, 2025

A label of 'needs-attention' was automatically added to this PR in order to raise the
attention of the committers. Once this issue has been triaged, the triage label
should be removed to prevent this automation from happening again.

@junrao (Contributor) left a comment

@itoumlilt : Thanks for identifying the issue and submitting a PR. Left a comment.

}
});
} catch (ClosedChannelException e) {
if (!log.file().exists()) {
@junrao (Contributor):

In replaceCurrentWithFutureLog(), we rename the log dir to deleted, close the channel and schedule it to be deleted asynchronously. If we get here, it's possible that the renamed file still exists.

Also, while this fixes the issue with flush, the issue with closed channel could be exposed through read too and therefore cause the same issue of forcing the log directory to be offline.

An alternative is to avoid closing srcLog in replaceCurrentWithFutureLog(). The segments will be closed when the log is deleted after a delay. By that time, the expectation is that there won't be any pending flushes, reads, etc on the log. This approach is also consistent with the approach in LogManager.asyncDelete().
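The alternative relies on POSIX semantics: a file descriptor tracks the inode, not the path, so a channel opened before its parent directory is renamed keeps working. A minimal standalone demo (not Kafka code; `RenameDemo` and `flushAfterRename` are illustrative names, and the behaviour shown is Unix-specific — on Windows the rename would fail with the handle open):

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class RenameDemo {
    static long flushAfterRename() throws IOException {
        Path base = Files.createTempDirectory("logdir");
        Path dir = Files.createDirectory(base.resolve("topic-0"));
        Path file = Files.createFile(dir.resolve("00000000.log"));
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.WRITE)) {
            // Simulate the source log directory being renamed for deletion.
            Files.move(dir, base.resolve("topic-0-delete"));
            ch.write(ByteBuffer.wrap("pending flush".getBytes()));
            ch.force(true); // still succeeds: the open fd follows the inode
            return ch.size();
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println("bytes flushed after rename: " + flushAfterRename());
    }
}
```

This is why leaving the source log's handles open lets any in-flight flush or read complete, with the actual close deferred to the asynchronous delete.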

@itoumlilt (Contributor, Author):

@junrao Thanks for the detailed review! You were right: handling this at the flush() level was incomplete as it didn't cover read operations.
I've updated the PR to implement your suggestion:

  1. Reverted the changes to LogSegment.java.
  2. Modified LogManager.replaceCurrentWithFutureLog to not close the source log immediately. This keeps the channel open for any pending operations on the renamed (.delete) directory.
  3. Added a new unit test testReplaceCurrentWithFutureLogDoesNotCloseSourceLog in LogManagerTest to verify that the source log remains open during the swap.

Please let me know if this looks good!

@github-actions bot removed labels: needs-attention, triage — Aug 16, 2025
@github-actions bot:
This PR is being marked as stale since it has not had any activity in 90 days. If you
would like to keep this PR alive, please leave a comment asking for a review. If the PR has
merge conflicts, update it with the latest from the base branch.

If you are having difficulty finding a reviewer, please reach out on the [mailing list](https://kafka.apache.org/contact).

If this PR is no longer valid or desired, please feel free to close it. If no activity occurs in the next 30 days, it will be automatically closed.

@github-actions bot added the stale label — Nov 14, 2025
@github-actions bot added the core (Kafka Broker) label — Nov 24, 2025
@itoumlilt (Contributor, Author):

Proposed a new fix implementation; amended and force-pushed over the previous commit.
Context: #20289 (comment)
This PR is open to review again!

@github-actions bot removed the stale label — Nov 25, 2025
@mimaison (Member) left a comment

Thanks for the PR, I left a couple of suggestions

@@ -1249,7 +1247,6 @@ class LogManager(logDirs: Seq[File],
srcLog.renameDir(UnifiedLog.logDeleteDirName(topicPartition), true)
// Now that replica in source log directory has been successfully renamed for deletion.
@mimaison (Member):

If we decide to not close the log here, can we update the comment too, including an explanation of why?

@itoumlilt (Contributor, Author):

Done in 19057e4; updated the comment to explain the reasoning.

@@ -43,7 +43,7 @@ import java.util.{Collections, Optional, OptionalLong, Properties}
import org.apache.kafka.server.metrics.KafkaYammerMetrics
import org.apache.kafka.server.storage.log.FetchIsolation
import org.apache.kafka.server.util.{FileLock, KafkaScheduler, MockTime, Scheduler}
import org.apache.kafka.storage.internals.log.{CleanerConfig, FetchDataInfo, LogConfig, LogDirFailureChannel, LogMetricNames, LogManager => JLogManager, LogOffsetsListener, LogStartOffsetIncrementReason, ProducerStateManagerConfig, RemoteIndexCache, UnifiedLog}
import org.apache.kafka.storage.internals.log.{CleanerConfig, FetchDataInfo, LogConfig, LogDirFailureChannel, LogFileUtils, LogMetricNames, LogOffsetsListener, LogStartOffsetIncrementReason, ProducerStateManagerConfig, RemoteIndexCache, UnifiedLog, LogManager => JLogManager}
@mimaison (Member):

Can we undo the reordering?

@itoumlilt (Contributor, Author):

Updated the import reordering in 19057e4.

@mimaison (Member) commented Jan 9, 2026

I've seen a similar failure.
It's not exactly the same stacktrace as in KAFKA-19571, but it seems to be the same underlying issue:

ERROR Error while flushing log for <TOPIC> in dir <LOG_DIR> with offset <VALUE> (exclusive) and recovery point <VALUE> (org.apache.kafka.storage.internals.log.LogDirFailureChannel)
java.nio.file.NoSuchFileException: <LOG_DIR>/<TOPIC>-future
	at java.base/sun.nio.fs.UnixException.translateToIOException(UnixException.java:92)
	at java.base/sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:111)
	at java.base/sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:116)
	at java.base/sun.nio.fs.UnixFileSystemProvider.newFileChannel(UnixFileSystemProvider.java:182)
	at java.base/java.nio.channels.FileChannel.open(FileChannel.java:292)
	at java.base/java.nio.channels.FileChannel.open(FileChannel.java:345)
	at org.apache.kafka.common.utils.Utils.flushDir(Utils.java:973)
	at kafka.log.LocalLog.flush(LocalLog.scala:177)
	at kafka.log.UnifiedLog.$anonfun$flush$2(UnifiedLog.scala:1537)
	at kafka.log.UnifiedLog.flush(UnifiedLog.scala:1724)
	at kafka.log.UnifiedLog.flushUptoOffsetExclusive(UnifiedLog.scala:1518)
	at kafka.log.UnifiedLog.$anonfun$roll$1(UnifiedLog.scala:1499)
	at org.apache.kafka.server.util.KafkaScheduler.lambda$schedule$1(KafkaScheduler.java:150)

@itoumlilt (Contributor, Author):

Thanks @mimaison for your review and for sharing this stacktrace! We've encountered the same issue in production and this patch has resolved it for us.

@showuon (Member) left a comment

Overall LGTM; left a comment to improve the test to make sure that when the race condition happens again, logSegment.flush() won't throw an exception.

verify(spyCurrentLog, never()).close()

// Verify the source log was renamed to .delete
assertTrue(spyCurrentLog.dir.getName.endsWith(LogFileUtils.DELETE_DIR_SUFFIX))
@showuon (Member):

Could we also verify that in this situation (i.e. after replaceCurrentWithFutureLog is invoked without the channel being closed), logSegment.flush() can be invoked without error?

@itoumlilt (Contributor, Author):

Good point, thanks! Added in a1f4547.

Comment on lines +1149 to +1151
// Verify that flush() can be called without error (no ClosedChannelException)
val flushLog: Executable = () => spyCurrentLog.flush(false)
assertDoesNotThrow(flushLog)
@showuon (Member):

To trigger the flush, we have to set flushOffset > localLog.recoveryPoint() (here). Because there isn't any data, the flushOffset is 0 and the recoveryPoint is 0 too. I just tested it, and this flush will not throw an exception even if we close the srcLog as before. We have to make the flushOffset > 0 to trigger the exception, something like this:

   // Verify that flush() can be called without error (no ClosedChannelException)
    when(spyCurrentLog.logEndOffset()).thenReturn(100L)
    val flushLog: Executable = () => spyCurrentLog.flush(false)
    assertDoesNotThrow(flushLog)

@itoumlilt (Contributor, Author):

Indeed, thanks! Updated in 10e33f4

@showuon (Member) left a comment

LGTM!

A race condition can occur during replica rebalancing where a log segment's
file is closed and deleted/renamed while an asynchronous flush or read
operation is still pending.

This would previously cause an unhandled ClosedChannelException, leading the
ReplicaManager to mark the entire log directory as offline.

The fix involves removing the explicit close() of the source log in
replaceCurrentWithFutureLog(). By leaving the channel open, concurrent
operations can complete successfully on the renamed files (which are moved
to the .delete directory). The log is already scheduled for asynchronous
deletion (via addLogToBeDeleted), which ensures that the log and its
resources will be properly closed and deleted by the background deletion
thread after the configured file delete delay.

A new unit test `testReplaceCurrentWithFutureLogDoesNotCloseSourceLog`
in `LogManagerTest` has been added to verify that the source log is not
closed during the swap operation.

@mimaison (Member) left a comment

Thanks for the PR, LGTM

I was wondering if there was a way to verify that UnifiedLog.flush() was actually calling the underlying flush() methods on LocalLog/LogSegment after the assertDoesNotThrow() assertion. But I couldn't find a nice way to do so.

@mimaison (Member):

@junrao Do you want to take another look?

@mimaison (Member):

Ok I'll go ahead and merge

@mimaison mimaison merged commit eaad6ed into apache:trunk Jan 29, 2026
20 checks passed
mimaison pushed a commit that referenced this pull request Jan 29, 2026
…on causing log dir to go offline (#20289)

Reviewers: Jun Rao <jun@confluent.io>, Luke Chen <showuon@gmail.com>, Mickael Maison <mickael.maison@gmail.com>
@junrao (Contributor) left a comment

@itoumlilt : Thanks for the updated PR. Sorry for the late review. LGTM. Just a minor comment.

// operations (e.g., log flusher, fetch requests) might encounter ClosedChannelException.
// The log will be deleted asynchronously by the background delete-logs thread.
// File handles are intentionally left open; Unix semantics allow the renamed files
// to remain accessible until all handles are closed.
@junrao (Contributor):

    // The log will be deleted asynchronously by the background delete-logs thread.
    // File handles are intentionally left open; Unix semantics allow the renamed files
    // to remain accessible until all handles are closed.

How about the following?

File handles are intentionally left open; Unix semantics allow the renamed files
to remain accessible until all handles are closed.
The log will be deleted asynchronously by the background delete-logs thread.
File handles are closed and files are deleted after a configured delay log.segment.delete.delay.ms.
At that time, the expectation is that no other concurrent operations need to access
the deleted file handles any more.

@junrao (Contributor) commented Jan 29, 2026

@itoumlilt Could you update the Proposed Fix section in the jira since it's outdated?

@mimaison : Since this is cherry-picked to 4.1, should we cherry-pick this to 4.2 if there is another RC?

@mimaison (Member):

Yes, we should backport to 4.2. I asked Christo on the dev list to see if we can sneak it in for 4.2.0; otherwise I'll backport to 4.2 once 4.2.0 is out. I did the backport to 4.1 yesterday, and I'll do it for 4.0 next week (or feel free to do it if you have time).

mimaison pushed a commit that referenced this pull request Feb 2, 2026
…on causing log dir to go offline (#20289)

Reviewers: Jun Rao <jun@confluent.io>, Luke Chen <showuon@gmail.com>, Mickael Maison <mickael.maison@gmail.com>
clolov pushed a commit that referenced this pull request Feb 2, 2026
…on causing log dir to go offline (#20289)

Reviewers: Jun Rao <jun@confluent.io>, Luke Chen <showuon@gmail.com>, Mickael Maison <mickael.maison@gmail.com>
@mimaison (Member) commented Feb 2, 2026

Cherry picked to 4.0, 4.1 and 4.2. The fix will be in 4.0.2, 4.1.2 and 4.2.0


Labels: ci-approved, core (Kafka Broker), small (Small PRs), storage (Pull requests that target the storage module)

4 participants