KAFKA-19571: Race condition between log segment flush and file deletion causing log dir to go offline (#20289)

Merged: mimaison merged 4 commits into apache:trunk from criteo-forks:KAFKA-19571 on Jan 29, 2026
Conversation

@itoumlilt (Contributor) commented Aug 1, 2025

Following JIRA Ticket: https://issues.apache.org/jira/browse/KAFKA-19571

A race condition can occur during replica rebalancing where a log
segment's file is deleted after an asynchronous flush has been scheduled
but before it executes.

This would previously cause an unhandled ClosedChannelException,
leading the ReplicaManager to mark the entire log directory as
offline.

The fix involves catching the ClosedChannelException within the
LogSegment.flush() method and suppressing it only if the underlying
log file no longer exists, which is the specific symptom of this race
condition. Legitimate I/O errors on existing files will still be thrown.

A unit test has been added to LogSegmentTest to verify both the fix and
the case where the exception should still be thrown.
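The catch-and-suppress pattern described above can be sketched as a small standalone program. This is not the actual Kafka code; the class and method names (`FlushSketch`, `flushQuietly`, `demo`) are illustrative, and the demo simply reproduces the race by closing the channel and deleting the file before flushing:

```java
import java.io.File;
import java.io.IOException;
import java.nio.channels.ClosedChannelException;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class FlushSketch {
    // Flush the channel; suppress ClosedChannelException only when the
    // backing file no longer exists, i.e. the race with deletion occurred.
    static boolean flushQuietly(FileChannel channel, File logFile) throws IOException {
        try {
            channel.force(true);
            return true;
        } catch (ClosedChannelException e) {
            if (logFile.exists()) {
                throw e; // channel closed but file still present: a real I/O error
            }
            return false; // file already deleted by the rebalance: swallow
        }
    }

    // Reproduce the race: close the channel and delete the file, then flush.
    static boolean demo() throws IOException {
        Path file = Files.createTempFile("segment", ".log");
        FileChannel ch = FileChannel.open(file, StandardOpenOption.WRITE);
        ch.close();
        Files.delete(file);
        return flushQuietly(ch, file.toFile());
    }

    public static void main(String[] args) throws IOException {
        System.out.println("flush suppressed: " + !demo());
    }
}
```

Note this was the first iteration of the fix; the review below led to a different approach in LogManager.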

@github-actions bot added labels: triage (PRs from the community), storage (Pull requests that target the storage module), small (Small PRs) — Aug 1, 2025
@itoumlilt (Contributor, Author):

#11438 was fixed to swallow the first NoSuchFileException WARN in the stacktrace above, but not the underlying exception.
#14280 is similar but distinct: it swallows NoSuchFileException for the race condition on a log directory move/delete, but not at the segment file level.

@github-actions bot commented Aug 9, 2025

A label of 'needs-attention' was automatically added to this PR in order to raise the
attention of the committers. Once this issue has been triaged, the triage label
should be removed to prevent this automation from happening again.

@junrao (Contributor) left a comment

@itoumlilt : Thanks for identifying the issue and submitting a PR. Left a comment.

}
});
} catch (ClosedChannelException e) {
if (!log.file().exists()) {
@junrao (Contributor):

In replaceCurrentWithFutureLog(), we rename the log dir to deleted, close the channel and schedule it to be deleted asynchronously. If we get here, it's possible that the renamed file still exists.

Also, while this fixes the issue with flush, the issue with closed channel could be exposed through read too and therefore cause the same issue of forcing the log directory to be offline.

An alternative is to avoid closing srcLog in replaceCurrentWithFutureLog(). The segments will be closed when the log is deleted after a delay. By that time, the expectation is that there won't be any pending flushes, reads, etc on the log. This approach is also consistent with the approach in LogManager.asyncDelete().
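The alternative relies on POSIX semantics: a file descriptor tracks the inode, not the path, so a channel opened before its parent directory is renamed keeps working. A minimal standalone demo (not Kafka code; `RenameDemo` and `flushAfterRename` are illustrative names, and the behaviour shown is Unix-specific — on Windows the rename would fail with the handle open):

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class RenameDemo {
    static long flushAfterRename() throws IOException {
        Path base = Files.createTempDirectory("logdir");
        Path dir = Files.createDirectory(base.resolve("topic-0"));
        Path file = Files.createFile(dir.resolve("00000000.log"));
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.WRITE)) {
            // Simulate the source log directory being renamed for deletion.
            Files.move(dir, base.resolve("topic-0-delete"));
            ch.write(ByteBuffer.wrap("pending flush".getBytes()));
            ch.force(true); // still succeeds: the open fd follows the inode
            return ch.size();
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println("bytes flushed after rename: " + flushAfterRename());
    }
}
```

This is why leaving the source log's handles open lets any in-flight flush or read complete, with the actual close deferred to the asynchronous delete.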

@itoumlilt (Contributor, Author):

@junrao Thanks for the detailed review! You were right: handling this at the flush() level was incomplete as it didn't cover read operations.
I've updated the PR to implement your suggestion:

  1. Reverted the changes to LogSegment.java.
  2. Modified LogManager.replaceCurrentWithFutureLog to not close the source log immediately. This keeps the channel open for any pending operations on the renamed (.delete) directory.
  3. Added a new unit test testReplaceCurrentWithFutureLogDoesNotCloseSourceLog in LogManagerTest to verify that the source log remains open during the swap.

Please let me know if this looks good!

@github-actions bot removed labels: needs-attention, triage — Aug 16, 2025
@github-actions bot:
This PR is being marked as stale since it has not had any activity in 90 days. If you
would like to keep this PR alive, please leave a comment asking for a review. If the PR has
merge conflicts, update it with the latest from the base branch.

If you are having difficulty finding a reviewer, please reach out on the [mailing list](https://kafka.apache.org/contact).

If this PR is no longer valid or desired, please feel free to close it. If no activity occurs in the next 30 days, it will be automatically closed.

@github-actions bot added the stale label — Nov 14, 2025
@github-actions bot added the core (Kafka Broker) label — Nov 24, 2025
@itoumlilt (Contributor, Author):

Proposed a new fix implementation; amended and force-pushed over the previous commit.
Context: #20289 (comment)
This PR is open to review again!

@github-actions bot removed the stale label — Nov 25, 2025
@mimaison (Member) left a comment

Thanks for the PR, I left a couple of suggestions

@@ -1249,7 +1247,6 @@ class LogManager(logDirs: Seq[File],
srcLog.renameDir(UnifiedLog.logDeleteDirName(topicPartition), true)
// Now that replica in source log directory has been successfully renamed for deletion.
@mimaison (Member):

If we decide to not close the log here, can we update the comment too, including an explanation of why?

@itoumlilt (Contributor, Author):

Done in 19057e4; updated the comment to explain the reasoning.

@@ -43,7 +43,7 @@ import java.util.{Collections, Optional, OptionalLong, Properties}
import org.apache.kafka.server.metrics.KafkaYammerMetrics
import org.apache.kafka.server.storage.log.FetchIsolation
import org.apache.kafka.server.util.{FileLock, KafkaScheduler, MockTime, Scheduler}
import org.apache.kafka.storage.internals.log.{CleanerConfig, FetchDataInfo, LogConfig, LogDirFailureChannel, LogMetricNames, LogManager => JLogManager, LogOffsetsListener, LogStartOffsetIncrementReason, ProducerStateManagerConfig, RemoteIndexCache, UnifiedLog}
import org.apache.kafka.storage.internals.log.{CleanerConfig, FetchDataInfo, LogConfig, LogDirFailureChannel, LogFileUtils, LogMetricNames, LogOffsetsListener, LogStartOffsetIncrementReason, ProducerStateManagerConfig, RemoteIndexCache, UnifiedLog, LogManager => JLogManager}
@mimaison (Member):

Can we undo the reordering?

@itoumlilt (Contributor, Author):

Updated the import reordering in 19057e4.

@mimaison (Member) commented Jan 9, 2026

I've seen a similar failure.
It's not exactly the same stacktrace as in KAFKA-19571, but it seems to be the same underlying issue:

ERROR Error while flushing log for <TOPIC> in dir <LOG_DIR> with offset <VALUE> (exclusive) and recovery point <VALUE> (org.apache.kafka.storage.internals.log.LogDirFailureChannel)
java.nio.file.NoSuchFileException: <LOG_DIR>/<TOPIC>-future
	at java.base/sun.nio.fs.UnixException.translateToIOException(UnixException.java:92)
	at java.base/sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:111)
	at java.base/sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:116)
	at java.base/sun.nio.fs.UnixFileSystemProvider.newFileChannel(UnixFileSystemProvider.java:182)
	at java.base/java.nio.channels.FileChannel.open(FileChannel.java:292)
	at java.base/java.nio.channels.FileChannel.open(FileChannel.java:345)
	at org.apache.kafka.common.utils.Utils.flushDir(Utils.java:973)
	at kafka.log.LocalLog.flush(LocalLog.scala:177)
	at kafka.log.UnifiedLog.$anonfun$flush$2(UnifiedLog.scala:1537)
	at kafka.log.UnifiedLog.flush(UnifiedLog.scala:1724)
	at kafka.log.UnifiedLog.flushUptoOffsetExclusive(UnifiedLog.scala:1518)
	at kafka.log.UnifiedLog.$anonfun$roll$1(UnifiedLog.scala:1499)
	at org.apache.kafka.server.util.KafkaScheduler.lambda$schedule$1(KafkaScheduler.java:150)

@itoumlilt (Contributor, Author):

Thanks @mimaison for your review and for sharing this stacktrace! We've encountered the same issue in production and this patch has resolved it for us.

@showuon (Member) left a comment

Overall LGTM; left a comment to improve the test to make sure that when the race condition happens again, logSegment.flush() won't throw an exception.

verify(spyCurrentLog, never()).close()

// Verify the source log was renamed to .delete
assertTrue(spyCurrentLog.dir.getName.endsWith(LogFileUtils.DELETE_DIR_SUFFIX))
@showuon (Member):

Could we also verify that in this situation (i.e. after replaceCurrentWithFutureLog is invoked without the channel being closed), logSegment.flush() can be invoked without error?

@itoumlilt (Contributor, Author):

Good point, thanks! Added in a1f4547.

Comment on lines +1149 to +1151
// Verify that flush() can be called without error (no ClosedChannelException)
val flushLog: Executable = () => spyCurrentLog.flush(false)
assertDoesNotThrow(flushLog)
@showuon (Member):

To trigger the flush, we have to set flushOffset > localLog.recoveryPoint() (here). Because there isn't any data, the flushOffset is 0 and the recoveryPoint is 0 too. I just tested it, and this flush will not throw an exception even if we close the srcLog as before. We have to make the flushOffset > 0 to trigger the exception, something like this:

   // Verify that flush() can be called without error (no ClosedChannelException)
    when(spyCurrentLog.logEndOffset()).thenReturn(100L)
    val flushLog: Executable = () => spyCurrentLog.flush(false)
    assertDoesNotThrow(flushLog)

@itoumlilt (Contributor, Author):

Indeed, thanks! Updated in 10e33f4

@showuon (Member) left a comment

LGTM!

A race condition can occur during replica rebalancing where a log segment's
file is closed and deleted/renamed while an asynchronous flush or read
operation is still pending.

This would previously cause an unhandled ClosedChannelException, leading the
ReplicaManager to mark the entire log directory as offline.

The fix involves removing the explicit close() of the source log in
replaceCurrentWithFutureLog(). By leaving the channel open, concurrent
operations can complete successfully on the renamed files (which are moved
to the .delete directory). The log is already scheduled for asynchronous
deletion (via addLogToBeDeleted), which ensures that the log and its
resources will be properly closed and deleted by the background deletion
thread after the configured file delete delay.

A new unit test `testReplaceCurrentWithFutureLogDoesNotCloseSourceLog`
in `LogManagerTest` has been added to verify that the source log is not
closed during the swap operation.

@mimaison (Member) left a comment

Thanks for the PR, LGTM

I was wondering if there was a way to verify that UnifiedLog.flush() was actually calling the underlying flush() methods on LocalLog/LogSegment after the assertDoesNotThrow() assertion. But I couldn't find a nice way to do so.

@mimaison (Member):

@junrao Do you want to take another look?

@mimaison (Member):

Ok I'll go ahead and merge

@mimaison mimaison merged commit eaad6ed into apache:trunk Jan 29, 2026
20 checks passed
mimaison pushed a commit that referenced this pull request Jan 29, 2026
…on causing log dir to go offline (#20289)

Reviewers: Jun Rao <jun@confluent.io>, Luke Chen <showuon@gmail.com>, Mickael Maison <mickael.maison@gmail.com>
@junrao (Contributor) left a comment

@itoumlilt : Thanks for the updated PR. Sorry for the late review. LGTM. Just a minor comment.

// operations (e.g., log flusher, fetch requests) might encounter ClosedChannelException.
// The log will be deleted asynchronously by the background delete-logs thread.
// File handles are intentionally left open; Unix semantics allow the renamed files
// to remain accessible until all handles are closed.
@junrao (Contributor):

    // The log will be deleted asynchronously by the background delete-logs thread.
    // File handles are intentionally left open; Unix semantics allow the renamed files
    // to remain accessible until all handles are closed.

How about the following?

File handles are intentionally left open; Unix semantics allow the renamed files
to remain accessible until all handles are closed.
The log will be deleted asynchronously by the background delete-logs thread.
File handles are closed and files are deleted after a configured delay log.segment.delete.delay.ms.
At that time, the expectation is that no other concurrent operations need to access
the deleted file handles any more.

@junrao (Contributor) commented Jan 29, 2026

@itoumlilt Could you update the Proposed Fix section in the jira since it's outdated?

@mimaison : Since this is cherry-picked to 4.1, should we cherry-pick this to 4.2 if there is another RC?

@mimaison (Member):

Yes, we should backport to 4.2. I asked Christo on the dev list to see if we can sneak it in for 4.2.0; otherwise I'll backport to 4.2 once 4.2.0 is out. I did the backport to 4.1 yesterday, and I'll do it for 4.0 next week (or feel free to do it if you have time).

mimaison pushed a commit that referenced this pull request Feb 2, 2026
…on causing log dir to go offline (#20289)

Reviewers: Jun Rao <jun@confluent.io>, Luke Chen <showuon@gmail.com>, Mickael Maison <mickael.maison@gmail.com>
clolov pushed a commit that referenced this pull request Feb 2, 2026
…on causing log dir to go offline (#20289)

Reviewers: Jun Rao <jun@confluent.io>, Luke Chen <showuon@gmail.com>, Mickael Maison <mickael.maison@gmail.com>
@mimaison (Member) commented Feb 2, 2026

Cherry picked to 4.0, 4.1 and 4.2. The fix will be in 4.0.2, 4.1.2 and 4.2.0


Labels: ci-approved, core (Kafka Broker), small (Small PRs), storage (Pull requests that target the storage module)

4 participants