KAFKA-17142: Fix deadlock caused by LogManagerTest#testLogRecoveryMetrics #16614

Merged
chia7712 merged 1 commit into apache:trunk from FrankYang0529:KAFKA-17142
Jul 18, 2024

Conversation

@FrankYang0529
Member

In LogManagerTest#testLogRecoveryMetrics, add a delay before creating the second UnifiedLog to avoid the deadlock.

Committer Checklist (excluded from commit message)

  • Verify design and implementation
  • Verify test coverage and CI build status
  • Verify documentation (including upgrade notes)

val topicPartition = UnifiedLog.parseTopicPartitionName(dir)
val config = topicConfigOverrides.getOrElse(topicPartition.topic, logConfig)

if (dir == logDir2) {
Member

Yep, this is one of the solutions I described in the Jira. It does not change production code, but it can't fix the issue completely.

If we don't want to touch production code, maybe another way is to reduce the number of log folders from 2 to 1. I'm not sure whether that is valid for this test case.

Member

@FrankYang0529 Could you please try another approach: creating the snapshot while holding the write lock?

Member Author

@FrankYang0529 FrankYang0529 Jul 18, 2024

Yes, I tried copying epochs.values() while holding the write lock, so we don't need to acquire the read lock again to get the values.

…rics

Signed-off-by: PoAn Yang <payang@apache.org>
Member

@showuon showuon left a comment

I like this solution! Thanks for the fix.

@chia7712
Member

@ocadaruma could you please take a look? I prefer to modify the production code to fix the deadlock in testing, because this solution is simple and makes sense to me.

Member

@chia7712 chia7712 left a comment

LGTM

@chia7712 chia7712 merged commit cf9d517 into apache:trunk Jul 18, 2024
@chia7712
Member

@ocadaruma please feel free to raise objections as a follow-up. This issue causes our CI to hang, so I'm merging it for now.

@FrankYang0529 FrankYang0529 deleted the KAFKA-17142 branch July 18, 2024 11:30
Contributor

@junrao junrao left a comment

@FrankYang0529 : Thanks for identifying the problem and submitting a PR. I left a comment below.

// another truncateFromEnd call on log loading procedure, so it won't be a problem
scheduler.scheduleOnce("leader-epoch-cache-flush-" + topicPartition, this::writeToFileForTruncation);
List<EpochEntry> entries = new ArrayList<>(epochs.values());
scheduler.scheduleOnce("leader-epoch-cache-flush-" + topicPartition, () -> checkpoint.writeForTruncation(entries));
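For context, the change quoted above can be sketched as follows: the entries are copied on the calling thread, and only that copy is handed to the scheduler for the asynchronous write. The names below (the map, the single-thread executor standing in for Kafka's scheduler) are illustrative stand-ins, not Kafka's actual classes:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class SnapshotOnCallerDemo {
    // epoch -> start offset; stands in for the cache's `epochs` map
    static final TreeMap<Integer, Long> epochs = new TreeMap<>();
    static volatile List<Long> checkpointed; // what the "flush" task wrote

    public static void main(String[] args) throws Exception {
        ExecutorService scheduler = Executors.newSingleThreadExecutor();
        epochs.put(0, 0L);
        epochs.put(1, 100L);

        // The PR's approach: copy the entries on the calling thread (under the
        // cache lock in the real code) and hand only the copy to the scheduler,
        // so the scheduled task no longer needs to take the read lock itself.
        List<Long> entries = new ArrayList<>(epochs.values());
        Future<?> flush = scheduler.submit(() -> { checkpointed = entries; });

        epochs.put(2, 200L); // a later mutation cannot change the copy
        flush.get();
        scheduler.shutdown();
        System.out.println("checkpointed " + checkpointed.size() + " entries");
    }
}
```

Note that the copy reflects the cache at schedule time, which is exactly the staleness that the review below discusses.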
Contributor

This approach introduces a new correctness issue. With this change, it's possible for older epoch entries to overwrite the newer epoch entries in the leader epoch file. Consider the following sequence: we take a snapshot of the epoch entries here; a new epoch entry is added and is flushed to disk; the scheduler then writes the snapshot to disk. This can lead to the case where the leader epoch file doesn't contain all entries up to the recovery point.

Since the issue is only in the test, I am wondering if we could fix the test directly. For example, perhaps we could introduce a NoOpScheduler and use it in the test, since the test doesn't depend on the leader epoch entries to be actually flushed to disk.
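A NoOpScheduler along these lines could look roughly like the sketch below. The Scheduler interface here is a simplified stand-in (Kafka's real Scheduler interface has more methods), so treat all names as assumptions:

```java
import java.util.concurrent.atomic.AtomicInteger;

// Simplified stand-in for Kafka's Scheduler interface (illustrative only).
interface Scheduler {
    void scheduleOnce(String name, Runnable task);
}

// A scheduler that silently drops every task: the test then never flushes the
// leader epoch checkpoint asynchronously, so the lock interleaving that caused
// the deadlock cannot occur.
class NoOpScheduler implements Scheduler {
    final AtomicInteger dropped = new AtomicInteger();

    @Override
    public void scheduleOnce(String name, Runnable task) {
        dropped.incrementAndGet(); // record the task but never run it
    }
}

public class NoOpSchedulerDemo {
    public static void main(String[] args) {
        NoOpScheduler scheduler = new NoOpScheduler();
        scheduler.scheduleOnce("leader-epoch-cache-flush-t-0", () -> {
            throw new AssertionError("must never run");
        });
        System.out.println("dropped=" + scheduler.dropped.get());
    }
}
```

This only works because, as noted above, the test does not depend on the leader epoch entries actually being flushed to disk.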

Member

Since the issue is only in the test, I am wondering if we could fix the test directly. For example, perhaps we could introduce a NoOpScheduler and use it in the test, since the test doesn't depend on the leader epoch entries to be actually flushed to disk.

this is another good approach.

This approach introduces a new correctness issue. With this change, it's possible for older epoch entries to overwrite the newer epoch entries in the leader epoch file. Consider the following sequence: we take a snapshot of the epoch entries here; a new epoch entry is added and is flushed to disk; the scheduler then writes the snapshot to disk. This can lead to the case where the leader epoch file doesn't contain all entries up to the recovery point.

Sorry for the possible correctness issue. @FrankYang0529 and I discussed the approach offline when I noticed the deadlock, and I suggested changing the production code directly. It seems to me this PR does NOT change the execution order, because writeToFileForTruncation does not hold a single lock across both the "snapshot" and the "flush":

    private void writeToFileForTruncation() {
        // phase 1: create snapshot by holding read lock
        List<EpochEntry> entries;
        lock.readLock().lock();
        try {
            entries = new ArrayList<>(epochs.values());
        } finally {
            lock.readLock().unlock();
        }
        // phase 2: flush without holding any lock
        checkpoint.writeForTruncation(entries);
    }

Hence, the issue you mentioned can happen even if we revert this PR. For example:

  1. writeToFileForTruncation (run by scheduler) take a snapshot of the epoch entries in phase 1 (see comment in above code)
  2. a new epoch entry is added and is flushed to disk
  3. writeToFileForTruncation (run by scheduler) then writes the snapshot to disk in phase 2 (see comment in above code)
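That interleaving can be replayed deterministically with plain lists standing in for the in-memory cache and the checkpoint file (all names here are illustrative, not Kafka's):

```java
import java.util.ArrayList;
import java.util.List;

public class StaleSnapshotDemo {
    static final List<Integer> epochs = new ArrayList<>();         // in-memory cache
    static final List<Integer> checkpointFile = new ArrayList<>(); // "on-disk" state

    // Stand-in for checkpoint.writeForTruncation: rewrites the whole file.
    static void flush(List<Integer> entries) {
        checkpointFile.clear();
        checkpointFile.addAll(entries);
    }

    public static void main(String[] args) {
        epochs.add(1);
        // step 1: phase 1 of writeToFileForTruncation takes a snapshot
        List<Integer> snapshot = new ArrayList<>(epochs);
        // step 2: a new epoch entry is added and flushed to disk
        epochs.add(2);
        flush(new ArrayList<>(epochs));
        // step 3: phase 2 writes the stale snapshot, overwriting epoch 2
        flush(snapshot);
        System.out.println(checkpointFile); // the newer entry is gone
    }
}
```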

In summary, there are two follow-ups:

  1. rewrite testLogRecoveryMetrics with a NoOpScheduler
  2. add writeToFileForTruncation back without the separate "snapshot" step, for example:
    private void writeToFileForTruncation() {
        lock.readLock().lock();
        try {
            checkpoint.writeForTruncation(epochs.values());
        } finally {
            lock.readLock().unlock();
        }
    }

@junrao WDYT?

Member

The suggestion makes sense to me.

Contributor

@ocadaruma ocadaruma Jul 19, 2024

@chia7712 Hi, thank you for pointing out that the potential race issue exists even in the current code.

The follow-up looks good to me.

For follow-up 2, which moves the checkpoint flush inside the lock, one concern is that request-handler/replica-fetcher threads could block on the fsync latency (i.e. threads calling truncateFromStart/EndAsyncFlush would be blocked in the meantime).
However, it might not be a critical performance issue because:

  • These methods are not called frequently (the typical call path is truncation during fetch-response handling and deleteRecords handling), so a call is unlikely to overlap with an ongoing writeToFileForTruncation (scheduled by a previous call) and cause lock contention
    • Unless the kafka-schedulers are very busy and task execution is delayed

Let me consider whether some optimization is possible for this as another follow-up.
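The blocking concern can be demonstrated with a plain ReentrantReadWriteLock: while one thread holds the read lock through a slow flush, a writer (e.g. a thread performing a truncation) has to wait. A rough sketch with arbitrary timings:

```java
import java.util.concurrent.locks.ReentrantReadWriteLock;

public class FlushBlocksWriterDemo {
    static final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();

    public static void main(String[] args) throws Exception {
        Thread flusher = new Thread(() -> {
            lock.readLock().lock();
            try {
                Thread.sleep(300); // stand-in for a slow fsync during the checkpoint write
            } catch (InterruptedException ignored) {
                // demo only
            } finally {
                lock.readLock().unlock();
            }
        });
        flusher.start();
        Thread.sleep(50); // let the flusher grab the read lock first

        long start = System.nanoTime();
        lock.writeLock().lock(); // a truncating thread blocks here until the flush ends
        lock.writeLock().unlock();
        long waitedMs = (System.nanoTime() - start) / 1_000_000;
        System.out.println("writer waited >= 200ms: " + (waitedMs >= 200));
        flusher.join();
    }
}
```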

Contributor

@chia7712 : Yes, you are right that the overwriting issue was already introduced in #15993. Moving the flush call inside the read lock fixes this issue, but it defeats the original performance optimization in #14242. @ocadaruma : What's your opinion on this?

Member Author

Hi all, thanks for raising the correctness issue. IMO, we should fix data correctness first, and then improve performance in ways that don't break it.

I will rewrite testLogRecoveryMetrics with NoOpScheduler first, and then see whether we need to improve LeaderEpochFileCache performance with its own scheduler. Thank you.

Contributor

@ocadaruma : Thanks for the explanation. Yes, I agree that the async flush still gives us some perf benefits. As for the fix, the two followups suggested by @chia7712 sound reasonable to me. They probably should be done in the same PR?

Member

They probably should be done in the same PR?

I assumed that the production-code change needs more discussion. For example:

Yeah, could be an issue in some cases (e.g. deleteRecords is called frequently, and/or kafka-schedulers are busy) though.

The two follow-ups are orthogonal now, and hence I prefer to fix them separately to avoid unnecessary blocking.

BTW, please feel free to leave more comments on https://issues.apache.org/jira/browse/KAFKA-17167 about the fix.

Contributor

Hmm, I thought the simple fix you suggested is to do the following. This will bring back the deadlock issue in the test, right?

    private void writeToFileForTruncation() {
        lock.readLock().lock();
        try {
            checkpoint.writeForTruncation(epochs.values());
        } finally {
            lock.readLock().unlock();
        }
    }

Member

This will bring back the deadlock issue in the test, right?

Yes, it does. However, my point was: if @ocadaruma's comment ("Yeah, could be an issue in some cases (e.g. deleteRecords is called frequently, and/or kafka-schedulers are busy) though.") needs more discussion, we can improve the test before adding writeToFileForTruncation back to production.

At any rate, it seems we all agree on the simple fix for now, so I filed KAFKA-17166 and KAFKA-17167.

chia7712 pushed a commit that referenced this pull request Jul 22, 2024
…ics (#16614)

Reviewers: Luke Chen <showuon@gmail.com>, Chia-Ping Tsai <chia7712@gmail.com>
abhi-ksolves pushed a commit to ksolves/kafka that referenced this pull request Jul 31, 2024
…ics (apache#16614)

Reviewers: Luke Chen <showuon@gmail.com>, Chia-Ping Tsai <chia7712@gmail.com>
