KAFKA-17142: Fix deadlock caused by LogManagerTest#testLogRecoveryMetrics #16614

Merged
chia7712 merged 1 commit into apache:trunk from FrankYang0529:KAFKA-17142
Jul 18, 2024

Conversation

@FrankYang0529
Member

In LogManagerTest#testLogRecoveryMetrics, add a delay before creating the second UnifiedLog to avoid the deadlock.

Committer Checklist (excluded from commit message)

  • Verify design and implementation
  • Verify test coverage and CI build status
  • Verify documentation (including upgrade notes)

val topicPartition = UnifiedLog.parseTopicPartitionName(dir)
val config = topicConfigOverrides.getOrElse(topicPartition.topic, logConfig)

if (dir == logDir2) {
Member

Yep, this is one of the solutions I described in the Jira. It does not change production code, but it can't fix the issue completely.

If we don't want to touch production code, maybe another way is to reduce the number of log folders from 2 to 1. I'm not sure whether that is valid for this test case.

Member

@FrankYang0529 Could you please try another approach: creating the snapshot while holding the write lock?

Member Author

@FrankYang0529 FrankYang0529 Jul 18, 2024

Yes, I tried copying epochs.values() while holding the write lock, so we don't need to acquire the read lock again to get the values.

…rics

Signed-off-by: PoAn Yang <payang@apache.org>
Member

@showuon showuon left a comment

I like this solution! Thanks for the fix.

@chia7712
Member

@ocadaruma could you please take a look? I prefer to modify the production code to fix the deadlock in testing, because this solution is simple and makes sense to me.

Member

@chia7712 chia7712 left a comment

LGTM

@chia7712 chia7712 merged commit cf9d517 into apache:trunk Jul 18, 2024
@chia7712
Member

@ocadaruma please feel free to raise objections as a follow-up. This issue causes our CI to hang, so I'm merging it for now.

@FrankYang0529 FrankYang0529 deleted the KAFKA-17142 branch July 18, 2024 11:30
Contributor

@junrao junrao left a comment

@FrankYang0529 : Thanks for identifying the problem and submitting a PR. I left a comment below.

// another truncateFromEnd call on log loading procedure, so it won't be a problem
scheduler.scheduleOnce("leader-epoch-cache-flush-" + topicPartition, this::writeToFileForTruncation);
List<EpochEntry> entries = new ArrayList<>(epochs.values());
scheduler.scheduleOnce("leader-epoch-cache-flush-" + topicPartition, () -> checkpoint.writeForTruncation(entries));
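For context, the change quoted above can be sketched as follows: the entries are copied on the calling thread, and only that copy is handed to the scheduler for the asynchronous write. The names below (the map, the single-thread executor standing in for Kafka's scheduler) are illustrative stand-ins, not Kafka's actual classes:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class SnapshotOnCallerDemo {
    // epoch -> start offset; stands in for the cache's `epochs` map
    static final TreeMap<Integer, Long> epochs = new TreeMap<>();
    static volatile List<Long> checkpointed; // what the "flush" task wrote

    public static void main(String[] args) throws Exception {
        ExecutorService scheduler = Executors.newSingleThreadExecutor();
        epochs.put(0, 0L);
        epochs.put(1, 100L);

        // The PR's approach: copy the entries on the calling thread (under the
        // cache lock in the real code) and hand only the copy to the scheduler,
        // so the scheduled task no longer needs to take the read lock itself.
        List<Long> entries = new ArrayList<>(epochs.values());
        Future<?> flush = scheduler.submit(() -> { checkpointed = entries; });

        epochs.put(2, 200L); // a later mutation cannot change the copy
        flush.get();
        scheduler.shutdown();
        System.out.println("checkpointed " + checkpointed.size() + " entries");
    }
}
```

Note that the copy reflects the cache at schedule time, which is exactly the staleness that the review below discusses.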
Contributor

This approach introduces a new correctness issue. With this change, it's possible for older epoch entries to overwrite the newer epoch entries in the leader epoch file. Consider the following sequence: we take a snapshot of the epoch entries here; a new epoch entry is added and is flushed to disk; the scheduler then writes the snapshot to disk. This can lead to the case where the leader epoch file doesn't contain all entries up to the recovery point.

Since the issue is only in the test, I am wondering if we could fix the test directly. For example, perhaps we could introduce a NoOpScheduler and use it in the test, since the test doesn't depend on the leader epoch entries to be actually flushed to disk.
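A NoOpScheduler along these lines could look roughly like the sketch below. The Scheduler interface here is a simplified stand-in (Kafka's real Scheduler interface has more methods), so treat all names as assumptions:

```java
import java.util.concurrent.atomic.AtomicInteger;

// Simplified stand-in for Kafka's Scheduler interface (illustrative only).
interface Scheduler {
    void scheduleOnce(String name, Runnable task);
}

// A scheduler that silently drops every task: the test then never flushes the
// leader epoch checkpoint asynchronously, so the lock interleaving that caused
// the deadlock cannot occur.
class NoOpScheduler implements Scheduler {
    final AtomicInteger dropped = new AtomicInteger();

    @Override
    public void scheduleOnce(String name, Runnable task) {
        dropped.incrementAndGet(); // record the task but never run it
    }
}

public class NoOpSchedulerDemo {
    public static void main(String[] args) {
        NoOpScheduler scheduler = new NoOpScheduler();
        scheduler.scheduleOnce("leader-epoch-cache-flush-t-0", () -> {
            throw new AssertionError("must never run");
        });
        System.out.println("dropped=" + scheduler.dropped.get());
    }
}
```

This only works because, as noted above, the test does not depend on the leader epoch entries actually being flushed to disk.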

Member

Since the issue is only in the test, I am wondering if we could fix the test directly. For example, perhaps we could introduce a NoOpScheduler and use it in the test, since the test doesn't depend on the leader epoch entries to be actually flushed to disk.

this is another good approach.

This approach introduces a new correctness issue. With this change, it's possible for older epoch entries to overwrite the newer epoch entries in the leader epoch file. Consider the following sequence: we take a snapshot of the epoch entries here; a new epoch entry is added and is flushed to disk; the scheduler then writes the snapshot to disk. This can lead to the case where the leader epoch file doesn't contain all entries up to the recovery point.

Sorry for the possible correctness issue. @FrankYang0529 and I discussed the approach offline when I noticed the deadlock, and I suggested changing the production code directly. It seems to me this PR does NOT change the execution order, because writeToFileForTruncation does not hold a single lock across both the "snapshot" and the "flush":

    private void writeToFileForTruncation() {
        // phase 1: create snapshot by holding read lock
        List<EpochEntry> entries;
        lock.readLock().lock();
        try {
            entries = new ArrayList<>(epochs.values());
        } finally {
            lock.readLock().unlock();
        }
        // phase 2: flush without holding any lock
        checkpoint.writeForTruncation(entries);
    }

Hence, the issue you mentioned can happen even if we revert this PR. For example:

  1. writeToFileForTruncation (run by scheduler) take a snapshot of the epoch entries in phase 1 (see comment in above code)
  2. a new epoch entry is added and is flushed to disk
  3. writeToFileForTruncation (run by scheduler) then writes the snapshot to disk in phase 2 (see comment in above code)
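That interleaving can be replayed deterministically with plain lists standing in for the in-memory cache and the checkpoint file (all names here are illustrative, not Kafka's):

```java
import java.util.ArrayList;
import java.util.List;

public class StaleSnapshotDemo {
    static final List<Integer> epochs = new ArrayList<>();         // in-memory cache
    static final List<Integer> checkpointFile = new ArrayList<>(); // "on-disk" state

    // Stand-in for checkpoint.writeForTruncation: rewrites the whole file.
    static void flush(List<Integer> entries) {
        checkpointFile.clear();
        checkpointFile.addAll(entries);
    }

    public static void main(String[] args) {
        epochs.add(1);
        // step 1: phase 1 of writeToFileForTruncation takes a snapshot
        List<Integer> snapshot = new ArrayList<>(epochs);
        // step 2: a new epoch entry is added and flushed to disk
        epochs.add(2);
        flush(new ArrayList<>(epochs));
        // step 3: phase 2 writes the stale snapshot, overwriting epoch 2
        flush(snapshot);
        System.out.println(checkpointFile); // the newer entry is gone
    }
}
```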

In summary, there are two follow-ups:

  1. rewrite testLogRecoveryMetrics with a NoOpScheduler
  2. add writeToFileForTruncation back without the separate "snapshot" step, for example:
    private void writeToFileForTruncation() {
        lock.readLock().lock();
        try {
            checkpoint.writeForTruncation(epochs.values());
        } finally {
            lock.readLock().unlock();
        }
    }

@junrao WDYT?

Member

The suggestion makes sense to me.

Contributor

@ocadaruma ocadaruma Jul 19, 2024

@chia7712 Hi, thank you for pointing out that the potential race issue exists even in the current code.

The follow-up looks good to me.

For follow-up 2, which moves the checkpoint flush inside the lock, one concern is that request-handler/replica-fetcher threads could block on the fsync latency (i.e. threads calling truncateFromStart/EndAsyncFlush would be blocked in the meantime).
However, it might not be a critical performance issue because:

  • These methods are not called frequently (the typical call path is truncation during fetch-response handling and deleteRecords handling), so a call is unlikely to overlap with an ongoing writeToFileForTruncation (scheduled by a previous call) and cause lock contention
    • Unless the kafka-schedulers are very busy and task execution is delayed

Let me consider whether some optimization is possible for this as another follow-up.
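The blocking concern can be demonstrated with a plain ReentrantReadWriteLock: while one thread holds the read lock through a slow flush, a writer (e.g. a thread performing a truncation) has to wait. A rough sketch with arbitrary timings:

```java
import java.util.concurrent.locks.ReentrantReadWriteLock;

public class FlushBlocksWriterDemo {
    static final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();

    public static void main(String[] args) throws Exception {
        Thread flusher = new Thread(() -> {
            lock.readLock().lock();
            try {
                Thread.sleep(300); // stand-in for a slow fsync during the checkpoint write
            } catch (InterruptedException ignored) {
                // demo only
            } finally {
                lock.readLock().unlock();
            }
        });
        flusher.start();
        Thread.sleep(50); // let the flusher grab the read lock first

        long start = System.nanoTime();
        lock.writeLock().lock(); // a truncating thread blocks here until the flush ends
        lock.writeLock().unlock();
        long waitedMs = (System.nanoTime() - start) / 1_000_000;
        System.out.println("writer waited >= 200ms: " + (waitedMs >= 200));
        flusher.join();
    }
}
```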

Contributor

@chia7712 : Yes, you are right that the overwriting issue was already introduced in #15993. Moving the flush call inside the read lock fixes this issue, but it defeats the original performance optimization in #14242. @ocadaruma : What's your opinion on this?

Member Author

Hi all, thanks for raising the correctness issue. IMO, we should fix data correctness first, and then improve performance in ways that don't break it.

I will rewrite testLogRecoveryMetrics with NoOpScheduler first, and then see whether we need to improve LeaderEpochFileCache performance with its own scheduler. Thank you.

Contributor

@ocadaruma : Thanks for the explanation. Yes, I agree that the async flush still gives us some perf benefits. As for the fix, the two followups suggested by @chia7712 sound reasonable to me. They probably should be done in the same PR?

Member

They probably should be done in the same PR?

I assumed that the production-code change needs more discussion. For example:

Yeah, could be an issue in some cases (e.g. deleteRecords is called frequently, and/or kafka-schedulers are busy) though.

The two follow-ups are orthogonal now, and hence I prefer to fix them separately to avoid unnecessary blocking.

BTW, please feel free to leave more comments on https://issues.apache.org/jira/browse/KAFKA-17167 about the fix.

Contributor

Hmm, I thought the simple fix you suggested is to do the following. This will bring back the deadlock issue in the test, right?

    private void writeToFileForTruncation() {
        lock.readLock().lock();
        try {
            checkpoint.writeForTruncation(epochs.values());
        } finally {
            lock.readLock().unlock();
        }
    }

Member

This will bring back the deadlock issue in the test, right?

Yes, it does. However, my point was: if @ocadaruma's comment ("Yeah, could be an issue in some cases (e.g. deleteRecords is called frequently, and/or kafka-schedulers are busy) though.") needs more discussion, we can improve the test before adding writeToFileForTruncation back to production.

At any rate, it seems we all agree on the simple fix for now, so I filed KAFKA-17166 and KAFKA-17167.

chia7712 pushed a commit that referenced this pull request Jul 22, 2024
…ics (#16614)

Reviewers: Luke Chen <showuon@gmail.com>, Chia-Ping Tsai <chia7712@gmail.com>
abhi-ksolves pushed a commit to ksolves/kafka that referenced this pull request Jul 31, 2024
…ics (apache#16614)

Reviewers: Luke Chen <showuon@gmail.com>, Chia-Ping Tsai <chia7712@gmail.com>
