KAFKA-17142: Fix deadlock caused by LogManagerTest#testLogRecoveryMetrics (#16614)
chia7712 merged 1 commit into apache:trunk
Conversation
```scala
val topicPartition = UnifiedLog.parseTopicPartitionName(dir)
val config = topicConfigOverrides.getOrElse(topicPartition.topic, logConfig)
// ...
if (dir == logDir2) {
```
Yep, this is one of the solutions I described in the JIRA. It does not change production code, but it can't fix the issue completely.
If we don't want to touch production code, maybe another way is to reduce the number of folders from 2 to 1. I'm not sure whether that is valid for this test case.
@FrankYang0529 Could you please try another approach: create the snapshot while holding the write lock?
Yes, I changed the code to read `epochs.values()` while holding the write lock, so we don't need to acquire the read lock again to get the values.
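A minimal sketch of that idea, with hypothetical, simplified names (these are not the real Kafka classes): the truncation path takes the epoch snapshot while the write lock is still held, so the scheduled flush works on an immutable copy and never re-acquires the cache lock.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical, simplified stand-in for LeaderEpochFileCache; the real class
// lives in org.apache.kafka.storage and has a different shape.
class EpochCacheSketch {
    record EpochEntry(int epoch, long startOffset) {}

    private final List<EpochEntry> epochs = new ArrayList<>();
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
    // Stand-in for the leader-epoch checkpoint file: records each flush.
    final List<List<EpochEntry>> checkpointWrites = new ArrayList<>();

    void assign(int epoch, long startOffset) {
        lock.writeLock().lock();
        try {
            epochs.add(new EpochEntry(epoch, startOffset));
        } finally {
            lock.writeLock().unlock();
        }
    }

    void truncateFromEnd(long endOffset) {
        lock.writeLock().lock();
        try {
            epochs.removeIf(e -> e.startOffset() >= endOffset);
            // Snapshot while still holding the write lock; the scheduled
            // flush then works on the snapshot and never touches the cache
            // lock again, so it cannot deadlock with log loading.
            List<EpochEntry> snapshot = new ArrayList<>(epochs);
            scheduleOnce(() -> checkpointWrites.add(snapshot));
        } finally {
            lock.writeLock().unlock();
        }
    }

    // Stand-in for KafkaScheduler.scheduleOnce: runs inline for this sketch.
    private void scheduleOnce(Runnable task) {
        task.run();
    }
}
```

This mirrors only the locking structure under discussion, not the real snapshotting or file I/O.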
…rics Signed-off-by: PoAn Yang <payang@apache.org>
showuon
left a comment
I like this solution! Thanks for the fix.
@ocadaruma could you please take a look? I prefer modifying the production code to fix the deadlock in testing, because this solution is simple and makes sense to me.

@ocadaruma please feel free to raise objections as a follow-up. It causes our CI to hang, so I'm merging it for now.
junrao
left a comment
@FrankYang0529 : Thanks for identifying the problem and submitting a PR. Left a comment below.
```diff
  // another truncateFromEnd call on log loading procedure, so it won't be a problem
- scheduler.scheduleOnce("leader-epoch-cache-flush-" + topicPartition, this::writeToFileForTruncation);
+ List<EpochEntry> entries = new ArrayList<>(epochs.values());
+ scheduler.scheduleOnce("leader-epoch-cache-flush-" + topicPartition, () -> checkpoint.writeForTruncation(entries));
```
This approach introduces a new correctness issue. With this change, it's possible for older epoch entries to overwrite the newer epoch entries in the leader epoch file. Consider the following sequence: we take a snapshot of the epoch entries here; a new epoch entry is added and is flushed to disk; the scheduler then writes the snapshot to disk. This can lead to the case where the leader epoch file doesn't contain all entries up to the recovery point.
Since the issue is only in the test, I am wondering if we could fix the test directly. For example, perhaps we could introduce a NoOpScheduler and use it in the test, since the test doesn't depend on the leader epoch entries to be actually flushed to disk.
> Since the issue is only in the test, I am wondering if we could fix the test directly. For example, perhaps we could introduce a NoOpScheduler and use it in the test, since the test doesn't depend on the leader epoch entries to be actually flushed to disk.

This is another good approach.
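A sketch of that NoOpScheduler idea. The interface shape here is illustrative only, not the actual Kafka `Scheduler` API: every scheduled task is simply dropped, so the leader-epoch checkpoint flush never runs in the test and cannot deadlock with log loading.

```java
// Hypothetical minimal scheduler interface, mirroring only the method the
// leader-epoch cache uses; not the real org.apache.kafka.server.util.Scheduler.
interface SchedulerLike {
    void scheduleOnce(String name, Runnable task);
}

// Drops every task: testLogRecoveryMetrics does not depend on the epoch
// entries actually being flushed to disk, so nothing needs to execute.
class NoOpScheduler implements SchedulerLike {
    @Override
    public void scheduleOnce(String name, Runnable task) {
        // intentionally do nothing
    }
}
```

The test would pass a `NoOpScheduler` to the log-loading code instead of the real scheduler, removing the lock interaction entirely.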
> This approach introduces a new correctness issue. With this change, it's possible for older epoch entries to overwrite the newer epoch entries in the leader epoch file. Consider the following sequence: we take a snapshot of the epoch entries here; a new epoch entry is added and is flushed to disk; the scheduler then writes the snapshot to disk. This can lead to the case where the leader epoch file doesn't contain all entries up to the recovery point.

Sorry for introducing a possible correctness issue. @FrankYang0529 and I discussed this approach offline when I noticed the deadlock, and I suggested changing the production code directly. It seems to me this PR does NOT change the execution order, because `writeToFileForTruncation` does not hold a single lock across both the "snapshot" and the "flush":
```java
private void writeToFileForTruncation() {
    // phase 1: create snapshot by holding read lock
    List<EpochEntry> entries;
    lock.readLock().lock();
    try {
        entries = new ArrayList<>(epochs.values());
    } finally {
        lock.readLock().unlock();
    }
    // phase 2: flush by holding write lock
    checkpoint.writeForTruncation(entries);
}
```

Hence, the issue you mentioned can happen even if we revert this PR. For example:

1. `writeToFileForTruncation` (run by the scheduler) takes a snapshot of the epoch entries in phase 1 (see comment in the code above)
2. a new epoch entry is added and is flushed to disk
3. `writeToFileForTruncation` (run by the scheduler) then writes the stale snapshot to disk in phase 2 (see comment in the code above)
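For illustration, that interleaving can be replayed deterministically. This hypothetical sketch uses plain integer lists in place of the real epoch cache and checkpoint file, and hard-codes the three steps in sequence:

```java
import java.util.ArrayList;
import java.util.List;

// Deterministic replay of the stale-snapshot race (illustrative names only).
class StaleSnapshotDemo {
    static List<Integer> replay() {
        List<Integer> epochs = new ArrayList<>(List.of(0, 1));
        List<Integer> checkpointFile = new ArrayList<>();

        // step 1: the scheduled task takes its snapshot (phase 1)
        List<Integer> snapshot = new ArrayList<>(epochs);

        // step 2: a new epoch entry is added and flushed to disk
        epochs.add(2);
        checkpointFile.clear();
        checkpointFile.addAll(epochs);   // file now holds [0, 1, 2]

        // step 3: the scheduled task finally flushes its stale snapshot (phase 2)
        checkpointFile.clear();
        checkpointFile.addAll(snapshot); // file regresses to [0, 1]

        return checkpointFile;
    }
}
```

The final file content is missing epoch 2 even though it was already flushed, which is exactly the "file doesn't contain all entries up to the recovery point" scenario.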
In summary, there are two follow-ups:

1. rewrite `testLogRecoveryMetrics` with a `NoOpScheduler`
2. add `writeToFileForTruncation` back, except for the "snapshot" part. For example:
```java
private void writeToFileForTruncation() {
    lock.readLock().lock();
    try {
        checkpoint.writeForTruncation(epochs.values());
    } finally {
        lock.readLock().unlock();
    }
}
```

@junrao WDYT?
The suggestion makes sense to me.
@chia7712 Hi, thank you for pointing out that the potential race issue exists even in the current code.
The follow-ups look good to me.
For follow-up 2, which moves the checkpoint flush inside the lock, one concern is that request-handler/replica-fetcher threads could block on the fsync latency (i.e. threads calling truncateFromStart/EndAsyncFlush would be blocked in the meantime).
However, it might not be a critical performance issue because:
- these methods are not called frequently (the typical call paths are truncation during fetch-response handling and deleteRecords handling), so a call is unlikely to overlap with an ongoing `writeToFileForTruncation` (scheduled by a previous call) and cause lock contention
- that overlap would only happen if the kafka-scheduler threads are very busy and task execution is delayed

Let me consider whether some optimization is possible as another follow-up.
@chia7712 : Yes, you are right that the overwriting issue was already introduced in #15993. Moving the flush call inside the read lock fixes this issue, but it defeats the original performance optimization in #14242. @ocadaruma : What's your opinion on this?
Hi all, thanks for raising the correctness issue. IMO, we can fix data correctness first and then improve performance, as long as the improvement doesn't break correctness.
I will rewrite `testLogRecoveryMetrics` with a `NoOpScheduler` first, then see whether `LeaderEpochFileCache` performance needs to be improved with its own scheduler. Thank you.
@ocadaruma : Thanks for the explanation. Yes, I agree that the async flush still gives us some perf benefits. As for the fix, the two follow-ups suggested by @chia7712 sound reasonable to me. They probably should be done in the same PR?
> They probably should be done in the same PR?

I assumed that the changes to production code need more discussion. For example:

> Yeah, could be an issue in some cases (e.g. deleteRecords is called frequently, and/or kafka-schedulers are busy) though.

The two follow-ups are orthogonal now, and hence I prefer to fix them separately to avoid unnecessary blocking.
BTW, please feel free to leave more comments on https://issues.apache.org/jira/browse/KAFKA-17167 for the fix.
Hmm, I thought the simple fix you suggested is to do the following. This will bring back the deadlock issue in the test, right?
```java
private void writeToFileForTruncation() {
    lock.readLock().lock();
    try {
        checkpoint.writeForTruncation(epochs.values());
    } finally {
        lock.readLock().unlock();
    }
}
```
> This will bring back the deadlock issue in the test, right?

Yes, it does. However, my point was: if @ocadaruma's comment ("Yeah, could be an issue in some cases (e.g. deleteRecords is called frequently, and/or kafka-schedulers are busy) though.") needs more discussion, we can improve the test before adding `writeToFileForTruncation` back to production.
At any rate, it seems we all agree to have the simple fix for now, so I filed KAFKA-17166 and KAFKA-17167.
…ics (#16614) Reviewers: Luke Chen <showuon@gmail.com>, Chia-Ping Tsai <chia7712@gmail.com>
In LogManagerTest#testLogRecoveryMetrics, add some delay before creating the second UnifiedLog to avoid the deadlock.