eventstore: fix checkpoint update race#5019

Merged
ti-chi-bot[bot] merged 8 commits into master from ldz/optimize-event-store001
May 26, 2026

Collaborator

@lidezhu lidezhu commented May 9, 2026

What problem does this PR solve?

Issue Number: close #4992

What is changed and how it works?

This pull request addresses race conditions in the eventstore's checkpoint update mechanism. By transitioning key checkpoint fields to atomic types and implementing monotonic update logic, the system now safely handles concurrent updates and prevents regression of checkpoint timestamps. Additionally, the change refines iterator boundary conditions and improves error handling for stale state updates to ensure consistent progress tracking.

Highlights

  • Atomic Checkpoint Updates: Converted dispatcherStat.checkpointTs to atomic.Uint64 to ensure thread-safe concurrent updates and prevent race conditions.
  • Monotonic Checkpoint Advancement: Implemented a CAS-based loop for subscription checkpoint updates to ensure they only advance monotonically, ignoring stale or out-of-order updates.
  • Iterator Boundary Fix: Updated the iterator span boundary check to be exclusive (<) instead of inclusive (<=) for more accurate key filtering.
  • Stale Update Handling: Added logic to log and ignore stale subscription state updates, improving system stability.
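
The monotonic advancement described above can be sketched roughly as follows. This is a minimal illustration, not the PR's code: the `dispatcher` and `subscription` types and the `candidate` and `advance` methods are hypothetical names, the standard library `sync/atomic` package is assumed, and the candidate is simplified to the minimum dispatcher checkpoint (the real code may bound it by other values, such as the resolved ts).

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// dispatcher and subscription are hypothetical stand-ins for the
// eventstore's per-dispatcher and per-subscription stats.
type dispatcher struct{ checkpointTs atomic.Uint64 }

type subscription struct {
	checkpointTs atomic.Uint64
	dispatchers  []*dispatcher
}

// candidate returns the smallest dispatcher checkpoint.
// It assumes at least one dispatcher is registered.
func (s *subscription) candidate() uint64 {
	lowest := s.dispatchers[0].checkpointTs.Load()
	for _, d := range s.dispatchers[1:] {
		if v := d.checkpointTs.Load(); v < lowest {
			lowest = v
		}
	}
	return lowest
}

// advance moves the subscription checkpoint forward with a CAS loop.
// It returns the current checkpoint and whether this call advanced it.
func (s *subscription) advance() (uint64, bool) {
	for {
		old := s.checkpointTs.Load()
		cand := s.candidate()
		if cand <= old {
			// No-op or backward candidate: leave the checkpoint alone.
			return old, false
		}
		if s.checkpointTs.CompareAndSwap(old, cand) {
			return cand, true
		}
		// Lost a race with another updater; reload and retry.
	}
}

func main() {
	d1, d2 := &dispatcher{}, &dispatcher{}
	d1.checkpointTs.Store(300)
	d2.checkpointTs.Store(500)
	s := &subscription{dispatchers: []*dispatcher{d1, d2}}
	s.checkpointTs.Store(100)

	fmt.Println(s.advance()) // 300 true

	d1.checkpointTs.Store(200) // simulate a stale dispatcher value
	fmt.Println(s.advance())   // 300 false
}
```

Only the caller whose CAS succeeds sees `true`, so any follow-up work (such as emitting an update) can be tied to an actual advance.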

Check List

Tests

  • Unit test
  • Integration test
  • Manual test (add detailed scripts or steps below)
  • No code

Questions

Will it cause performance regression or break compatibility?
Do you need to update user documentation, design documentation or monitoring documentation?

Release note

Please refer to [Release Notes Language Style Guide](https://pingcap.github.io/tidb-dev-guide/contribute-to-tidb/release-notes-style-guide.html) to write a quality release note.

If you don't think this PR needs a release note then fill it with `None`.

Summary by CodeRabbit

  • Bug Fixes

    • Stale checkpoint/resolved updates are now logged and ignored instead of causing panics, improving stability.
    • Checkpoint reads/writes made concurrency-safe to reduce races.
  • Changes

    • Iterator span boundary handling changed to use exclusive end-key comparison for more accurate filtering.
    • Subscription checkpoint advancement and GC scheduling now behave more reliably and are reported consistently.
  • Tests

    • Added a concurrency test covering checkpoint update behavior.
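
As a rough sketch of the "log and ignore" behavior summarized above, the snippet below applies an update only when neither timestamp decreases. The `subscriptionState` type and `applyUpdate` function are hypothetical and exist only for illustration; the actual logic lives in the uploader and may differ in detail.

```go
package main

import "log"

// subscriptionState is a hypothetical stand-in for the uploaded
// per-subscription state.
type subscriptionState struct {
	checkpointTs uint64
	resolvedTs   uint64
}

// applyUpdate applies update to cur only if neither timestamp decreases.
// A stale update is logged and ignored instead of triggering a panic.
func applyUpdate(cur *subscriptionState, update subscriptionState) bool {
	if update.checkpointTs < cur.checkpointTs || update.resolvedTs < cur.resolvedTs {
		log.Printf("ignore stale subscription update: current=%+v update=%+v", *cur, update)
		return false
	}
	*cur = update
	return true
}

func main() {
	cur := subscriptionState{checkpointTs: 100, resolvedTs: 200}

	// The checkpoint would move backward, so this update is ignored.
	applyUpdate(&cur, subscriptionState{checkpointTs: 90, resolvedTs: 250})

	// Both timestamps are non-decreasing, so this update is applied.
	applyUpdate(&cur, subscriptionState{checkpointTs: 120, resolvedTs: 250})

	log.Printf("final state: %+v", cur)
}
```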

ti-chi-bot Bot commented May 9, 2026

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@ti-chi-bot ti-chi-bot Bot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. release-note Denotes a PR that will be considered when it comes time to generate release notes. labels May 9, 2026
Contributor

coderabbitai Bot commented May 9, 2026

Walkthrough

Dispatcher checkpoint becomes atomic; dispatcher updates use monotonic compare-and-increase. Subscription checkpoint advancement is rewritten as a CAS loop reading dispatcher.Load(), emitting updates only on successful CAS. Iterator EndKey is now exclusive. Uploader ignores stale subscription updates. Tests add a concurrency case and use comparable keys.

Changes

Subscription Checkpoint Race Condition and Iterator Boundary Fix

| Layer | File(s) | Summary |
|---|---|---|
| Atomic Checkpoint Field | logservice/eventstore/event_store.go | dispatcherStat.checkpointTs is converted from uint64 to atomic.Uint64. |
| Dispatcher Initialization | logservice/eventstore/event_store.go | RegisterDispatcher initializes the atomic checkpoint via Store(startTs). |
| Checkpoint Read/Write Operations | logservice/eventstore/event_store.go | UpdateDispatcherCheckpointTs advances dispatcher checkpoint with a monotonic compare-and-increase and reads dispatcher checkpoints via Load() when computing subscription candidate checkpoints. |
| Subscription Checkpoint CAS Loop | logservice/eventstore/event_store.go | subStat.checkpointTs advancement uses a CAS loop that exits on no-op or backward candidate; on CAS success it enqueues GC conditionally and emits SubscriptionChangeTypeUpdate with new checkpoint and current resolved ts. |
| Stale Update Handling | logservice/eventstore/event_store.go | uploadStatePeriodically logs and ignores stale SubscriptionChangeTypeUpdate entries (decreasing checkpoint/resolved ts) and applies only non-decreasing updates. |
| Iterator Boundary Filtering | logservice/eventstore/event_store.go | eventStoreIter.Next changes EndKey span check from inclusive (<=) to exclusive (<). |
| Tests: concurrency and span keys | logservice/eventstore/event_store_test.go | Adds TestEventStoreUpdateCheckpointTsConcurrentStaleUpdates to exercise concurrent stale updates; TestEventStoreIter_NextWithFiltering now uses common.ToComparableKey(...) for span boundaries. |
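
To illustrate the iterator boundary change, here is a minimal sketch of an exclusive end-key check. The `inSpan` helper is hypothetical, and the inclusive start bound is an assumption; only the switch from `<=` to `<` on the end key comes from the summary above.

```go
package main

import (
	"bytes"
	"fmt"
)

// inSpan reports whether key lies in the half-open range [start, end).
// The end key itself is excluded, matching an exclusive (<) comparison.
func inSpan(key, start, end []byte) bool {
	return bytes.Compare(key, start) >= 0 && bytes.Compare(key, end) < 0
}

func main() {
	start, end := []byte("b"), []byte("d")
	fmt.Println(inSpan([]byte("b"), start, end)) // true
	fmt.Println(inSpan([]byte("c"), start, end)) // true
	fmt.Println(inSpan([]byte("d"), start, end)) // false: end key is excluded
}
```

With an inclusive comparison the last call would return true, which would let a key equal to the end key leak into the span.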

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

Possibly related PRs

  • pingcap/ticdc#4953: Related iterator/end-key boundary changes and Pebble scan bound refinements in event_store.go.

Suggested labels

lgtm

Suggested reviewers

  • asddongmen
  • hongyunyan
  • flowbehappy

Poem

🐰 Atomics hum in line,
CAS hops steady, not twice,
Checkpoints climb and never slip,
Boundaries trim the slice,
Tests thump softly, carrot-ice.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title 'eventstore: fix checkpoint update race' directly and concisely describes the main change: fixing a race condition in checkpoint updates. |
| Linked Issues check | ✅ Passed | The PR addresses all primary coding objectives from issue #4992: converting dispatcherStat.checkpointTs to atomic.Uint64, implementing CAS-based monotonic updates, and improving stale update handling. |
| Out of Scope Changes check | ✅ Passed | All changes are within scope: atomic checkpoint handling, CAS-based monotonic updates, iterator boundary fix, and stale update logging directly address the race condition issue. |

Warning

Tools execution failed with the following error:

Failed to run tools: 13 INTERNAL: Received RST_STREAM with code 2 (Internal server error)


@ti-chi-bot ti-chi-bot Bot added the size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. label May 9, 2026

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request optimizes the event store by implementing SST file filtering based on transaction commit timestamp ranges, allowing Pebble to skip irrelevant files during scans. It refactors key encoding and decoding logic, introduces shared Pebble caches with proper lifecycle management, and improves concurrency by using atomic types for checkpoint timestamps. Additionally, it refines the handling of subscription state updates to ensure monotonic advancement. Feedback was provided to improve the clarity of a log message regarding stale state updates.

Comment thread logservice/eventstore/event_store.go
@lidezhu lidezhu force-pushed the ldz/optimize-event-store001 branch from 926bb42 to 4bf750b Compare May 9, 2026 15:50
@ti-chi-bot ti-chi-bot Bot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. labels May 9, 2026
@lidezhu lidezhu marked this pull request as ready for review May 9, 2026 15:54
@ti-chi-bot ti-chi-bot Bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label May 9, 2026
Contributor

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@logservice/eventstore/event_store.go`:
- Around line 715-716: The dispatcher checkpoint write at
dispatcherStat.checkpointTs.Store(checkpointTs) can regress a newer
per-dispatcher value; change the update to only advance the per-dispatcher
checkpoint if checkpointTs is greater than the currently stored value by reading
dispatcherStat.checkpointTs (via its Load or atomic read) and using an atomic
compare-and-swap loop to store the new checkpoint only when it is strictly
larger (i.e., retry CAS until success or current >= checkpointTs) so
dispatcherStat.checkpointTs never moves backward and min recomputation reflects
true progress.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: fd2ec042-f448-4c1a-ba09-be650be4b428

📥 Commits

Reviewing files that changed from the base of the PR and between fe7febb and 4bf750b.

📒 Files selected for processing (2)
  • logservice/eventstore/event_store.go
  • logservice/eventstore/event_store_test.go

Comment thread logservice/eventstore/event_store.go Outdated
…store001

# Conflicts:
#	logservice/eventstore/event_store.go
Contributor

@coderabbitai coderabbitai Bot left a comment

♻️ Duplicate comments (1)
logservice/eventstore/event_store.go (1)

727-727: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Keep the per-dispatcher checkpoint monotonic too.

The CAS on subStat.checkpointTs prevents the subscription value from moving backward, but Line 727 can still overwrite a newer dispatcher checkpoint with a stale one. After that, the next min recomputation can stay pinned below real progress, so GC and uploaded subscription state stop advancing until this dispatcher reports again.

Suggested fix
-	dispatcherStat.checkpointTs.Store(checkpointTs)
+	util.CompareAndMonotonicIncrease(&dispatcherStat.checkpointTs, checkpointTs)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@logservice/eventstore/event_store.go` at line 727, The dispatcher checkpoint
write at dispatcherStat.checkpointTs.Store(checkpointTs) can stomp a newer
value; change it to a monotonic CAS update: load the current dispatcher
checkpoint (dispatcherStat.checkpointTs), and only attempt an atomic
compare-and-swap to set it to checkpointTs if checkpointTs is greater than the
loaded value (repeat as needed to handle races). Ensure the same monotonicity
semantics as subStat.checkpointTs (i.e., never move backward) so that
dispatcherStat.checkpointTs only advances when checkpointTs > current.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 1a62be06-10f1-409a-ab7d-1830a0b00d41

📥 Commits

Reviewing files that changed from the base of the PR and between 4bf750b and d255d01.

📒 Files selected for processing (2)
  • logservice/eventstore/event_store.go
  • logservice/eventstore/event_store_test.go

Collaborator Author

lidezhu commented May 26, 2026

/gemini review

Collaborator Author

lidezhu commented May 26, 2026

@coderabbitai review

Collaborator Author

lidezhu commented May 26, 2026

/gemini summary

Contributor

coderabbitai Bot commented May 26, 2026

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request refactors checkpointTs in dispatcherStat to use atomic.Uint64 to prevent race conditions during concurrent updates, and updates related logic to use atomic operations. It also replaces a panic with a warning when handling stale subscription state updates and fixes a boundary check in eventStoreIter.Next to use an exclusive upper bound. A new concurrent test is added to verify these changes. The reviewer suggested using CompareAndMonotonicIncrease instead of CompareAndIncrease to avoid executing downstream logic unnecessarily when the checkpoint timestamp does not actually advance.

Comment thread logservice/eventstore/event_store.go

Summary of Changes

This pull request addresses race conditions in the eventstore's checkpoint update mechanism. By transitioning key checkpoint fields to atomic types and implementing monotonic update logic, the system now safely handles concurrent updates and prevents regression of checkpoint timestamps. Additionally, the change refines iterator boundary conditions and improves error handling for stale state updates to ensure consistent progress tracking.

Highlights

  • Atomic Checkpoint Updates: Converted dispatcherStat.checkpointTs to atomic.Uint64 to ensure thread-safe concurrent updates and prevent race conditions.
  • Monotonic Checkpoint Advancement: Implemented a CAS-based loop for subscription checkpoint updates to ensure they only advance monotonically, ignoring stale or out-of-order updates.
  • Iterator Boundary Fix: Updated the iterator span boundary check to be exclusive (<) instead of inclusive (<=) for more accurate key filtering.
  • Stale Update Handling: Added logic to log and ignore stale subscription state updates, improving system stability.

Activity
  • Initial PR creation and draft status.
  • Automated review triggered by CodeRabbit, identifying potential race conditions in dispatcher checkpoint updates.
  • Author implemented atomic operations and CAS loops to address race conditions.
  • Reviewer suggested refining the log message for stale updates and optimizing the checkpoint advancement logic.
  • Author updated the implementation to address review feedback, including adding a concurrent stale update test case.

Comment thread logservice/eventstore/event_store.go
@ti-chi-bot ti-chi-bot Bot added needs-1-more-lgtm Indicates a PR needs 1 more LGTM. approved labels May 26, 2026
Contributor

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@logservice/eventstore/event_store_test.go`:
- Around line 958-965: After wg.Wait() add a deterministic assertion that
subStat.checkpointTs.Load() equals 100 before advancing resolvedTs to ensure
stale concurrent updates didn't move the subscription checkpoint; specifically,
right after close(startCh) and wg.Wait(), assert subStat.checkpointTs.Load() ==
100, then proceed to call subStat.resolvedTs.Store(900) and
store.UpdateDispatcherCheckpointTs(dispatcherID2, 900).

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 18d7edca-a0d8-4fd0-acdb-257f2b3492da

📥 Commits

Reviewing files that changed from the base of the PR and between 1daaf35 and 5d1f181.

📒 Files selected for processing (2)
  • logservice/eventstore/event_store.go
  • logservice/eventstore/event_store_test.go

Comment on lines +958 to +965
close(startCh)
wg.Wait()

require.Equal(t, uint64(900), stat1.checkpointTs.Load())

subStat.resolvedTs.Store(900)
store.UpdateDispatcherCheckpointTs(dispatcherID2, 900)
require.Equal(t, uint64(900), subStat.checkpointTs.Load())
Contributor

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Assert stale concurrent updates did not advance the subscription checkpoint.

Right after wg.Wait(), add an assertion that subStat.checkpointTs is still 100 before resolvedTs is advanced. This closes a gap where premature checkpoint movement could regress without being caught.

Suggested patch
 	close(startCh)
 	wg.Wait()
 
 	require.Equal(t, uint64(900), stat1.checkpointTs.Load())
+	require.Equal(t, uint64(100), subStat.checkpointTs.Load())
 
 	subStat.resolvedTs.Store(900)
 	store.UpdateDispatcherCheckpointTs(dispatcherID2, 900)
 	require.Equal(t, uint64(900), subStat.checkpointTs.Load())

As per coding guidelines, "Prefer focused deterministic tests; see docs/agents/testing.md before adding or changing tests."


if newCheckpointTs < oldCheckpointTs {
return
}
if !subStat.checkpointTs.CompareAndSwap(oldCheckpointTs, newCheckpointTs) {
Collaborator

Can this cause CDC to get stuck?

Collaborator Author

As the checkpoint only increases, the loop will either succeed or terminate when the condition newCheckpointTs <= oldCheckpointTs is met.

Collaborator

It seems impossible for the swap to fail here. I think a log should be added.

@ti-chi-bot ti-chi-bot Bot added the lgtm label May 26, 2026

ti-chi-bot Bot commented May 26, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: 3AceShowHand, wk989898

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:
  • OWNERS [3AceShowHand,wk989898]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@ti-chi-bot ti-chi-bot Bot removed the needs-1-more-lgtm Indicates a PR needs 1 more LGTM. label May 26, 2026

ti-chi-bot Bot commented May 26, 2026

[LGTM Timeline notifier]

Timeline:

  • 2026-05-26 02:20:06.603233649 +0000 UTC m=+318676.573398709: ☑️ agreed by 3AceShowHand.
  • 2026-05-26 07:14:34.999756597 +0000 UTC m=+336344.969921654: ☑️ agreed by wk989898.

@ti-chi-bot ti-chi-bot Bot merged commit 99b6a52 into master May 26, 2026
25 checks passed
@ti-chi-bot ti-chi-bot Bot deleted the ldz/optimize-event-store001 branch May 26, 2026 08:20
Labels

approved lgtm release-note Denotes a PR that will be considered when it comes time to generate release notes. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Event store has a race around subscription checkpoint updates

3 participants