feat(service): Add changelog heartbeats and continuous recovery loop by jan-auer · Pull Request #414 · getsentry/objectstore

jan-auer · 2026-03-27T10:42:55Z

The recovery loop now runs continuously rather than as a one-shot scan at startup. On each iteration it claims and cleans up stale entries, then immediately loops if there were entries (in case there are more), or waits one refresh interval before polling again.
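
The pacing rule described above can be sketched as follows. This is a minimal illustration, not the code from this PR: `PassOutcome`, `delay_after`, and `run_passes` are hypothetical names, the sleep function is injected so the sketch stays synchronous, and the loop is bounded by `max_passes` only so it terminates in a test (the real loop is described as infinite).

```rust
use std::time::Duration;

/// Hypothetical result of one recovery pass: how many stale entries
/// were claimed and cleaned up.
pub struct PassOutcome {
    pub cleaned: usize,
}

/// Decide how long to wait before the next pass: loop again immediately
/// if entries were found (there may be more), otherwise wait one
/// refresh interval.
pub fn delay_after(outcome: &PassOutcome, refresh_interval: Duration) -> Duration {
    if outcome.cleaned > 0 {
        Duration::ZERO
    } else {
        refresh_interval
    }
}

/// Run `max_passes` recovery passes, sleeping between them as needed.
/// `pass` and `sleep` are injected stand-ins for the real scan and timer.
pub fn run_passes<F, S>(mut pass: F, mut sleep: S, refresh_interval: Duration, max_passes: usize)
where
    F: FnMut() -> PassOutcome,
    S: FnMut(Duration),
{
    for _ in 0..max_passes {
        let outcome = pass();
        let delay = delay_after(&outcome, refresh_interval);
        if !delay.is_zero() {
            sleep(delay);
        }
    }
}
```

Keeping the delay decision in its own function makes the "loop immediately vs. wait" rule testable without a runtime or a real backend.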

A concurrent recovery instance could read HV state before the in-progress CAS commits, see the old tombstone, and delete the newly uploaded LT blob. To prevent this, each in-flight change runs a background heartbeat that re-records its log entry every 10s. The staleness threshold is 3× that interval, so any live operation keeps its entry out of recovery's reach.
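
As a worked example of the timing above: with a 10s heartbeat and a threshold of 3× that interval, an entry becomes eligible for recovery only after 30s without a refresh, so a live operation can miss two consecutive heartbeats and still stay protected. The constant and function names below are hypothetical, and treating an age of exactly 30s as stale is an assumption.

```rust
use std::time::Duration;

/// Heartbeat period stated in the PR description.
pub const HEARTBEAT_INTERVAL_SECS: u64 = 10;

/// Staleness threshold: 3x the heartbeat interval, i.e. 30s.
pub const STALE_AFTER: Duration = Duration::from_secs(3 * HEARTBEAT_INTERVAL_SECS);

/// Returns true if an entry whose timestamp is `age` old may be claimed
/// by recovery. The inclusive comparison is an assumption of this sketch.
pub fn is_stale(age: Duration) -> bool {
    age >= STALE_AFTER
}
```
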

Before writing the tombstone after an LT upload, the operation checks whether its heartbeat is still alive (`is_valid()`) and aborts if not. This avoids inconsistencies if another instance picks up this change for recovery.

When the heartbeat fails, we skip spawning local cleanup on Drop and leave the log entry for recovery to claim. Attempting cleanup locally during a Bigtable outage would race with recovery on other instances, which would keep rediscovering the same entries and accumulating redundant work.
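
The liveness gate and the Drop behavior can be illustrated with a simplified, synchronous sketch. Everything here except the `is_valid()` name is an assumption: `ChangeGuard`, `commit`, and the `cleanup_spawned` flag are hypothetical, an `AtomicBool` stands in for whatever state the heartbeat task actually maintains, and Drop always schedules cleanup while the heartbeat is alive, which ignores the case of a change that completed successfully.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;

/// Hypothetical stand-in for the guard of an in-flight change.
pub struct ChangeGuard {
    /// Set to false by the heartbeat task when re-recording the entry fails.
    alive: Arc<AtomicBool>,
    /// Records whether Drop scheduled local cleanup (illustration only).
    cleanup_spawned: Arc<AtomicBool>,
}

impl ChangeGuard {
    pub fn new(alive: Arc<AtomicBool>, cleanup_spawned: Arc<AtomicBool>) -> Self {
        Self { alive, cleanup_spawned }
    }

    /// True while this instance still holds the change entry.
    pub fn is_valid(&self) -> bool {
        self.alive.load(Ordering::Acquire)
    }

    /// Run the tombstone write only if the heartbeat is still alive;
    /// otherwise abort and leave the entry for recovery.
    pub fn commit<F: FnOnce()>(&self, write_tombstone: F) -> Result<(), &'static str> {
        if !self.is_valid() {
            return Err("heartbeat lost; leaving entry for recovery");
        }
        write_tombstone();
        Ok(())
    }
}

impl Drop for ChangeGuard {
    fn drop(&mut self) {
        // Skip local cleanup once the heartbeat has failed, so it does
        // not race with recovery running on another instance.
        if self.is_valid() {
            self.cleanup_spawned.store(true, Ordering::Release);
        }
    }
}
```

In this sketch a failed heartbeat has two effects: `commit` returns an error without running the write, and dropping the guard leaves `cleanup_spawned` unset.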

- Replace one-shot `recover()` with an infinite loop that polls for stale entries, processes them, then waits before polling again.
- Add a background heartbeat task per in-flight change that re-records the entry every 10s, keeping its timestamp fresh and preventing recovery from claiming it prematurely.
- Gate the HV CAS on heartbeat liveness: abort with an error if the heartbeat has stopped, preventing a race where a recovering instance reads stale HV state and deletes a just-uploaded LT blob.
- Skip cleanup on Drop when the heartbeat has already failed, deferring to recovery rather than racing concurrent cleanup during outages.

jan-auer · 2026-03-27T10:45:58Z

objectstore-service/src/backend/changelog.rs

    }
+
+    /// Returns `true` if the change entry is still held by this instance.
+    fn is_valid(&self) -> bool {


I don't like this and would like to try a tokio watch based alternative.

fix(service): Fix broken intra-doc link in recover() doc comment

7821e6f

jan-auer wants to merge 2 commits into main from worktree-recover-loop
