feat(service): Add changelog heartbeats and continuous recovery loop #414

Draft
jan-auer wants to merge 2 commits into main from worktree-recover-loop

@jan-auer jan-auer commented Mar 27, 2026

The recovery loop now runs continuously rather than as a one-shot scan at startup. On each iteration it claims and cleans up stale entries. If it found any, it polls again immediately in case more are pending; otherwise it waits one refresh interval before the next poll.
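
The control flow described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `NextStep`, `next_step`, and `run_recovery` are hypothetical names, the scan and sleep steps are passed in as closures, and the loop is bounded by `passes` so it can be exercised deterministically (the real loop runs forever).

```rust
use std::time::Duration;

/// What the loop does after one recovery pass (hypothetical type).
#[derive(Debug, PartialEq, Eq)]
enum NextStep {
    /// Stale entries were processed; poll again right away in case there are more.
    PollImmediately,
    /// Nothing was stale; wait one refresh interval before the next poll.
    Wait(Duration),
}

/// Decides the next step from the number of entries processed in this pass.
fn next_step(processed: usize, refresh_interval: Duration) -> NextStep {
    if processed > 0 {
        NextStep::PollImmediately
    } else {
        NextStep::Wait(refresh_interval)
    }
}

/// Runs `passes` iterations of the polling loop.
///
/// `scan` claims and cleans up stale entries and returns how many it handled.
/// `sleep` is injected so that callers decide how waiting is implemented.
fn run_recovery<F, S>(mut scan: F, mut sleep: S, refresh_interval: Duration, passes: usize)
where
    F: FnMut() -> usize,
    S: FnMut(Duration),
{
    for _ in 0..passes {
        let processed = scan();
        if let NextStep::Wait(delay) = next_step(processed, refresh_interval) {
            sleep(delay);
        }
    }
}
```

Keeping the decision in a separate pure function makes the "loop immediately versus wait" rule easy to test without a real clock.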

  • A concurrent recovery instance could read HV state before the in-progress CAS commits, see the old tombstone, and delete the newly uploaded LT blob. To prevent this, each in-flight change runs a background heartbeat that re-records its log entry every 10s. The staleness threshold is 3× that interval, so any live operation keeps its entry out of recovery's reach.
  • Before writing the tombstone after an LT upload, the operation checks whether its heartbeat is still alive (`is_valid()`) and aborts if not. This avoids inconsistencies if another instance picks up this change for recovery.
  • When the heartbeat fails, we skip spawning local cleanup on Drop and leave the log entry for recovery to claim. Attempting cleanup locally during a Bigtable outage would race with recovery on other instances, which would keep rediscovering the same entries and accumulating redundant work.
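
The timing relationship between the heartbeat and the staleness threshold can be illustrated with a small sketch. The 10s interval and the 3× factor come from the description above; the constant names, the `is_stale` helper, the strict `>` comparison, and the use of plain `Duration` values as timestamps are assumptions made for this example.

```rust
use std::time::Duration;

/// Heartbeat interval from the PR description: entries are re-recorded every 10s.
const HEARTBEAT_SECS: u64 = 10;
const HEARTBEAT_INTERVAL: Duration = Duration::from_secs(HEARTBEAT_SECS);

/// Staleness threshold: 3x the heartbeat interval, i.e. 30s.
const STALE_THRESHOLD: Duration = Duration::from_secs(3 * HEARTBEAT_SECS);

/// Returns `true` if recovery may claim an entry last recorded at `last_recorded`.
///
/// Timestamps are durations since an arbitrary shared epoch (an assumption of
/// this sketch). A strict comparison is used; the real boundary may differ.
fn is_stale(last_recorded: Duration, now: Duration) -> bool {
    now.saturating_sub(last_recorded) > STALE_THRESHOLD
}
```

With these values, a live operation can miss two consecutive heartbeats and still stay below the threshold, since its entry is then at most about 30s old.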

- Replace one-shot recover() with an infinite loop that polls for stale
  entries, processes them, then waits before polling again.
- Add a background heartbeat task per in-flight change that re-records
  the entry every 10s, keeping its timestamp fresh and preventing
  recovery from claiming it prematurely.
- Gate the HV CAS on heartbeat liveness: abort with an error if the
  heartbeat has stopped, preventing a race where a recovering instance
  reads stale HV state and deletes a just-uploaded LT blob.
- Skip cleanup on Drop when the heartbeat has already failed, deferring
  to recovery rather than racing concurrent cleanup during outages.
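
The liveness gate and the Drop behaviour can be sketched together as below. Only `is_valid` appears in this PR; `ChangeGuard`, `commit_tombstone`, and the `cleanups_spawned` counter are hypothetical stand-ins, the heartbeat is modelled as a shared flag rather than a background task, and the sketch does not model a change that completes successfully.

```rust
use std::sync::atomic::{AtomicBool, AtomicUsize, Ordering};
use std::sync::Arc;

/// Hypothetical guard for one in-flight change.
struct ChangeGuard {
    /// Set to `false` by the heartbeat once it fails to re-record the entry.
    heartbeat_alive: Arc<AtomicBool>,
    /// Stand-in for spawning a local cleanup task; counts how often it would run.
    cleanups_spawned: Arc<AtomicUsize>,
}

impl ChangeGuard {
    /// Returns `true` if the change entry is still held by this instance.
    fn is_valid(&self) -> bool {
        self.heartbeat_alive.load(Ordering::Acquire)
    }

    /// Gate for the tombstone write: abort if the heartbeat has stopped,
    /// because another instance may already be recovering this change.
    fn commit_tombstone(&self) -> Result<(), &'static str> {
        if !self.is_valid() {
            return Err("heartbeat stopped; aborting before tombstone write");
        }
        // The real operation would perform the HV CAS here.
        Ok(())
    }
}

impl Drop for ChangeGuard {
    fn drop(&mut self) {
        // Clean up locally only while the heartbeat is alive. Otherwise leave
        // the log entry in place for recovery to claim.
        if self.is_valid() {
            self.cleanups_spawned.fetch_add(1, Ordering::Relaxed);
        }
    }
}
```

Reading the same flag in both places keeps the two decisions consistent: once the heartbeat is considered dead, the operation neither commits nor cleans up on its own.
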
}

/// Returns `true` if the change entry is still held by this instance.
fn is_valid(&self) -> bool {

@jan-auer jan-auer Mar 27, 2026

I don't like this and would like to try a tokio `watch`-based alternative.