feat(service): Implement BigTable-backed ChangeLog for tiered storage crash recovery#413

Draft
jan-auer wants to merge 3 commits into main from worktree-changelog-bigtable

Conversation

@jan-auer
Member

TieredStorage writes span two backends — a mutation touches both the high-volume (BigTable) and long-term (GCS) tiers. If a process crashes mid-write, neither tier is aware of what the other may have partially received, leaving orphaned blobs with no owning tombstone.

This PR introduces a write-ahead changelog that records each in-flight mutation before work begins and removes it on clean completion. On startup, TieredStorage scans for stale entries (those belonging to crashed operations) and re-runs the cleanup step to converge the system back to a consistent state.

  • Claiming is race-safe via BigTable CheckAndMutateRow: only the instance that wins the CAS on a stale entry processes it; concurrent recovery instances skip it.
  • Lazy streaming: the scan uses a server-streaming RPC so rows are claimed as they arrive rather than buffered up front.
  • Explicit config: changelog backend is a new ChangeLogConfig: Noop | BigTable field on TieredStorageConfig, defaulting to Noop. Existing configs without the field are unaffected.
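The new config field might look roughly like the following sketch. The exact types are not shown in this PR, so the names here (`ChangeLogConfig`, `change_log`, `TieredStorageConfig`'s other fields) are assumptions; the key point it illustrates is that the enum defaults to `Noop`, so configs written before this change keep their existing behavior.

```rust
/// Hypothetical sketch of the changelog backend selector described above.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Default)]
pub enum ChangeLogConfig {
    /// No write-ahead changelog; mutations are not recorded (prior behavior).
    #[default]
    Noop,
    /// Record in-flight mutations in a BigTable-backed changelog.
    BigTable,
}

#[allow(dead_code)]
pub struct TieredStorageConfig {
    // ... other tiered-storage fields elided ...
    pub change_log: ChangeLogConfig,
}

fn main() {
    // Configs that omit the field fall back to Noop, so existing
    // deployments are unaffected until an operator opts in.
    assert_eq!(ChangeLogConfig::default(), ChangeLogConfig::Noop);
    println!("default changelog backend: {:?}", ChangeLogConfig::default());
}
```

Decoupling this from the high-volume backend config means an operator can run BigTable for data while leaving the changelog off, or vice versa.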

Not yet implemented: the heartbeat-bump mechanism that refreshes active-operation timestamps during long writes. Without it the staleness threshold (30 s) must be set conservatively. The bump task is the intended follow-up.

jan-auer and others added 3 commits March 26, 2026 13:47
Add a BigTable implementation of the ChangeLog trait so that in-flight
mutations in TieredStorage are persisted for crash recovery.

The ChangeLog records each mutation (object ID, old and new storage
locations) before work begins and removes it on clean completion. On
startup, TieredStorage scans for stale entries — those whose cell
timestamp is older than a staleness threshold — and re-runs the
cleanup to converge any partially-completed state.

Stale entries are claimed via a BigTable CheckAndMutateRow CAS: an
entry is only claimed if its timestamp is still below the threshold,
so concurrent recovery instances cannot process the same entry twice.
Rows stream lazily from BigTable as the caller polls, with CAS claims
interleaved with row delivery.
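The claim-if-still-stale semantics can be modeled without BigTable at all. The following is a toy in-memory sketch, not the PR's implementation: a `HashMap` of entry timestamps stands in for the table, and `try_claim` mirrors the `CheckAndMutateRow` predicate ("timestamp still below the threshold") plus its mutation (bump the timestamp), so a racing second claimer sees a fresh timestamp and backs off.

```rust
use std::collections::HashMap;

/// Toy in-memory model of the CheckAndMutateRow claim described above.
struct ChangeLog {
    /// entry key -> last-recorded timestamp (seconds).
    entries: HashMap<String, u64>,
}

impl ChangeLog {
    /// Claim `key` only if its timestamp is still older than `stale_before`.
    /// Claiming bumps the timestamp to `now` (the "mutate" half of
    /// check-and-mutate), so a concurrent recovery instance running the
    /// same check afterwards sees a fresh timestamp and skips the entry.
    fn try_claim(&mut self, key: &str, stale_before: u64, now: u64) -> bool {
        match self.entries.get_mut(key) {
            Some(ts) if *ts < stale_before => {
                *ts = now;
                true
            }
            _ => false,
        }
    }
}

fn main() {
    let mut log = ChangeLog {
        entries: HashMap::from([("obj-1".to_string(), 100)]),
    };
    // The first recovery instance wins the claim...
    assert!(log.try_claim("obj-1", 130, 200));
    // ...and a second instance racing on the same entry loses.
    assert!(!log.try_claim("obj-1", 130, 200));
    println!("claimed once, second claim rejected");
}
```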

The changelog backend is now explicitly configured via a new
ChangeLogConfig enum on TieredStorageConfig (Noop | BigTable),
defaulting to Noop. This decouples the changelog from the HV backend
config and lets operators opt in independently.

The heartbeat-bump mechanism that would keep active-operation
timestamps fresh is not yet implemented; the staleness threshold is
currently fixed at 30 seconds.

Co-Authored-By: Claude <noreply@anthropic.com>
…ounded Vec

- Add a heartbeat task spawned per in-flight change that re-records the entry
  every 10s, keeping it alive in durable storage while cleanup is in progress.
  The heartbeat is aborted before log removal to prevent re-recording a deleted
  entry.

- Rewrite ChangeManager::recover() as an infinite polling loop: exponential
  backoff on scan errors, fixed wait on empty results, immediate re-poll after
  processing entries.

- Replace the lazy ChangeStream (BoxStream) with an eager bounded Vec. The scan
  now accepts a max count, passed as rows_limit to BigTable's ReadRows RPC, so
  the streaming connection is closed before cleanup begins. This avoids holding
  a long-lived gRPC stream open across slow sequential cleanup.
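The polling policy in the rewritten recover() loop can be captured as a small pure function. This is a sketch under assumed constants (base/max backoff and empty-scan wait are not specified in the PR beyond the behavior): exponential backoff on scan errors, a fixed wait on empty results, and an immediate re-poll after entries were processed.

```rust
use std::time::Duration;

/// Outcome of one changelog scan, as described in the commit message.
enum ScanOutcome {
    Error,
    Empty,
    Processed,
}

// Assumed tuning values; the PR only fixes the 30 s staleness threshold.
const BASE_BACKOFF: Duration = Duration::from_secs(1);
const MAX_BACKOFF: Duration = Duration::from_secs(60);
const EMPTY_WAIT: Duration = Duration::from_secs(10);

/// Returns (wait before the next poll, updated consecutive-error count).
fn next_poll_delay(outcome: ScanOutcome, errors: u32) -> (Duration, u32) {
    match outcome {
        ScanOutcome::Error => {
            let errors = errors + 1;
            // Double the delay per consecutive error, capped at MAX_BACKOFF.
            let backoff = BASE_BACKOFF
                .saturating_mul(1u32 << errors.min(6))
                .min(MAX_BACKOFF);
            (backoff, errors)
        }
        // Nothing stale right now: sleep a fixed interval, reset the counter.
        ScanOutcome::Empty => (EMPTY_WAIT, 0),
        // Entries were processed: re-poll immediately in case more are stale.
        ScanOutcome::Processed => (Duration::ZERO, 0),
    }
}

fn main() {
    assert_eq!(next_poll_delay(ScanOutcome::Processed, 3).0, Duration::ZERO);
    assert_eq!(next_poll_delay(ScanOutcome::Empty, 3), (EMPTY_WAIT, 0));
    let (d1, e1) = next_poll_delay(ScanOutcome::Error, 0);
    let (d2, _) = next_poll_delay(ScanOutcome::Error, e1);
    assert!(d2 > d1); // backoff grows with consecutive errors
    println!("backoff after 1 error: {:?}, after 2: {:?}", d1, d2);
}
```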
If changelog heartbeats fail consistently, the write-ahead log entry will
expire and be claimed by recovery on another instance. Scheduling local
cleanup in addition would be redundant and add load during incidents where
BigTable is already degraded — recovery runs cleanup sequentially and in
a controlled way.

- Heartbeat now returns on the first error (after internal retries); the
  tracker token is released and the JoinHandle becomes finished.
- ChangeGuard::is_valid() exposes whether the heartbeat is still running.
- put_long_term checks is_valid() before the HV CAS and aborts if the
  heartbeat has stopped, preventing the race where recovery reads a
  pre-CAS tombstone and deletes the newly uploaded LT blob.
- ChangeGuard::drop skips scheduling local cleanup when is_valid() is
  false, leaving the log entry for recovery to handle.
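The guard behavior in the list above can be sketched as follows. This is a toy model, not the PR's code: a shared `AtomicBool` stands in for "the heartbeat task is still running" (the real code inspects a `JoinHandle`), and the field names are assumptions. It shows the two decisions the list describes: abort before the CAS when `is_valid()` is false, and skip local cleanup on drop so recovery handles the entry instead.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;

/// Toy model of the ChangeGuard behavior listed above (names assumed).
struct ChangeGuard {
    /// Set to false when the heartbeat task stops (after internal retries).
    heartbeat_alive: Arc<AtomicBool>,
}

impl ChangeGuard {
    /// True while the heartbeat is still refreshing the log entry.
    fn is_valid(&self) -> bool {
        self.heartbeat_alive.load(Ordering::Acquire)
    }
}

impl Drop for ChangeGuard {
    fn drop(&mut self) {
        if self.is_valid() {
            // Clean path: local cleanup would be scheduled here.
            println!("scheduling local cleanup");
        } else {
            // Heartbeat died: leave the log entry for recovery on another
            // instance rather than racing it with local cleanup.
            println!("leaving entry for recovery");
        }
    }
}

fn main() {
    let alive = Arc::new(AtomicBool::new(true));
    let guard = ChangeGuard { heartbeat_alive: alive.clone() };
    // Simulate the heartbeat failing mid-operation.
    alive.store(false, Ordering::Release);
    // put_long_term would check this before its HV CAS and abort,
    // preventing recovery from deleting the freshly uploaded LT blob.
    assert!(!guard.is_valid());
    drop(guard); // prints "leaving entry for recovery"
}
```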