feat(service): Implement BigTable-backed ChangeLog for tiered storage crash recovery (#413)
Add a BigTable implementation of the ChangeLog trait so that in-flight mutations in TieredStorage are persisted for crash recovery.

The ChangeLog records each mutation (object ID, old and new storage locations) before work begins and removes it on clean completion. On startup, TieredStorage scans for stale entries (those whose cell timestamp is older than a staleness threshold) and re-runs the cleanup to converge any partially-completed state.

Stale entries are claimed via a BigTable CheckAndMutateRow CAS: an entry is only claimed if its timestamp is still below the threshold, so concurrent recovery instances cannot process the same entry twice. Rows stream lazily from BigTable as the caller polls, with CAS claims interleaved with row delivery.

The changelog backend is now explicitly configured via a new ChangeLogConfig enum on TieredStorageConfig (Noop | BigTable), defaulting to Noop. This decouples the changelog from the HV backend config and lets operators opt in independently.

The heartbeat-bump mechanism that would keep active-operation timestamps fresh is not yet implemented; the staleness threshold is currently fixed at 30 seconds.

Co-Authored-By: Claude <noreply@anthropic.com>
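The record-before-work / remove-on-completion / scan-for-stale contract can be sketched as follows. This is a minimal illustration using an in-memory map as a stand-in for the BigTable table; the struct and method names are assumptions, not the PR's actual API, and the real CAS claim and streaming delivery are elided.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

// Hypothetical shape of a changelog entry: object ID plus old/new locations.
#[derive(Clone, Debug, PartialEq)]
struct ChangeEntry {
    object_id: String,
    old_location: String, // e.g. an HV (BigTable) key
    new_location: String, // e.g. an LT (GCS) blob path
}

// In-memory stand-in for the BigTable table: row key -> (entry, write time).
struct InMemoryChangeLog {
    rows: HashMap<String, (ChangeEntry, Instant)>,
}

impl InMemoryChangeLog {
    fn new() -> Self {
        Self { rows: HashMap::new() }
    }

    // Record the mutation *before* either tier is touched.
    fn record(&mut self, entry: ChangeEntry) {
        self.rows.insert(entry.object_id.clone(), (entry, Instant::now()));
    }

    // Remove on clean completion.
    fn remove(&mut self, object_id: &str) {
        self.rows.remove(object_id);
    }

    // Recovery scan: entries at least `threshold` old are considered stale.
    fn scan_stale(&self, threshold: Duration) -> Vec<ChangeEntry> {
        self.rows
            .values()
            .filter(|(_, written_at)| written_at.elapsed() >= threshold)
            .map(|(entry, _)| entry.clone())
            .collect()
    }
}

fn main() {
    let mut log = InMemoryChangeLog::new();
    log.record(ChangeEntry {
        object_id: "obj-1".into(),
        old_location: "hv/obj-1".into(),
        new_location: "lt/obj-1".into(),
    });
    // A zero threshold treats everything as stale; a crashed write would
    // surface here and have its cleanup re-run.
    assert_eq!(log.scan_stale(Duration::ZERO).len(), 1);
    log.remove("obj-1");
    assert!(log.scan_stale(Duration::ZERO).is_empty());
}
```

The key invariant is ordering: the entry is durable before any backend mutation starts, so a crash at any point leaves behind a record that recovery can act on.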
…ounded Vec

- Add a heartbeat task spawned per in-flight change that re-records the entry every 10 s, keeping it alive in durable storage while cleanup is in progress. The heartbeat is aborted before log removal to prevent re-recording a deleted entry.
- Rewrite ChangeManager::recover() as an infinite polling loop: exponential backoff on scan errors, fixed wait on empty results, immediate re-poll after processing entries.
- Replace the lazy ChangeStream (BoxStream) with an eager bounded Vec. The scan now accepts a max count, passed as rows_limit to Bigtable's ReadRows RPC, so the streaming connection is closed before cleanup begins. This avoids holding a long-lived gRPC stream open across slow sequential cleanup.
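The recover() polling policy described above can be sketched as a pure delay function, which makes the three cases easy to test in isolation. The enum, constants, and function name here are illustrative assumptions, not the PR's actual values.

```rust
use std::time::Duration;

// Possible outcomes of one scan iteration in the recovery loop.
#[derive(Debug, PartialEq)]
enum ScanOutcome {
    Error,     // transient scan failure
    Empty,     // no stale entries this round
    Processed, // claimed and cleaned at least one entry
}

const EMPTY_WAIT: Duration = Duration::from_secs(5); // assumed fixed wait
const BASE_BACKOFF: Duration = Duration::from_secs(1);
const MAX_BACKOFF: Duration = Duration::from_secs(60); // assumed cap

// Returns (delay before the next poll, updated backoff state).
fn next_poll_delay(outcome: ScanOutcome, backoff: Duration) -> (Duration, Duration) {
    match outcome {
        // Exponential backoff on scan errors, capped at MAX_BACKOFF.
        ScanOutcome::Error => (backoff, (backoff * 2).min(MAX_BACKOFF)),
        // Fixed wait when the log is empty; reset the backoff state.
        ScanOutcome::Empty => (EMPTY_WAIT, BASE_BACKOFF),
        // Immediate re-poll after processing entries; reset the backoff state.
        ScanOutcome::Processed => (Duration::ZERO, BASE_BACKOFF),
    }
}

fn main() {
    // An error waits the current backoff and doubles it for next time.
    let (wait, next) = next_poll_delay(ScanOutcome::Error, BASE_BACKOFF);
    assert_eq!(wait, Duration::from_secs(1));
    assert_eq!(next, Duration::from_secs(2));

    // Processed entries trigger an immediate re-poll.
    let (wait, _) = next_poll_delay(ScanOutcome::Processed, next);
    assert_eq!(wait, Duration::ZERO);

    // An empty scan waits a fixed interval.
    let (wait, _) = next_poll_delay(ScanOutcome::Empty, BASE_BACKOFF);
    assert_eq!(wait, EMPTY_WAIT);
}
```

Keeping the policy separate from the loop body (which would await the delay, run the scan, and claim entries) is one way to avoid timing-dependent tests of the infinite loop itself.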
If changelog heartbeats fail consistently, the write-ahead log entry will expire and be claimed by recovery on another instance. Scheduling local cleanup in addition would be redundant and would add load during incidents where Bigtable is already degraded; recovery runs cleanup sequentially and in a controlled way.

- Heartbeat now returns on the first error (after internal retries); the tracker token is released and the JoinHandle becomes finished.
- ChangeGuard::is_valid() exposes whether the heartbeat is still running.
- put_long_term checks is_valid() before the HV CAS and aborts if the heartbeat has stopped, preventing the race where recovery reads a pre-CAS tombstone and deletes the newly uploaded LT blob.
- ChangeGuard::drop skips scheduling local cleanup when is_valid() is false, leaving the log entry for recovery to handle.
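The guard-validity contract in the last two bullets can be sketched as below. This is a simplified stand-in: the real ChangeGuard tracks an async heartbeat task, modeled here as a shared AtomicBool, and "scheduling local cleanup" is reduced to setting a flag; all names beyond ChangeGuard and is_valid are assumptions.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;

struct ChangeGuard {
    // Cleared when the heartbeat task exits on error (stand-in for a
    // finished JoinHandle in the real implementation).
    heartbeat_alive: Arc<AtomicBool>,
    // Observable stand-in for "local cleanup was scheduled".
    local_cleanup_scheduled: Arc<AtomicBool>,
}

impl ChangeGuard {
    // True while the heartbeat is still re-recording the log entry.
    fn is_valid(&self) -> bool {
        self.heartbeat_alive.load(Ordering::Acquire)
    }
}

impl Drop for ChangeGuard {
    fn drop(&mut self) {
        if self.is_valid() {
            // Normal path: the log entry is still ours, so local cleanup runs.
            self.local_cleanup_scheduled.store(true, Ordering::Release);
        }
        // Otherwise the entry may already be claimed by recovery on another
        // instance, so we skip local cleanup and leave it for recovery.
    }
}

fn main() {
    // Healthy heartbeat: dropping the guard schedules local cleanup.
    let scheduled = Arc::new(AtomicBool::new(false));
    drop(ChangeGuard {
        heartbeat_alive: Arc::new(AtomicBool::new(true)),
        local_cleanup_scheduled: scheduled.clone(),
    });
    assert!(scheduled.load(Ordering::Acquire));

    // Dead heartbeat: no local cleanup; recovery owns the entry.
    let scheduled = Arc::new(AtomicBool::new(false));
    drop(ChangeGuard {
        heartbeat_alive: Arc::new(AtomicBool::new(false)),
        local_cleanup_scheduled: scheduled.clone(),
    });
    assert!(!scheduled.load(Ordering::Acquire));
}
```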
TieredStorage writes span two backends — a mutation touches both the high-volume (BigTable) and long-term (GCS) tiers. If a process crashes mid-write, neither tier is aware of what the other may have partially received, leaving orphaned blobs with no owning tombstone.
This PR introduces a write-ahead changelog that records each in-flight mutation before work begins and removes it on clean completion. On startup, TieredStorage scans for stale entries (those belonging to crashed operations) and re-runs the cleanup step to converge the system back to a consistent state.
- CheckAndMutateRow: only the instance that wins the CAS on a stale entry processes it; concurrent recovery instances skip it.
- ChangeLogConfig: a Noop | BigTable field on TieredStorageConfig, defaulting to Noop. Existing configs without the field are unaffected.
- Not yet implemented: the heartbeat-bump mechanism that refreshes active-operation timestamps during long writes. Without it the staleness threshold (30 s) must be set conservatively. The bump task is the intended follow-up.
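The opt-in config shape can be sketched as follows. The variant and field names follow the PR text, but the derives and the exact field name on TieredStorageConfig are assumptions (the real config likely also involves serde for deserialization, elided here).

```rust
// Changelog backend selection, defaulting to Noop so the feature is opt-in.
#[derive(Debug, Default, Clone, PartialEq)]
enum ChangeLogConfig {
    #[default]
    Noop,
    BigTable,
}

#[derive(Debug, Default)]
struct TieredStorageConfig {
    // Other backend settings (HV, LT) elided.
    change_log: ChangeLogConfig,
}

fn main() {
    // A config that omits the field gets the Noop default, so existing
    // deployments see no behavior change until they opt in.
    assert_eq!(TieredStorageConfig::default().change_log, ChangeLogConfig::Noop);
}
```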