feat(service): Implement BigTable-backed ChangeLog for tiered storage crash recovery#413

Draft
jan-auer wants to merge 3 commits into main from worktree-changelog-bigtable

Conversation

@jan-auer
Member

TieredStorage writes span two backends — a mutation touches both the high-volume (BigTable) and long-term (GCS) tiers. If a process crashes mid-write, neither tier is aware of what the other may have partially received, leaving orphaned blobs with no owning tombstone.

This PR introduces a write-ahead changelog that records each in-flight mutation before work begins and removes it on clean completion. On startup, TieredStorage scans for stale entries (those belonging to crashed operations) and re-runs the cleanup step to converge the system back to a consistent state.

  • Claiming is race-safe via BigTable CheckAndMutateRow: only the instance that wins the CAS on a stale entry processes it; concurrent recovery instances skip it.
  • Lazy streaming: the scan uses a server-streaming RPC so rows are claimed as they arrive rather than buffered up front.
  • Explicit config: changelog backend is a new ChangeLogConfig: Noop | BigTable field on TieredStorageConfig, defaulting to Noop. Existing configs without the field are unaffected.
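The new config field might look roughly like the following sketch. The exact types are not shown in this PR, so the names here (`ChangeLogConfig`, `change_log`, `TieredStorageConfig`'s other fields) are assumptions; the key point it illustrates is that the enum defaults to `Noop`, so configs written before this change keep their existing behavior.

```rust
/// Hypothetical sketch of the changelog backend selector described above.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Default)]
pub enum ChangeLogConfig {
    /// No write-ahead changelog; mutations are not recorded (prior behavior).
    #[default]
    Noop,
    /// Record in-flight mutations in a BigTable-backed changelog.
    BigTable,
}

#[allow(dead_code)]
pub struct TieredStorageConfig {
    // ... other tiered-storage fields elided ...
    pub change_log: ChangeLogConfig,
}

fn main() {
    // Configs that omit the field fall back to Noop, so existing
    // deployments are unaffected until an operator opts in.
    assert_eq!(ChangeLogConfig::default(), ChangeLogConfig::Noop);
    println!("default changelog backend: {:?}", ChangeLogConfig::default());
}
```

Decoupling this from the high-volume backend config means an operator can run BigTable for data while leaving the changelog off, or vice versa.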

Not yet implemented: the heartbeat-bump mechanism that refreshes active-operation timestamps during long writes. Without it the staleness threshold (30 s) must be set conservatively. The bump task is the intended follow-up.

jan-auer and others added 3 commits March 26, 2026 13:47
Add a BigTable implementation of the ChangeLog trait so that in-flight
mutations in TieredStorage are persisted for crash recovery.

The ChangeLog records each mutation (object ID, old and new storage
locations) before work begins and removes it on clean completion. On
startup, TieredStorage scans for stale entries — those whose cell
timestamp is older than a staleness threshold — and re-runs the
cleanup to converge any partially-completed state.

Stale entries are claimed via a BigTable CheckAndMutateRow CAS: an
entry is only claimed if its timestamp is still below the threshold,
so concurrent recovery instances cannot process the same entry twice.
Rows stream lazily from BigTable as the caller polls, with CAS claims
interleaved with row delivery.
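The claim-if-still-stale semantics can be modeled without BigTable at all. The following is a toy in-memory sketch, not the PR's implementation: a `HashMap` of entry timestamps stands in for the table, and `try_claim` mirrors the `CheckAndMutateRow` predicate ("timestamp still below the threshold") plus its mutation (bump the timestamp), so a racing second claimer sees a fresh timestamp and backs off.

```rust
use std::collections::HashMap;

/// Toy in-memory model of the CheckAndMutateRow claim described above.
struct ChangeLog {
    /// entry key -> last-recorded timestamp (seconds).
    entries: HashMap<String, u64>,
}

impl ChangeLog {
    /// Claim `key` only if its timestamp is still older than `stale_before`.
    /// Claiming bumps the timestamp to `now` (the "mutate" half of
    /// check-and-mutate), so a concurrent recovery instance running the
    /// same check afterwards sees a fresh timestamp and skips the entry.
    fn try_claim(&mut self, key: &str, stale_before: u64, now: u64) -> bool {
        match self.entries.get_mut(key) {
            Some(ts) if *ts < stale_before => {
                *ts = now;
                true
            }
            _ => false,
        }
    }
}

fn main() {
    let mut log = ChangeLog {
        entries: HashMap::from([("obj-1".to_string(), 100)]),
    };
    // The first recovery instance wins the claim...
    assert!(log.try_claim("obj-1", 130, 200));
    // ...and a second instance racing on the same entry loses.
    assert!(!log.try_claim("obj-1", 130, 200));
    println!("claimed once, second claim rejected");
}
```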

The changelog backend is now explicitly configured via a new
ChangeLogConfig enum on TieredStorageConfig (Noop | BigTable),
defaulting to Noop. This decouples the changelog from the HV backend
config and lets operators opt in independently.

The heartbeat-bump mechanism that would keep active-operation
timestamps fresh is not yet implemented; the staleness threshold is
currently fixed at 30 seconds.

Co-Authored-By: Claude <noreply@anthropic.com>
…ounded Vec

- Add a heartbeat task spawned per in-flight change that re-records the entry
  every 10s, keeping it alive in durable storage while cleanup is in progress.
  The heartbeat is aborted before log removal to prevent re-recording a deleted
  entry.

- Rewrite ChangeManager::recover() as an infinite polling loop: exponential
  backoff on scan errors, fixed wait on empty results, immediate re-poll after
  processing entries.

- Replace the lazy ChangeStream (BoxStream) with an eager bounded Vec. The scan
  now accepts a max count, passed as rows_limit to BigTable's ReadRows RPC, so
  the streaming connection is closed before cleanup begins. This avoids holding
  a long-lived gRPC stream open across slow sequential cleanup.
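The polling policy in the rewritten recover() loop can be captured as a small pure function. This is a sketch under assumed constants (base/max backoff and empty-scan wait are not specified in the PR beyond the behavior): exponential backoff on scan errors, a fixed wait on empty results, and an immediate re-poll after entries were processed.

```rust
use std::time::Duration;

/// Outcome of one changelog scan, as described in the commit message.
enum ScanOutcome {
    Error,
    Empty,
    Processed,
}

// Assumed tuning values; the PR only fixes the 30 s staleness threshold.
const BASE_BACKOFF: Duration = Duration::from_secs(1);
const MAX_BACKOFF: Duration = Duration::from_secs(60);
const EMPTY_WAIT: Duration = Duration::from_secs(10);

/// Returns (wait before the next poll, updated consecutive-error count).
fn next_poll_delay(outcome: ScanOutcome, errors: u32) -> (Duration, u32) {
    match outcome {
        ScanOutcome::Error => {
            let errors = errors + 1;
            // Double the delay per consecutive error, capped at MAX_BACKOFF.
            let backoff = BASE_BACKOFF
                .saturating_mul(1u32 << errors.min(6))
                .min(MAX_BACKOFF);
            (backoff, errors)
        }
        // Nothing stale right now: sleep a fixed interval, reset the counter.
        ScanOutcome::Empty => (EMPTY_WAIT, 0),
        // Entries were processed: re-poll immediately in case more are stale.
        ScanOutcome::Processed => (Duration::ZERO, 0),
    }
}

fn main() {
    assert_eq!(next_poll_delay(ScanOutcome::Processed, 3).0, Duration::ZERO);
    assert_eq!(next_poll_delay(ScanOutcome::Empty, 3), (EMPTY_WAIT, 0));
    let (d1, e1) = next_poll_delay(ScanOutcome::Error, 0);
    let (d2, _) = next_poll_delay(ScanOutcome::Error, e1);
    assert!(d2 > d1); // backoff grows with consecutive errors
    println!("backoff after 1 error: {:?}, after 2: {:?}", d1, d2);
}
```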
If changelog heartbeats fail consistently, the write-ahead log entry will
expire and be claimed by recovery on another instance. Scheduling local
cleanup in addition would be redundant and add load during incidents where
BigTable is already degraded — recovery runs cleanup sequentially and in
a controlled way.

- Heartbeat now returns on the first error (after internal retries); the
  tracker token is released and the JoinHandle becomes finished.
- ChangeGuard::is_valid() exposes whether the heartbeat is still running.
- put_long_term checks is_valid() before the HV CAS and aborts if the
  heartbeat has stopped, preventing the race where recovery reads a
  pre-CAS tombstone and deletes the newly uploaded LT blob.
- ChangeGuard::drop skips scheduling local cleanup when is_valid() is
  false, leaving the log entry for recovery to handle.
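The guard behavior in the list above can be sketched as follows. This is a toy model, not the PR's code: a shared `AtomicBool` stands in for "the heartbeat task is still running" (the real code inspects a `JoinHandle`), and the field names are assumptions. It shows the two decisions the list describes: abort before the CAS when `is_valid()` is false, and skip local cleanup on drop so recovery handles the entry instead.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;

/// Toy model of the ChangeGuard behavior listed above (names assumed).
struct ChangeGuard {
    /// Set to false when the heartbeat task stops (after internal retries).
    heartbeat_alive: Arc<AtomicBool>,
}

impl ChangeGuard {
    /// True while the heartbeat is still refreshing the log entry.
    fn is_valid(&self) -> bool {
        self.heartbeat_alive.load(Ordering::Acquire)
    }
}

impl Drop for ChangeGuard {
    fn drop(&mut self) {
        if self.is_valid() {
            // Clean path: local cleanup would be scheduled here.
            println!("scheduling local cleanup");
        } else {
            // Heartbeat died: leave the log entry for recovery on another
            // instance rather than racing it with local cleanup.
            println!("leaving entry for recovery");
        }
    }
}

fn main() {
    let alive = Arc::new(AtomicBool::new(true));
    let guard = ChangeGuard { heartbeat_alive: alive.clone() };
    // Simulate the heartbeat failing mid-operation.
    alive.store(false, Ordering::Release);
    // put_long_term would check this before its HV CAS and abort,
    // preventing recovery from deleting the freshly uploaded LT blob.
    assert!(!guard.is_valid());
    drop(guard); // prints "leaving entry for recovery"
}
```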