chore(ci): try Swatinem/rust-cache for target/ caching #25478

Draft: pront wants to merge 7 commits into master from explore/rust-cache


pront (Member) commented May 20, 2026

Status: paused

Exploration / prototype. Do not merge as-is — see Pre-merge cleanup below. Findings recorded here for whoever picks this up next; primary follow-up is to evaluate sccache + a Datadog-internal S3 cache instead of GHA cache (see Next steps).

Summary

Replaces the hand-rolled actions/cache step in .github/actions/setup/action.yml with Swatinem/rust-cache@v2.9.1, which caches ~/.cargo and target/ build artifacts. The current cache step only caches Cargo's downloaded sources (~/.cargo/registry and ~/.cargo/git/db), not compiled artifacts, so every Vector CI job does a cold cargo build today.

Configuration:

  • prefix-key: "v1-vector": explicit cache version for easy invalidation
  • Per-job cache partitioning (Swatinem's default, no shared-key). With a single shared key, jobs race on save and only the lightest job's entry is persisted
  • save-if: true (TEMP): must be reverted to save-if: ${{ github.ref == 'refs/heads/master' }} before merging
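
As a rough sketch, the configuration above would correspond to a step like the following in .github/actions/setup/action.yml. The step name is illustrative, and only the inputs named in this description are set; everything else is assumed to stay at Swatinem's defaults.

```yaml
# Sketch of the cache step described above; the step name is hypothetical.
- name: Cache Cargo home and target/
  uses: Swatinem/rust-cache@v2.9.1
  with:
    # Explicit cache version; bump it to invalidate every entry at once.
    prefix-key: "v1-vector"
    # No shared-key: rely on the default per-job partitioning.
    # TEMP for this draft so PR runs can seed the cache. Before merging:
    #   save-if: ${{ github.ref == 'refs/heads/master' }}
    save-if: true
```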

Measurements

Empirical per-job cache entry sizes from this PR's CI runs:

| Job | Size | Notes |
|---|---|---|
| check-fmt | 486 MB | registry only, no real compile |
| check-events | 486 MB | registry only |
| build-vrl-playground | 709 MB | |
| test-vrl | 1.16 GB | |
| check-clippy | 1.38 GB | |
| check-rust-docs | 1.47 GB | |
| tests (workspace) | 2.35 GB | |
| check-generated-docs | 2.55 GB | heaviest single job |
| Sum | ~10.1 GB | |

Cache HIT was confirmed. On a second push to this branch (no Cargo.lock change), check-clippy reported:
Cache hit for: v1-vector-check-clippy-Linux-x64-f996a9aa-c3e9f435
Restored from cache key ... full match: true.

Cargo's first activity after restore was workspace-local code (fakedata, vector-vrl/web-playground) instead of cold-start third-party crates (portable-atomic, critical-section, itoa). Compile time for check-clippy dropped from ~6 min cold → ~2 min warm.

Limitations (the reason this is paused)

The 10 GB-per-repo GHA cache cap is the binding constraint.

Sum of Vector's per-job entries (~10 GB) saturates the entire repo budget on one CI cycle. Between the two pushes on this branch (~40 min apart), LRU eviction killed 2 of 7 entries, including the largest one (check-generated-docs at 2.55 GB). Cache works at the macro level but individual entries are at constant eviction risk.

Pre-PR repo state was already at 91% saturation: 11.56 GB across 62 caches, with 26 redundant Linux-cargo-* entries (~9.8 GB) because the current cache key includes hashFiles('**/Cargo.lock') and every Cargo.lock variant created a new ~400 MB save.
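
For reference, a sketch of what the status-quo step likely looks like, reconstructed only from the key strategy and cached paths listed in this description rather than copied from action.yml. It shows why lockfile churn multiplies entries: the key changes whenever any Cargo.lock changes, and there is no pruning of older keys.

```yaml
# Reconstruction of the current step from this description; details in the
# real action.yml may differ.
- uses: actions/cache@v5
  with:
    path: |
      ~/.cargo/registry/index
      ~/.cargo/registry/cache
      ~/.cargo/git/db
    # Every distinct Cargo.lock hashes to a new key, so each lockfile
    # variant saves its own ~400 MB entry.
    key: ${{ runner.os }}-cargo-${{ hashFiles('**/Cargo.lock') }}
```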

After this PR's runs, repo cache settled to 9.88 GB across 5 caches — Swatinem's pruning and unified per-job entries replaced the noise.
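
The repo-level numbers above can be re-checked with the GitHub CLI. The workflow below is a hypothetical diagnostic sketch (the workflow name, job name, and step names are made up); the flags and JSON field names reflect my understanding of gh cache list and are worth confirming against gh cache list --help.

```yaml
# Hypothetical diagnostic workflow; not part of this PR.
name: cache-usage
on: workflow_dispatch
permissions:
  actions: read
jobs:
  report:
    runs-on: ubuntu-latest
    steps:
      - name: Largest cache entries
        env:
          GH_TOKEN: ${{ github.token }}
        run: |
          gh cache list --repo "$GITHUB_REPOSITORY" --limit 100 \
            --sort size_in_bytes --order desc \
            --json key,sizeInBytes,lastAccessedAt
      - name: Total size in bytes (first 100 entries only)
        env:
          GH_TOKEN: ${{ github.token }}
        run: |
          gh cache list --repo "$GITHUB_REPOSITORY" --limit 100 \
            --json sizeInBytes --jq 'map(.sizeInBytes) | add'
```

Sorting by size alongside lastAccessedAt makes it easier to see which entries sit at the eviction frontier.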

What works in steady state with save-if: refs/heads/master

  • Master pushes save the cache (saturated, ~10 GB total).
  • PR jobs read master's cache without writing — no per-PR explosion.
  • PRs that don't touch Cargo.lock get full hits on the surviving entries; only the touched workspace members recompile.
  • 1-2 entries per PR likely miss and rebuild cold (whichever was at the eviction frontier).

What doesn't work

  • All caches co-resident: impossible. We always lose 1-2 entries between master pushes.
  • Cargo.lock churn on master: every change creates a new ~2-3 GB save per job per OS, evicting older entries. Vector's typical 5-15 lockfile changes/week saturate quickly.
  • Cross-OS caching (macOS): not yet covered; would double the budget pressure.
  • cross.yml: still cold (Docker-based cross-compile, host cache doesn't reach the container). Confirmed out of scope per discussion — release builds always cold.

Comparison vs status quo

| | Status quo (today's master) | This PR |
|---|---|---|
| Cache action | actions/cache@v5 | Swatinem/rust-cache@v2.9.1 |
| Cached paths | ~/.cargo/registry/{index,cache}, ~/.cargo/git/db | All of those + ~/.cargo/bin + target/ |
| Key strategy | ${runner.os}-cargo-${hash(Cargo.lock)}; creates one ~400 MB entry per Cargo.lock variant | Per-job, lockfile-hashed, prefix-keyed; Swatinem prunes before save |
| Active entries observed | 62 (26 of which are stale Linux-cargo-* near-duplicates totalling ~9.8 GB) | 5 useful per-job entries totalling ~9.88 GB |
| target/ cached | No; every CI job does a cold cargo build | Yes |
| Cold PR build time | ~15-20 min | unchanged |
| Warm PR build time | n/a (always cold) | ~3-5 min for src-only changes |
| Goes over 10 GB cap | Yes (currently 91% from stale entries) | Yes (saturated by useful entries, LRU evicts) |

Pre-merge cleanup (when resuming)

  1. Revert save-if: true to save-if: ${{ github.ref == 'refs/heads/master' }} in .github/actions/setup/action.yml.
  2. Revert the src/main.rs source-touch added to trigger the changes filter. (The filter at .github/workflows/changes.yml doesn't include .github/actions/setup/**; possibly worth adding as a follow-up so this hack isn't needed next time.)
  3. Manually delete stale Linux-cargo-* entries before merging so the new Swatinem cache has room to seed cleanly:
    gh cache list --repo vectordotdev/vector --key "Linux-cargo-" --limit 100 --json id -q '.[].id' \
      | xargs -I{} gh cache delete {} --repo vectordotdev/vector
  4. Restore the original commit history if desired (current branch carries iterative debug commits — squash before merging).

Next steps (not in scope for this PR)

The right long-term answer for Vector's cache budget is to switch to sccache with a Datadog-internal S3 backend, which removes the 10 GB GHA cap entirely. The bucket dd-sccache-storage-us1-ddbuild-io already exists (used by observability-pipelines-worker on Datadog's GitLab CI per their .gitlab-ci.yml), and is documented in the Confluence page 2025-07-02 - S3 Lifecycle Policy Exploration for sccache Storage. The bucket uses a 30-day TTL.

Blocker for Vector to use it from GHA-hosted runners: needs an IAM role + OIDC trust policy in the build-stable AWS account that allows repo:vectordotdev/vector:*. That requires coordination with the build-stable team (Bruce Guenter's group per the Confluence page).

Once that's in place, follow-up PR replaces Swatinem entirely with:

# The job needs `permissions: id-token: write` for the OIDC role assumption.
- uses: aws-actions/configure-aws-credentials@<sha>
  with:
    role-to-assume: arn:aws:iam::<acct>:role/vector-sccache-readwrite
    aws-region: us-east-1
- uses: mozilla-actions/sccache-action@<sha>
# Export the sccache settings to every later step in the job.
- run: |
    {
      echo "SCCACHE_BUCKET=dd-sccache-storage-us1-ddbuild-io"
      echo "SCCACHE_REGION=us-east-1"
      echo "SCCACHE_S3_KEY_PREFIX=vector"
      echo "SCCACHE_S3_USE_SSL=true"
      echo "RUSTC_WRAPPER=sccache"
      echo "CC=sccache cc"
      echo "CXX=sccache c++"
    } >> "$GITHUB_ENV"

One conflict to resolve: RUSTC_WRAPPER=sccache collides with the mold wrapper at .github/actions/setup/action.yml:153-198. Options: drop the mold wrapper in favor of RUSTFLAGS="-C link-arg=-fuse-ld=mold", or use update-alternatives for a system-level linker swap. Either is cleaner than the current wrapper-script approach and removes mold from Cargo's fingerprint inputs.
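
A minimal sketch of the first option, assuming mold is already installed on the runner and that the C compiler driver used for linking accepts -fuse-ld=mold. The step name is illustrative.

```yaml
# Sketch only: selects mold through a linker flag instead of a wrapper script,
# leaving RUSTC_WRAPPER free for sccache.
- name: Link with mold via RUSTFLAGS
  shell: bash
  run: echo "RUSTFLAGS=-C link-arg=-fuse-ld=mold" >> "$GITHUB_ENV"
```

One caveat to check before adopting this: the RUSTFLAGS environment variable takes precedence over rustflags set in .cargo/config.toml rather than merging with them, so any flags Vector already sets there would need to be carried over.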

Change Type

  • Non-functional (chore, refactoring, docs)

Is this a breaking change?

  • No

Does this PR include user facing changes?

  • No. A maintainer will apply the no-changelog label to this PR.

🤖 Generated with Claude Code

Replace the hand-rolled actions/cache step (which only cached
~/.cargo/registry) with Swatinem/rust-cache@v2.9.1, which also caches
target/ build artifacts and prunes them intelligently before save:
removes incremental artifacts, deps no longer in Cargo.lock, artifacts
older than ~1 week, and ~/.cargo/registry/src (recreated from archives
on restore).

Configuration:
- shared-key: "vector-<os>" — single OS-shared cache instead of one per
  Cargo.lock hash (today's pattern produces ~26 active entries at
  ~400 MB each, exhausting the 10 GB GHA cache budget).
- save-if: refs/heads/master — only master pushes save the cache.
  PR jobs restore from master's cache but don't write, so PR churn no
  longer creates per-PR cache entries.

Expected behavior on a PR with src/ changes only: Cargo fingerprints
match for all unchanged third-party crates (rdkafka-sys, openssl-sys,
zstd-sys, …), so only the touched workspace members recompile. Cold
build time on cached PRs should drop from ~20m to ~3-5m.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

github-actions bot added the domain: ci label May 20, 2026

pront and others added 3 commits May 20, 2026 14:44
So this draft branch can produce a measurable cache hit on its own
second push. Revert before merging.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

`key:` is additive: Swatinem appends rustc-hash + lockfile-hash to it
regardless. `shared-key:` is the documented way to partition only by
OS (still gets the rustc/lockfile suffixes appended, but that's the
intended default behavior).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The source-change filter in .github/workflows/changes.yml doesn't
include .github/actions/setup/**, so this PR's CI run skipped all the
Rust-building workflows that exercise the cache. Touching src/main.rs
with a comment flips the source filter on. Revert before merging.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

pront added the no-changelog label May 20, 2026

pront and others added 3 commits May 20, 2026 14:54
Swatinem appends `${runnerOS}-${runnerArch}` automatically. Including
runner.os in shared-key produced keys like `vector-Linux-Linux-x64-...`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

shared-key forced multiple Rust CI jobs (check-fmt, check-clippy, Build,
…) to compete for the same cache entry on save. The lightest job
(check-fmt) consistently won the reservation lock and persisted its
near-empty target/ (~486 MB, mostly registry). Heavier jobs that
compiled the full workspace got rejected:

  Failed to save: Unable to reserve cache with key
  v0-rust-vector-Linux-x64-..., another job may be creating this cache.

Letting Swatinem use its default job-id partitioning produces one entry
per job, with no contention. Per-entry size will vary by job (clippy
and Build will produce multi-GB entries; deny/fmt stay small). Trade
off is more entries vs. correct per-job artifact reuse.

Also bumps prefix-key to v1-vector so old shared-key entries don't
shadow the new job-partitioned ones during transition.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Confirms whether check-clippy / tests cache restores actually hit
(i.e. saved entries from prior run survive eviction). Will revert
along with the source touch before merging.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Labels: do not merge, domain: ci, no-changelog