chore(ci): try Swatinem/rust-cache for target/ caching by pront · Pull Request #25478 · vectordotdev/vector

pront · 2026-05-20T18:28:01Z

Status: paused

Exploration / prototype. Do not merge as-is — see Pre-merge cleanup below. Findings recorded here for whoever picks this up next; primary follow-up is to evaluate sccache + a Datadog-internal S3 cache instead of GHA cache (see Next steps).

Summary

Replaces the hand-rolled actions/cache step in .github/actions/setup/action.yml with Swatinem/rust-cache@v2.9.1, which caches both ~/.cargo and target/ build artifacts. The current step caches only the Cargo registry and git database under ~/.cargo and never target/, so every Vector CI job does a cold cargo build today.

Configuration:

prefix-key: "v1-vector" — explicit cache version for easy invalidation
Per-job cache partitioning (Swatinem's default, no shared-key). With a single shared key, jobs race to save the same entry and only the lightest job (check-fmt) wins the reservation, so the heavier jobs' target/ never gets saved.
save-if: true (TEMP): must be reverted to ${{ github.ref == 'refs/heads/master' }} before merging

Measurements

Empirical per-job cache entry sizes from this PR's CI runs:

Job	Size	Notes
check-fmt	486 MB	registry only, no real compile
check-events	486 MB	registry only
build-vrl-playground	709 MB
test-vrl	1.16 GB
check-clippy	1.38 GB
check-rust-docs	1.47 GB
tests (workspace)	2.35 GB
check-generated-docs	2.55 GB	heaviest single job
Sum	~10.1 GB

Cache HIT was confirmed. On a second push to this branch (no Cargo.lock change), check-clippy reported:
Cache hit for: v1-vector-check-clippy-Linux-x64-f996a9aa-c3e9f435
Restored from cache key ... full match: true.

Cargo's first activity after restore was workspace-local code (fakedata, vector-vrl/web-playground) instead of cold-start third-party crates (portable-atomic, critical-section, itoa). Compile time for check-clippy dropped from ~6 min cold → ~2 min warm.
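The cache key in the hit message above can be read segment by segment. The sketch below splits it with plain shell parameter expansion, assuming the layout is prefix-key, job id, runner OS, runner arch, then two hash suffixes. The commit notes on this PR describe the suffixes as a rustc hash and a lockfile hash; their order and the variable names used here are assumptions.

```shell
key="v1-vector-check-clippy-Linux-x64-f996a9aa-c3e9f435"
prefix="v1-vector"            # prefix-key from this PR's configuration

rest="${key#"$prefix"-}"      # strip the prefix-key
lock_hash="${rest##*-}"       # last segment (assumed: lockfile hash)
rest="${rest%-*}"
env_hash="${rest##*-}"        # next-to-last segment (assumed: rustc/environment hash)
rest="${rest%-*}"
arch="${rest##*-}"            # runner architecture
rest="${rest%-*}"
os="${rest##*-}"              # runner OS
job="${rest%-*}"              # whatever remains is the job id

printf 'job=%s os=%s arch=%s env_hash=%s lock_hash=%s\n' \
  "$job" "$os" "$arch" "$env_hash" "$lock_hash"
```

Read this way, a change to either hash produces a new key for the same job, which is what drives the entry churn discussed under Limitations.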

Limitations (the reason this is paused)

The 10 GB-per-repo GHA cache cap is the binding constraint.

Sum of Vector's per-job entries (~10 GB) saturates the entire repo budget on one CI cycle. Between the two pushes on this branch (~40 min apart), LRU eviction killed 2 of 7 entries, including the largest one (check-generated-docs at 2.55 GB). Cache works at the macro level but individual entries are at constant eviction risk.
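As a rough cross-check of the saturation claim, the per-job sizes from the Measurements table can be summed against the cap. This is only a sketch: it treats 1 GB as 1000 MB and the cap as 10000 MB, both of which are simplifying assumptions about how GitHub accounts for cache size.

```shell
# Per-job entry sizes in MB, copied from the Measurements table (1 GB taken as 1000 MB).
sizes_mb="486 486 709 1160 1380 1470 2350 2550"
cap_mb=10000   # 10 GB per-repo GHA cache cap (assumed to mean 10000 MB here)

total_mb=0
for s in $sizes_mb; do
  total_mb=$((total_mb + s))
done

echo "total: ${total_mb} MB of ${cap_mb} MB"
if [ "$total_mb" -ge "$cap_mb" ]; then
  echo "over cap by $((total_mb - cap_mb)) MB: at least one entry must be evicted"
fi
```

The sum of all eight listed rows comes out a little higher than the ~10.1 GB total quoted in the table; either figure is at or above the cap, so one full CI cycle cannot keep every entry resident.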

Pre-PR repo state was already at 91% saturation: 11.56 GB across 62 caches, with 26 redundant Linux-cargo-* entries (~9.8 GB) because the current cache key includes hashFiles('**/Cargo.lock') and every Cargo.lock variant created a new ~400 MB save.

After this PR's runs, repo cache settled to 9.88 GB across 5 caches — Swatinem's pruning and unified per-job entries replaced the noise.

What works in steady state with `save-if: refs/heads/master`

Master pushes save the cache (saturated, ~10 GB total).
PR jobs read master's cache without writing — no per-PR explosion.
PRs that don't touch Cargo.lock get full hits on the surviving entries; only the touched workspace members recompile.
1-2 entries per PR likely miss and rebuild cold (whichever was at the eviction frontier).

What doesn't work

All caches co-resident: impossible. We always lose 1-2 entries between master pushes.
Cargo.lock churn on master: every change creates a new ~2-3 GB save per job per OS, evicting older entries. Vector's typical 5-15 lockfile changes/week saturate quickly.
Cross-OS caching (macOS): not yet covered; would double the budget pressure.
cross.yml: still cold (Docker-based cross-compile, host cache doesn't reach the container). Confirmed out of scope per discussion — release builds always cold.
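The Cargo.lock churn figure above can be re-measured from a local checkout. The helper below is a minimal sketch (the function name is hypothetical): it counts commits reachable from the current branch that touched the root Cargo.lock within the last N days, which approximates how many new lockfile-hashed cache keys master would have produced.

```shell
# Count commits on the current branch that touched Cargo.lock in the last N days.
# Usage: count_lockfile_changes [days]   (defaults to 7)
count_lockfile_changes() {
  git log --since="${1:-7} days ago" --format=%H -- Cargo.lock | wc -l | tr -d ' '
}
```

Run from a checkout of master, `count_lockfile_changes 7` gives the weekly count. It counts commits rather than distinct lockfile contents, so a revert would be counted twice.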

Comparison vs status quo

	Status quo (today's master)	This PR
Cache action	`actions/cache@v5`	`Swatinem/rust-cache@v2.9.1`
Cached paths	`~/.cargo/registry/{index,cache}`, `~/.cargo/git/db`	All of those + `~/.cargo/bin` + `target/`
Key strategy	`${runner.os}-cargo-${hash(Cargo.lock)}` — creates one ~400 MB entry per Cargo.lock variant	Per-job, lockfile-hashed, prefix-keyed; Swatinem prunes before save
Active entries observed	62 (26 of which are stale `Linux-cargo-*` near-duplicates totalling ~9.8 GB)	5 useful per-job entries totalling ~9.88 GB
target/ cached	No — every CI job does a cold `cargo build`	Yes
Cold PR build time	~15-20 min	unchanged
Warm PR build time	n/a (always cold)	~3-5 min for src-only changes
Goes over 10 GB cap	Yes (currently 91% from stale entries)	Yes (saturated by useful entries, LRU evicts)

Pre-merge cleanup (when resuming)

Revert save-if: true to save-if: ${{ github.ref == 'refs/heads/master' }} in .github/actions/setup/action.yml.
Revert the src/main.rs source-touch added to trigger the changes filter. (The filter at .github/workflows/changes.yml doesn't include .github/actions/setup/**; possibly worth adding as a follow-up so this hack isn't needed next time.)

Manually delete stale Linux-cargo-* entries before merging so the new Swatinem cache has room to seed cleanly:

gh cache list --repo vectordotdev/vector --key "Linux-cargo-" --limit 100 --json id -q '.[].id' \
  | xargs -I{} gh cache delete {} --repo vectordotdev/vector
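Before running the delete, it may be worth previewing what the filter matches. The function below is a sketch of a dry run using the same `--key` prefix and `--limit`; the function name is hypothetical, and it assumes an authenticated gh CLI with access to the repository.

```shell
# Dry run: print id, key, and size in bytes of each matching entry without deleting anything.
preview_stale_caches() {
  gh cache list --repo vectordotdev/vector --key "Linux-cargo-" --limit 100 \
    --json id,key,sizeInBytes \
    -q '.[] | "\(.id)\t\(.key)\t\(.sizeInBytes)"'
}
```

If more than 100 entries match, the preview and the delete both need a higher `--limit` or a second pass.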

Restore the original commit history if desired (current branch carries iterative debug commits — squash before merging).

Next steps (not in scope for this PR)

The right long-term answer for Vector's cache budget is to move sccache to a Datadog-internal S3 backend, removing the 10 GB GHA cap entirely. The bucket dd-sccache-storage-us1-ddbuild-io already exists (used by observability-pipelines-worker on Datadog's GitLab CI per their .gitlab-ci.yml), and is documented in the Confluence page 2025-07-02 - S3 Lifecycle Policy Exploration for sccache Storage. The bucket uses a 30-day TTL.

Blocker for Vector to use it from GHA-hosted runners: needs an IAM role + OIDC trust policy in the build-stable AWS account that allows repo:vectordotdev/vector:*. That requires coordination with the build-stable team (Bruce Guenter's group per the Confluence page).

Once that's in place, follow-up PR replaces Swatinem entirely with:

- uses: aws-actions/configure-aws-credentials@<sha>
  with:
    role-to-assume: arn:aws:iam::<acct>:role/vector-sccache-readwrite
    aws-region: us-east-1
- uses: mozilla-actions/sccache-action@<sha>
- run: |
    {
      echo "SCCACHE_BUCKET=dd-sccache-storage-us1-ddbuild-io"
      echo "SCCACHE_REGION=us-east-1"
      echo "SCCACHE_S3_KEY_PREFIX=vector"
      echo "SCCACHE_S3_USE_SSL=true"
      echo "RUSTC_WRAPPER=sccache"
      echo "CC=sccache cc"
      echo "CXX=sccache c++"
    } >> "$GITHUB_ENV"

One conflict to resolve: RUSTC_WRAPPER=sccache collides with the mold wrapper at .github/actions/setup/action.yml:153-198. Options: drop the mold wrapper in favor of RUSTFLAGS="-C link-arg=-fuse-ld=mold", or use update-alternatives for a system-level linker swap. Either is cleaner than the current wrapper-script approach and removes mold from Cargo's fingerprint inputs.
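The first option could look roughly like the step body below. It is a sketch rather than a tested change: sccache keeps the RUSTC_WRAPPER slot and mold is selected through the linker flag quoted above. The temp-file fallback for GITHUB_ENV exists only so the snippet runs outside a workflow.

```shell
# Body of a workflow `run:` step. GITHUB_ENV is provided by GitHub Actions;
# fall back to a temp file when running this outside a workflow.
: "${GITHUB_ENV:=$(mktemp)}"

{
  # sccache owns the rustc wrapper slot.
  echo "RUSTC_WRAPPER=sccache"
  # mold is selected via a linker flag instead of a wrapper script.
  echo "RUSTFLAGS=-C link-arg=-fuse-ld=mold"
} >> "$GITHUB_ENV"
```

One thing to check before adopting this: when the RUSTFLAGS environment variable is set, Cargo ignores `build.rustflags` and `target.*.rustflags` from .cargo/config.toml, so any flags configured there would need to be carried over.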

Change Type

Non-functional (chore, refactoring, docs)

Is this a breaking change?

No

Does this PR include user facing changes?

No. A maintainer will apply the no-changelog label to this PR.

🤖 Generated with Claude Code

Replace the hand-rolled actions/cache step (which only cached ~/.cargo/registry) with Swatinem/rust-cache@v2.9.1, which also caches target/ build artifacts and prunes them intelligently before save: removes incremental artifacts, deps no longer in Cargo.lock, artifacts older than ~1 week, and ~/.cargo/registry/src (recreated from archives on restore).

Configuration:
- shared-key: "vector-<os>": single OS-shared cache instead of one per Cargo.lock hash (today's pattern produces ~26 active entries at ~400 MB each, exhausting the 10 GB GHA cache budget).
- save-if: refs/heads/master: only master pushes save the cache. PR jobs restore from master's cache but don't write, so PR churn no longer creates per-PR cache entries.

Expected behavior on a PR with src/ changes only: Cargo fingerprints match for all unchanged third-party crates (rdkafka-sys, openssl-sys, zstd-sys, …), so only the touched workspace members recompile. Cold build time on cached PRs should drop from ~20m to ~3-5m.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

So this draft branch can produce a measurable cache hit on its own second push. Revert before merging. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

`key:` is additive — Swatinem appends rustc-hash + lockfile-hash to it regardless. `shared-key:` is the documented way to partition only by OS (still gets the rustc/lockfile suffixes appended, but that's the intended default behavior). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The source-change filter in .github/workflows/changes.yml doesn't include .github/actions/setup/**, so this PR's CI run skipped all the Rust-building workflows that exercise the cache. Touching src/main.rs with a comment flips the source filter on. Revert before merging. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Swatinem appends `${runnerOS}-${runnerArch}` automatically. Including runner.os in shared-key produced keys like `vector-Linux-Linux-x64-...`. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

shared-key forced multiple Rust CI jobs (check-fmt, check-clippy, Build, …) to compete for the same cache entry on save. The lightest job (check-fmt) consistently won the reservation lock and persisted its near-empty target/ (~486 MB, mostly registry). Heavier jobs that compiled the full workspace got rejected:

Failed to save: Unable to reserve cache with key v0-rust-vector-Linux-x64-..., another job may be creating this cache.

Letting Swatinem use its default job-id partitioning produces one entry per job, with no contention. Per-entry size will vary by job (clippy and Build will produce multi-GB entries; deny/fmt stay small). The trade-off is more entries vs. correct per-job artifact reuse.

Also bumps prefix-key to v1-vector so old shared-key entries don't shadow the new job-partitioned ones during transition.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Confirms whether check-clippy / tests cache restores actually hit (i.e. saved entries from prior run survive eviction). Will revert along with the source touch before merging. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

github-actions bot added the label domain: ci (Anything related to Vector's CI environment) on May 20, 2026

pront and others added 3 commits May 20, 2026 14:44

chore(ci): temp - save cache on all refs to seed prototype

2e02c05


pront added the label no-changelog (Changes in this PR do not need user-facing explanations in the release changelog) on May 20, 2026

pront and others added 3 commits May 20, 2026 14:54

chore(ci): drop redundant runner.os from shared-key

58f9daa


pront added the do not merge label May 21, 2026

pront wants to merge 7 commits into master from explore/rust-cache
