fix(ci): fail loudly on disabled-cache test cache poisoning #23702

Closed
AztecBot wants to merge 1 commit into merge-train/fairies from cb/fatal-disabled-cache-skip

@AztecBot (Collaborator)

Problem

A disabled-cache test command's success-cache key is a constant string (hash_str_orig "disabled-cache run_test.sh …"), not a content hash. So once run_test_cmd wrote a pass into the success cache (SETEX $key …), every later run hashed the same constant string, hit the cache, and printed SKIPPED — the test never ran again. This is what skipped all TXE tests on merge-train/fairies: PR #23106 logged 709 SKIPPED / 0 executed (http://ci.aztec-labs.com/b893262917583a30).
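The difference between a content-derived key and a constant one can be illustrated with a minimal sketch. It is not the ci3 code: `hash_str`, `content_key`, and `disabled_key` are hypothetical helpers, and `cksum` stands in for whatever hash the real `hash_str_orig` uses.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical stand-in for hash_str_orig: hash an arbitrary string.
hash_str() { printf '%s' "$1" | cksum | cut -d' ' -f1; }

# Content-derived key: depends on the file's bytes as well as the command.
content_key() { { cat "$1"; printf '%s' "$2"; } | cksum | cut -d' ' -f1; }

# Disabled-cache key: a constant marker plus the command; file content is ignored.
disabled_key() { hash_str "disabled-cache $1"; }

tmp=$(mktemp)
echo "v1" > "$tmp"
c1=$(content_key "$tmp" "run_test.sh demo")
d1=$(disabled_key "run_test.sh demo")

echo "v2" > "$tmp"   # the test's input changes
c2=$(content_key "$tmp" "run_test.sh demo")
d2=$(disabled_key "run_test.sh demo")
rm -f "$tmp"

if [ "$c1" != "$c2" ]; then echo "content key changed"; fi
if [ "$d1" = "$d2" ]; then echo "disabled-cache key unchanged"; fi
```

Because the disabled-cache key never changes, a single cached pass under that key matches every later run, whatever happened to the code in between.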

This is the same class of failure Charlie flagged in #team-fairies: a disabled-cache marker should never silently win — it should be loud.

Fix — make it fatal in CI, in both cases

There are two ways a disabled-cache marker can reach the test cache, and both should be fatal in CI rather than silently disabling a test:

  1. Untracked/modified files during a CI run (ci3/cache_content_hash): already hard-fails (exit 1) when CI=1. No change needed.
  2. A poisoned success-cache entry for a disabled-cache command (ci3/run_test_cmd) — previously served as a silent SKIPPED. This PR makes it fatal:
    • Write side: never write a success-cache entry for a disabled-cache command, so the cache can no longer be poisoned.
    • Read side: if a poisoned entry is nonetheless present, purge it and, in CI, exit 1 with a clear error instead of skipping.
ci3/run_test_cmd | +15 -3
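
The write-side and read-side rules above can be sketched roughly as follows. This is not the actual diff to ci3/run_test_cmd. It makes three assumptions: a temporary directory stands in for the redis success cache, the hypothetical `is_disabled_cache` recognises the marker from a key prefix (the real key is a hash of the marker string), and outside CI the sketch purges the entry and then runs the test. It also returns 1 where the PR exits 1, so the function can be called repeatedly.

```shell
#!/usr/bin/env bash
# Sketch only: a directory of empty files stands in for the redis success cache.
CACHE_DIR=$(mktemp -d)

cache_has() { [ -e "$CACHE_DIR/$1" ]; }
cache_set() { : > "$CACHE_DIR/$1"; }   # the real code uses SETEX with a TTL
cache_del() { rm -f "$CACHE_DIR/$1"; }

# Hypothetical marker check; assumes the key itself carries the marker.
is_disabled_cache() {
  case "$1" in
    disabled-cache*) return 0 ;;
    *) return 1 ;;
  esac
}

# run_cached KEY CMD...: returns 0 on pass or skip, non-zero on failure.
run_cached() {
  local key=$1
  shift
  if cache_has "$key"; then
    if is_disabled_cache "$key"; then
      # Read side: a hit here can only be a poisoned entry, so purge it.
      cache_del "$key"
      if [ "${CI:-0}" = 1 ]; then
        echo "ERROR: poisoned success-cache entry for disabled-cache command: $key" >&2
        return 1
      fi
    else
      echo "SKIPPED $key"
      return 0
    fi
  fi
  "$@" || return $?
  # Write side: never record a pass for a disabled-cache command.
  is_disabled_cache "$key" || cache_set "$key"
}

CI=1
cache_set "disabled-cache-demo"   # simulate a pre-existing poisoned entry
run_cached "disabled-cache-demo" true || echo "fatal, key purged"
run_cached "disabled-cache-demo" true && echo "ran; nothing cached"
cache_has "disabled-cache-demo" || echo "cache still clean"
```

In this sketch the first call fails loudly and removes the key, and the second call executes the command without writing a new entry, which matches the one-time failure described in the rollout note.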

Relationship to #23658

This supersedes the run_test_cmd change in #23658. That PR made the poisoned entry a silent bypass (run the test anyway); per the thread, the team wants it to be a fatal error. Same one line, opposite resolution — #23658 should be closed or rebased onto this so the two don't conflict when the train merges up. Everything else in #23658's diagnosis is correct.

Rollout note

The poisoned keys are constant and currently live in redis with up to a 7-day TTL. On the first CI run after this lands, each poisoned disabled-cache command will fail fatally once and purge its own key; subsequent runs are clean and the tests run normally. If you'd rather avoid the one-time red, flush the affected redis keys at deploy time. Happy to provide the key pattern.

Testing

bash -n ci3/run_test_cmd passes. The change is gated on CI=1 / USE_TEST_CACHE=1 / CI_REDIS_AVAILABLE=1, which aren't reproducible locally without the CI redis; behavior is otherwise unchanged for non-disabled-cache commands.


Created by claudebox · group: slackbot

A 'disabled-cache' test command's cache key is a constant string rather
than a content hash, so once a pass was written to the test success cache
the command matched on every later run and was skipped permanently (e.g.
all TXE tests on merge-train/fairies; PR #23106 logged 709 SKIPPED / 0 run).

Stop writing success-cache entries for disabled-cache commands so the cache
can no longer be poisoned, and treat any pre-existing poisoned entry as a
fatal CI error (purging it) instead of silently skipping. This mirrors
cache_content_hash, which already hard-fails in CI when uncommitted/untracked
files would otherwise disable the cache.
AztecBot added the claudebox label ("Owned by claudebox. it can push to this PR.") on May 29, 2026
AztecBot (Collaborator, Author) commented Jun 4, 2026

Automatically closing this stale claudebox draft PR (no updates for 5+ days). Re-open if still needed.

AztecBot closed this Jun 4, 2026