Repoint the Codex gpt-5-codex alias in Daily Cache Strategy Analyzer — it resolves to gpt-5-codex-alpha-2025-11-07, which the API proxy returns 404 for, failing the run.
Problem statement
Daily Cache Strategy Analyzer (engine codex) fails when its experiment variant selects gpt-5-codex. The Codex CLI resolves that to the dated snapshot gpt-5-codex-alpha-2025-11-07, which the in-cluster API proxy does not serve, returning 404 Not Found: Model not found. All sampling retries (1/5–5/5) and both harness attempts hit the same 404, so the agent produces no output and the job fails.
Evidence
- Run §27843593919 (2026-06-19 19:00 UTC, main, schedule), Execute Codex CLI:
{"type":"error","message":"unexpected status 404 Not Found: Model not found gpt-5-codex-alpha-2025-11-07, url: (172.30.0.30/redacted) ..."}
{"type":"turn.failed", ...} then WARN codex_models_manager::model_info: Unknown model gpt-5-codex is used. This will use fallback model metadata.
[codex-harness] all 3 retries exhausted — giving up (exitCode=1); classifier isInvalidModelError=false (the 404 is not caught as an invalid-model error).
- Config: .github/workflows/daily-cache-strategy-analyzer.md lines 17–24 → engine.id: codex, model: ${{ needs.activation.outputs.model_size }}, experiment variants [gpt-5.4, gpt-5-codex].
- Pattern: failed 4 of last 6 scheduled runs: 27843593919, 27783093380, 27713303874, 27571281247 (fail); 27642839368, 27508571355 (success). Failures correlate with the gpt-5-codex variant; gpt-5.4 runs succeed.
Probable root cause
The gpt-5-codex model alias maps to a retired alpha snapshot (...-alpha-2025-11-07) that the proxy no longer serves. The experiment keeps selecting gpt-5-codex, so roughly half of runs are routed to a dead model and fail with a 404.
Proposed remediation
- Update the Codex alias for gpt-5-codex to a currently-served snapshot in the proxy/model-alias config, or drop gpt-5-codex from the experiment variants until the proxy serves it.
- Classify upstream 404 Model not found as an invalid-model/non-retryable error in codex-harness so it fails fast with a clear message instead of 5×2 silent sampling retries.
- Add a model-availability pre-flight (the harness already calls awf-reflect//models) that rejects an unserved model before the agent turn.
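The second and third remediation items could look roughly like the sketch below. It is illustrative only: the harness's real language, structure, and function names are not shown in this report, so missing_model, should_retry, UnservedModelError, and preflight are hypothetical names, and the list of served models is assumed to be already available as plain strings. The regular expression is derived from the error message quoted under Evidence.

```python
import json
import re
from typing import Iterable, Optional

# Pattern derived from the error message quoted in the Evidence section.
_MODEL_404 = re.compile(r"404 Not Found: Model not found\s+([^\s,]+)")


class UnservedModelError(RuntimeError):
    """Hypothetical named, non-retryable failure class for an unserved model."""


def missing_model(jsonl_line: str) -> Optional[str]:
    """Return the unserved model name if the line is a model-404 error event."""
    try:
        event = json.loads(jsonl_line)
    except json.JSONDecodeError:
        return None  # plain log lines (e.g. WARN ...) are not JSON events
    if not isinstance(event, dict) or event.get("type") != "error":
        return None
    match = _MODEL_404.search(str(event.get("message", "")))
    return match.group(1) if match else None


def should_retry(jsonl_line: str) -> bool:
    """A model-404 is deterministic, so retrying the same request cannot help."""
    return missing_model(jsonl_line) is None


def preflight(model: str, served_models: Iterable[str]) -> None:
    """Reject an unserved model before the agent turn starts."""
    if model not in set(served_models):
        raise UnservedModelError(f"model {model!r} is not served by the proxy")
```

With a check of this kind, the error event from run 27843593919 would be classified as non-retryable on the first attempt, and the pre-flight would stop the job before any sampling request is sent.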
Success criteria / verification
- Scheduled runs selecting gpt-5-codex complete without 404 ... Model not found gpt-5-codex-alpha-2025-11-07.
- A 404 model error is surfaced as a named, non-retryable failure class.
- Model-alias config is covered by a test asserting every advertised Codex alias resolves to a served model.
Parent: #39883. Related class: previously-tracked Codex model-404 cluster referenced in #39946 (this is a current, workflow-specific recurrence with a concrete dead alias).
References:
Generated by 🔍 [aw] Failure Investigator (6h) · 444.2 AIC · ⌖ 12.3 AIC · ⊞ 4.9K · ◷
Related to [aw-failures] [aw] Failure Investigation Report — 6h window (2026-06-17 19:34 UTC) #39883