[aw-failures] [aw] Daily Cache Strategy Analyzer fails on Codex 'gpt-5-codex' — alias resolves to retired gpt-5-codex-alpha-2025 [Content truncated due to length] #40381

Description

@github-actions

Repoint the Codex gpt-5-codex alias in Daily Cache Strategy Analyzer — it resolves to gpt-5-codex-alpha-2025-11-07, which the API proxy returns 404 for, failing the run.

Problem statement

Daily Cache Strategy Analyzer (engine codex) fails when its experiment variant selects gpt-5-codex. The Codex CLI resolves that to the dated snapshot gpt-5-codex-alpha-2025-11-07, which the in-cluster API proxy does not serve, returning 404 Not Found: Model not found. All sampling retries (1/5–5/5) and both harness attempts hit the same 404, so the agent produces no output and the job fails.

Evidence

  • Run §27843593919 (2026-06-19 19:00 UTC, main, schedule), Execute Codex CLI:
    • {"type":"error","message":"unexpected status 404 Not Found: Model not found gpt-5-codex-alpha-2025-11-07, url: (172.30.0.30/redacted) ..."}
    • {"type":"turn.failed", ...} then WARN codex_models_manager::model_info: Unknown model gpt-5-codex is used. This will use fallback model metadata.
    • [codex-harness] all 3 retries exhausted — giving up (exitCode=1); classifier isInvalidModelError=false (the 404 is not caught as an invalid-model error).
  • Config: .github/workflows/daily-cache-strategy-analyzer.md lines 17–24 → engine.id: codex, model: ${{ needs.activation.outputs.model_size }}, experiment variants [gpt-5.4, gpt-5-codex].
  • Pattern: failed 4 of last 6 scheduled runs — 27843593919, 27783093380, 27713303874, 27571281247 (fail); 27642839368, 27508571355 (success). Failures correlate with the gpt-5-codex variant; gpt-5.4 runs succeed.

Probable root cause

The gpt-5-codex model alias maps to a retired alpha snapshot (...-alpha-2025-11-07) that the proxy no longer serves. The experiment keeps selecting gpt-5-codex, so ~half of runs route to a dead model and 404.

Proposed remediation

  1. Update the Codex alias for gpt-5-codex to a currently-served snapshot in the proxy/model-alias config, or drop gpt-5-codex from the experiment variants until the proxy serves it.
  2. Classify upstream 404 Model not found as an invalid-model, non-retryable error in codex-harness so the run fails fast with a clear message instead of silently repeating five sampling retries on each of the two harness attempts.
  3. Add a model-availability pre-flight (the harness already calls awf-reflect//models) that rejects an unserved model before the agent turn.
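
Items 2 and 3 could look roughly like the sketch below. It is written in Python purely for illustration and is not the codex-harness implementation: `classify_model_404`, `should_retry`, `preflight`, and the `ModelNotFound` type are hypothetical names, the regular expression is modelled only on the error message quoted under Evidence, and the set of served models is assumed to have been fetched already from the proxy's model listing.

```python
import re
from dataclasses import dataclass
from typing import Optional, Set

# Hypothetical pattern, modelled on the 404 message quoted under Evidence.
_MODEL_NOT_FOUND = re.compile(
    r"404 Not Found: Model not found\s+(?P<model>\S+?)(?:,|\s|$)"
)


@dataclass(frozen=True)
class ModelNotFound:
    """Named, non-retryable failure class for an unserved model."""
    model: str


def classify_model_404(message: str) -> Optional[ModelNotFound]:
    """Return ModelNotFound if the message is an upstream model 404, else None."""
    match = _MODEL_NOT_FOUND.search(message)
    return ModelNotFound(match.group("model")) if match else None


def should_retry(message: str) -> bool:
    """Retrying cannot fix a missing model, so a model 404 is never retried."""
    return classify_model_404(message) is None


def preflight(resolved_model: str, served_models: Set[str]) -> None:
    """Reject an unserved model before the agent turn starts.

    served_models is assumed to come from the proxy's model listing.
    """
    if resolved_model not in served_models:
        raise ValueError(
            f"model {resolved_model!r} is not served by the API proxy"
        )
```

With a classifier of this shape, the message from run 27843593919 would be reported as a `ModelNotFound` for `gpt-5-codex-alpha-2025-11-07` on the first attempt, while transient errors such as a 5xx would still be retried.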

Success criteria / verification

  • Scheduled runs selecting gpt-5-codex complete without 404 ... Model not found gpt-5-codex-alpha-2025-11-07.
  • A 404 model error is surfaced as a named, non-retryable failure class.
  • Model-alias config is covered by a test asserting every advertised Codex alias resolves to a served model.
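
The third criterion could be checked with something like the following minimal sketch. The helper name `unserved_aliases` and the shape of the inputs (an alias-to-snapshot mapping and a set of served model names) are assumptions for illustration; the real alias config format and the proxy's actual served-model list are not shown in this issue.

```python
from typing import Dict, Set


def unserved_aliases(aliases: Dict[str, str], served: Set[str]) -> Dict[str, str]:
    """Return every alias whose resolved target is not in the served set."""
    return {
        alias: target
        for alias, target in aliases.items()
        if target not in served
    }


def assert_all_aliases_served(aliases: Dict[str, str], served: Set[str]) -> None:
    """Fail with a message naming each dead alias and its target."""
    dead = unserved_aliases(aliases, served)
    assert not dead, f"aliases resolve to unserved models: {dead}"
```

Run against the state described above (illustrative data only), the check would flag the dead alias:

```python
aliases = {"gpt-5-codex": "gpt-5-codex-alpha-2025-11-07"}  # mapping reported in this issue
served = {"gpt-5.4"}  # assumed served set, for illustration
print(unserved_aliases(aliases, served))
```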

Parent: #39883. Related class: previously-tracked Codex model-404 cluster referenced in #39946 (this is a current, workflow-specific recurrence with a concrete dead alias).

Generated by 🔍 [aw] Failure Investigator (6h) · 444.2 AIC · ⌖ 12.3 AIC · ⊞ 4.9K ·

  • expires on Jun 26, 2026, 11:36 AM UTC-08:00
