Parent: #39883 (6h failure investigation report).
Problem statement
Daily BYOK Ollama Test fails at the agent job (Execute GitHub Copilot CLI) with 0 turns — the agent never starts a turn. The Copilot harness aborts during model discovery (awf-reflect) because the BYOK API proxy returns HTTP 503 on its /v1/models endpoint, exhausting all retries.
Affected workflow and run IDs
- Workflow:
Daily BYOK Ollama Test (.github/workflows/daily-byok-ollama-test.lock.yml), engine copilot (GitHub Copilot CLI 1.0.63), model claude-sonnet-4.6, trigger schedule.
- Failed run: §27851504668 — 2026-06-19 22:36 UTC,
main. turns=0, agentic_fraction=0, duration 4.3m, exit code 1.
- Baseline: audit reports
baseline_found=false (no successful cohort run for comparison).
Evidence (agent-stdio.log)
[warning] awf-reflect models fetch for (apiproxy/redacted) failed (attempt 1/5): models fetch returned 503
... retries 2/5 .. 5/5 ...
[warning] awf-reflect models fetch ... failed after 4 retry attempts: models fetch returned 503 for (apiproxy/redacted)
[copilot-harness] awf-reflect: models fetch returned 503 for (apiproxy/redacted)
[copilot-harness] done: exitCode=1 totalDuration=52s
Firewall audit: 22 requests, 0 blocked — the network reached the proxy; the proxy itself answered 503, so this is not a firewall/policy denial.
Probable root cause
The BYOK backend behind awf-api-proxy (api-proxy:10002) did not serve /v1/models (Ollama model not pulled / server not yet ready / upstream 503). Model discovery is treated as fatal, so the harness exits before any agent turn. The retry budget (5 attempts over ~3.75s) is too short to ride out a cold Ollama start.
Proposed remediation
- Add a readiness gate on
/v1/models (expect 200) before the agent job starts, and pre-pull/warm the target Ollama model.
- Increase the
awf-reflect model-fetch retry budget/backoff on the BYOK path so a transient/cold-start 503 does not hard-fail the run.
- Emit a classified error (e.g.
isBYOKModelDiscoveryError) instead of a bare exitCode=1, so this is distinguishable from agent-side failures in audit.
Success criteria / verification
Generated by 🔍 [aw] Failure Investigator (6h) · 172.3 AIC · ⌖ 11.7 AIC · ⊞ 4.9K · ◷
Parent: #39883 (6h failure investigation report).
Problem statement
Daily BYOK Ollama Testfails at theagentjob (Execute GitHub Copilot CLI) with 0 turns — the agent never starts a turn. The Copilot harness aborts during model discovery (awf-reflect) because the BYOK API proxy returns HTTP503on its/v1/modelsendpoint, exhausting all retries.Affected workflow and run IDs
Daily BYOK Ollama Test(.github/workflows/daily-byok-ollama-test.lock.yml), enginecopilot(GitHub Copilot CLI1.0.63), modelclaude-sonnet-4.6, triggerschedule.main.turns=0,agentic_fraction=0, duration 4.3m, exit code 1.baseline_found=false(no successful cohort run for comparison).Evidence (agent-stdio.log)
Firewall audit: 22 requests, 0 blocked — the network reached the proxy; the proxy itself answered 503, so this is not a firewall/policy denial.
Probable root cause
The BYOK backend behind
awf-api-proxy(api-proxy:10002) did not serve/v1/models(Ollama model not pulled / server not yet ready / upstream 503). Model discovery is treated as fatal, so the harness exits before any agent turn. The retry budget (5 attempts over ~3.75s) is too short to ride out a cold Ollama start.Proposed remediation
/v1/models(expect 200) before the agent job starts, and pre-pull/warm the target Ollama model.awf-reflectmodel-fetch retry budget/backoff on the BYOK path so a transient/cold-start 503 does not hard-fail the run.isBYOKModelDiscoveryError) instead of a bareexitCode=1, so this is distinguishable from agent-side failures in audit.Success criteria / verification
turns=0exit 1./v1/modelsreturns 200 before the agent starts.Related to [aw-failures] [aw] Failure Investigation Report — 6h window (2026-06-17 19:34 UTC) #39883