Executive Summary
Fix the AWF firewall startup path — a single infrastructure fault (awf-cli-proxy cannot reach the host DIFC proxy at host.docker.internal:18443) aborted every scheduled agentic run in this window before the agent was ever invoked. Three distinct workflows failed identically in the last 6h, one of them (PR Sous Chef) three times consecutively. Token usage and turn count were 0 on every failure — the engine never started, so this is purely an infra/firewall race, not a model, prompt, or auth problem.
This signature is new and untracked. It is distinct from #37989 (Copilot PAT auth failure at turn 1 — that reaches the agent; this never does).
Failure Cluster
| Cluster |
Workflows affected |
Runs |
Engine(s) |
Class |
First→Last seen |
| DIFC proxy unreachable — agent never invoked |
PR Sous Chef, Issue Monster, Daily Code Metrics |
5 (3 unique signatures + PR Sous Chef ×3) |
Copilot and Claude — engine-agnostic |
Infra / firewall startup |
17:30 → 19:19 UTC |
Per-workflow recovery state:
- PR Sous Chef (hourly) — P0: 3 consecutive failures (17:30, 18:26, 19:19 UTC). 100% failure for 2h; last 3 runs all down. Onset ~17:30.
- Issue Monster (hourly) — P1: failed 19:18 (succeeded 18:17). First occurrence.
- Daily Code Metrics (daily) — P1: failed 19:07. Only run of the day; no recovery sample yet.
Evidence
Identical fatal error in all three runs' agent-stdio.log:
[cli-proxy] DIFC proxy probe failed (attempt 1/2), retrying in 1s...
[cli-proxy] ERROR: DIFC proxy liveness probe failed for localhost:18443 (gh api exit=0)
[cli-proxy] gh api error: Get "(localhost/redacted) dial tcp [::1]:18443: connect: connection refused
[cli-proxy] Failing fast to avoid repeated in-agent retries
[ERROR] Fatal error: AWF firewall failed to start: awf-cli-proxy could not connect to the
external DIFC proxy ... The agent was never invoked.
Process exiting with code: 1
The host proxy was healthy — in all three failed runs the host-side start_cli_proxy.sh logged CLI proxy ready on port 18443 (its IPv4 `curl (localhost/redacted) succeeded) seconds before the sidecar's probe failed:
| Run |
Workflow |
Host proxy "ready" |
Sidecar fails fast |
| 27229990045 |
PR Sous Chef |
19:23:49 |
~19:24:10 |
| 27229956123 |
Issue Monster |
19:21:41 |
~19:22:02 |
| 27229336540 |
Daily Code Metrics |
19:10:26 |
~19:10:50 |
Audit-diff vs. last successful runs (firewall + GitHub API deltas)
audit-diff between failed and prior-successful runs shows no model/MCP regression — only the expected engine-routing difference (api.anthropic.com vs api.githubcopilot.com) and elevated core_consumed on the GitHub API (16 → 281–342, +1656–2038%) consistent with retried gh api probes against the dead sidecar socket. The firewall summary reports has_anomalies: false; the failure is the startup abort, not in-flight traffic.
firewall_diff: +api.githubcopilot.com / -api.anthropic.com (engine routing only, no anomalies)
run_metrics_diff: token_usage 0 → 0 (agent never ran); core_consumed +1656%..+2038% (probe retries)
Probable Root Cause
Two compounding faults, both in the gh-aw-controlled firewall startup path:
- IPv4/IPv6 readiness mismatch.
start_cli_proxy.sh binds the host proxy on --listen 0.0.0.0:18443 (IPv4) and validates it with curl (localhost/redacted), which resolves to 127.0.0.1. The AWF awf-cli-proxysidecar probeslocalhost:18443and resolves it to **IPv6[::1]:18443**, where nothing is listening → connection refused`. The host readiness gate therefore passes while the path the sidecar actually uses is unverified.
- Fail-fast is too aggressive. The sidecar gives up after 2 attempts ~1s apart ("Failing fast"), then aborts the entire run with no agent-level recovery. Any momentary blip in
host.docker.internal reachability is fatal — which is why the fault is intermittent (works most of the time) yet unrecoverable when it hits.
Existing Issue Correlation
No open issue tracked closed/fixed this window. No closures performed — no fresh evidence that any open agentic-workflows issue is resolved or stale.
Fix Roadmap
- P0 — Stop fail-fast from killing the run. Increase the
awf-cli-proxy liveness-probe retry budget / backoff (current 2 attempts/~1s) so a transient host.docker.internal blip is survived rather than aborting before agent invocation. Owned by the AWF bundle (v0.26.0+); pin/configure via gh-aw if exposed.
- P0 — Close the IPv4/IPv6 gap. Make the host proxy reachable on both stacks (bind IPv6 loopback too) or force the sidecar to dial
127.0.0.1:18443, or have start_cli_proxy.sh validate the actual host.docker.internal path instead of host-local curl localhost. (See sub-issue.)
- P1 — Add startup observability. Emit the sidecar's resolved dial target and the host proxy bind address into the run summary so this signature is one-glance diagnosable next time.
Sub-Issues Created
- IPv4/IPv6 + fail-fast fix for
awf-cli-proxy ↔ host DIFC proxy (linked below).
References: §27229990045 · §27229956123 · §27229336540
Generated by 🔍 [aw] Failure Investigator (6h) · 218 AIC · ⌖ 14 AIC · ⊞ 5.1K · ◷
Executive Summary
Fix the AWF firewall startup path — a single infrastructure fault (
awf-cli-proxycannot reach the host DIFC proxy athost.docker.internal:18443) aborted every scheduled agentic run in this window before the agent was ever invoked. Three distinct workflows failed identically in the last 6h, one of them (PR Sous Chef) three times consecutively. Token usage and turn count were 0 on every failure — the engine never started, so this is purely an infra/firewall race, not a model, prompt, or auth problem.This signature is new and untracked. It is distinct from #37989 (Copilot PAT auth failure at turn 1 — that reaches the agent; this never does).
Failure Cluster
Per-workflow recovery state:
Evidence
Identical fatal error in all three runs'
agent-stdio.log:The host proxy was healthy — in all three failed runs the host-side
start_cli_proxy.shloggedCLI proxy ready on port 18443(its IPv4 `curl (localhost/redacted) succeeded) seconds before the sidecar's probe failed:Audit-diff vs. last successful runs (firewall + GitHub API deltas)
audit-diffbetween failed and prior-successful runs shows no model/MCP regression — only the expected engine-routing difference (api.anthropic.comvsapi.githubcopilot.com) and elevatedcore_consumedon the GitHub API (16 → 281–342, +1656–2038%) consistent with retriedgh apiprobes against the dead sidecar socket. The firewall summary reportshas_anomalies: false; the failure is the startup abort, not in-flight traffic.Probable Root Cause
Two compounding faults, both in the gh-aw-controlled firewall startup path:
start_cli_proxy.shbinds the host proxy on--listen 0.0.0.0:18443(IPv4) and validates it withcurl (localhost/redacted), which resolves to127.0.0.1. The AWFawf-cli-proxysidecar probeslocalhost:18443and resolves it to **IPv6[::1]:18443**, where nothing is listening →connection refused`. The host readiness gate therefore passes while the path the sidecar actually uses is unverified.host.docker.internalreachability is fatal — which is why the fault is intermittent (works most of the time) yet unrecoverable when it hits.Existing Issue Correlation
[aw-failures] Intermittent Copilot SDK auth failure (PAT not supported)— distinct, keep open. That failure reaches turn 1; this one never invokes the agent. No PAT-auth recurrence observed in this 6h window (no fresh evidence to close).[aw] No-Op Runs— unrelated topic (no-op completions, not startup failures). Keep open.[agentic-token-audit] Daily AIC Usage Audit— audit report, not a failure tracker. Keep open.No open issue tracked closed/fixed this window. No closures performed — no fresh evidence that any open
agentic-workflowsissue is resolved or stale.Fix Roadmap
awf-cli-proxyliveness-probe retry budget / backoff (current 2 attempts/~1s) so a transienthost.docker.internalblip is survived rather than aborting before agent invocation. Owned by the AWF bundle (v0.26.0+); pin/configure via gh-aw if exposed.127.0.0.1:18443, or havestart_cli_proxy.shvalidate the actualhost.docker.internalpath instead of host-localcurl localhost. (See sub-issue.)Sub-Issues Created
awf-cli-proxy↔ host DIFC proxy (linked below).References: §27229990045 · §27229956123 · §27229336540