Skip to content

[aw-failures] AWF firewall fails to start — awf-cli-proxy cannot reach host DIFC proxy (host.docker.internal:18443); agents neve [Content truncated due to length] #38201

Description

@github-actions

Executive Summary

Fix the AWF firewall startup path — a single infrastructure fault (awf-cli-proxy cannot reach the host DIFC proxy at host.docker.internal:18443) aborted every scheduled agentic run in this window before the agent was ever invoked. Three distinct workflows failed identically in the last 6h, one of them (PR Sous Chef) three times consecutively. Token usage and turn count were 0 on every failure — the engine never started, so this is purely an infra/firewall race, not a model, prompt, or auth problem.

This signature is new and untracked. It is distinct from #37989 (Copilot PAT auth failure at turn 1 — that reaches the agent; this never does).

Failure Cluster

Cluster Workflows affected Runs Engine(s) Class First→Last seen
DIFC proxy unreachable — agent never invoked PR Sous Chef, Issue Monster, Daily Code Metrics 5 (3 unique signatures + PR Sous Chef ×3) Copilot and Claude — engine-agnostic Infra / firewall startup 17:30 → 19:19 UTC

Per-workflow recovery state:

  • PR Sous Chef (hourly) — P0: 3 consecutive failures (17:30, 18:26, 19:19 UTC). 100% failure for 2h; last 3 runs all down. Onset ~17:30.
  • Issue Monster (hourly) — P1: failed 19:18 (succeeded 18:17). First occurrence.
  • Daily Code Metrics (daily) — P1: failed 19:07. Only run of the day; no recovery sample yet.

Evidence

Identical fatal error in all three runs' agent-stdio.log:

[cli-proxy] DIFC proxy probe failed (attempt 1/2), retrying in 1s...
[cli-proxy] ERROR: DIFC proxy liveness probe failed for localhost:18443 (gh api exit=0)
[cli-proxy] gh api error: Get "(localhost/redacted) dial tcp [::1]:18443: connect: connection refused
[cli-proxy] Failing fast to avoid repeated in-agent retries
[ERROR] Fatal error: AWF firewall failed to start: awf-cli-proxy could not connect to the
external DIFC proxy ... The agent was never invoked.
Process exiting with code: 1

The host proxy was healthy — in all three failed runs the host-side start_cli_proxy.sh logged CLI proxy ready on port 18443 (its IPv4 `curl (localhost/redacted) succeeded) seconds before the sidecar's probe failed:

Run Workflow Host proxy "ready" Sidecar fails fast
27229990045 PR Sous Chef 19:23:49 ~19:24:10
27229956123 Issue Monster 19:21:41 ~19:22:02
27229336540 Daily Code Metrics 19:10:26 ~19:10:50
Audit-diff vs. last successful runs (firewall + GitHub API deltas)

audit-diff between failed and prior-successful runs shows no model/MCP regression — only the expected engine-routing difference (api.anthropic.com vs api.githubcopilot.com) and elevated core_consumed on the GitHub API (16 → 281–342, +1656–2038%) consistent with retried gh api probes against the dead sidecar socket. The firewall summary reports has_anomalies: false; the failure is the startup abort, not in-flight traffic.

firewall_diff: +api.githubcopilot.com / -api.anthropic.com  (engine routing only, no anomalies)
run_metrics_diff: token_usage 0 → 0 (agent never ran); core_consumed +1656%..+2038% (probe retries)

Probable Root Cause

Two compounding faults, both in the gh-aw-controlled firewall startup path:

  1. IPv4/IPv6 readiness mismatch. start_cli_proxy.sh binds the host proxy on --listen 0.0.0.0:18443 (IPv4) and validates it with curl (localhost/redacted), which resolves to 127.0.0.1. The AWF awf-cli-proxysidecar probeslocalhost:18443and resolves it to **IPv6[::1]:18443**, where nothing is listening → connection refused`. The host readiness gate therefore passes while the path the sidecar actually uses is unverified.
  2. Fail-fast is too aggressive. The sidecar gives up after 2 attempts ~1s apart ("Failing fast"), then aborts the entire run with no agent-level recovery. Any momentary blip in host.docker.internal reachability is fatal — which is why the fault is intermittent (works most of the time) yet unrecoverable when it hits.

Existing Issue Correlation

No open issue tracked closed/fixed this window. No closures performed — no fresh evidence that any open agentic-workflows issue is resolved or stale.

Fix Roadmap

  • P0 — Stop fail-fast from killing the run. Increase the awf-cli-proxy liveness-probe retry budget / backoff (current 2 attempts/~1s) so a transient host.docker.internal blip is survived rather than aborting before agent invocation. Owned by the AWF bundle (v0.26.0+); pin/configure via gh-aw if exposed.
  • P0 — Close the IPv4/IPv6 gap. Make the host proxy reachable on both stacks (bind IPv6 loopback too) or force the sidecar to dial 127.0.0.1:18443, or have start_cli_proxy.sh validate the actual host.docker.internal path instead of host-local curl localhost. (See sub-issue.)
  • P1 — Add startup observability. Emit the sidecar's resolved dial target and the host proxy bind address into the run summary so this signature is one-glance diagnosable next time.

Sub-Issues Created

  • IPv4/IPv6 + fail-fast fix for awf-cli-proxy ↔ host DIFC proxy (linked below).

References: §27229990045 · §27229956123 · §27229336540

Generated by 🔍 [aw] Failure Investigator (6h) · 218 AIC · ⌖ 14 AIC · ⊞ 5.1K ·

  • expires on Jun 16, 2026, 11:46 AM UTC-08:00

Metadata

Metadata

Assignees

No one assigned

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions