fix(network-isolation): break topology-attach ordering deadlock starving cli-proxy health gate #5544

Merged: lpcox merged 6 commits into main from copilot/network-isolation-fix-deadlock on Jun 25, 2026

Copilot AI commented Jun 25, 2026

In --network-isolation + --topology-attach mode, the cli-proxy liveness probe fires before topology peers are joined to awf-net, causing EAI_AGAIN → fail-fast → agent never invoked. The deadlock is structural: docker compose up -d blocks until cli-proxy is healthy, but the peer attach that would satisfy that probe only ran after startContainers() returned.

Fix A — Phased startup (eliminates the deadlock)

startContainers() now accepts an optional onNetworkReady callback. When provided, startup is split into three phases, the last of which is the existing health-gated bring-up:

  1. docker compose up -d --no-deps squid-proxy — creates awf-net without waiting on dependents
  2. onNetworkReady() — attaches topology peers to awf-net
  3. docker compose up -d — full bring-up; cli-proxy probe now resolves the peer

cli-workflow.ts wires connectTopologyContainers as the onNetworkReady hook and removes the now-redundant post-startup Step 2.5. Non-topology runs (no onNetworkReady) are completely unaffected; all existing api-proxy/squid one-shot retry logic is preserved in Phase 3.

Before:  compose up -d (blocks) → [DEADLOCK] → attach peers → never runs
After:   up squid-proxy → attach peers → compose up -d (succeeds)
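
The phased sequence above can be sketched as follows. This is a minimal illustration, not the code in src/container-lifecycle.ts: the `ComposeRunner` type, the `startContainersSketch` and `traceStartup` names, and the injected runner are assumptions introduced so the ordering can be exercised without Docker. Only the compose arguments and the `onNetworkReady` callback come from the description, and the existing retry logic is not modelled.

```typescript
// Hypothetical runner: executes `docker compose <args>` and resolves when it exits.
type ComposeRunner = (args: string[]) => Promise<void>;

// Sketch of the phased startup. Names other than onNetworkReady are illustrative.
async function startContainersSketch(
  compose: ComposeRunner,
  onNetworkReady?: () => Promise<void>,
): Promise<void> {
  if (onNetworkReady) {
    // Phase 1: start only squid-proxy so that awf-net exists,
    // without waiting on any health-gated dependents.
    await compose(["up", "-d", "--no-deps", "squid-proxy"]);
    // Phase 2: attach topology peers while nothing is blocked on cli-proxy health.
    await onNetworkReady();
  }
  // Phase 3 (the only phase when no hook is given): the usual health-gated bring-up.
  await compose(["up", "-d"]);
}

// Records the order of operations with fake collaborators, e.g. for a unit test.
async function traceStartup(withTopology: boolean): Promise<string[]> {
  const steps: string[] = [];
  const compose: ComposeRunner = async (args) => {
    steps.push(`compose ${args.join(" ")}`);
  };
  const attachPeers = async (): Promise<void> => {
    steps.push("attach peers");
  };
  await startContainersSketch(compose, withTopology ? attachPeers : undefined);
  return steps;
}
```

With the hook, the recorded order is the squid-proxy bring-up, then the peer attach, then the full bring-up. Without it, only the single full bring-up runs, which matches the claim that non-topology runs are unaffected.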

Fix B — DNS failure classification in cli-proxy entrypoint (hardening)

The liveness probe classifier now recognises EAI_AGAIN/ENOTFOUND/getaddrinfo as dns-not-yet-ready rather than unknown. With Fix A this path should never be hit, but the classification ensures residual races surface with a clear diagnostic rather than a confusing unknown-failure message.
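
The classification rule can be illustrated with a short sketch. The real classifier lives in the shell entrypoint (containers/cli-proxy/entrypoint.sh); the TypeScript below only mirrors the matching logic described above. The `classifyProbeFailure` name and the `ProbeFailureKind` type are hypothetical, and any other categories the real classifier may distinguish are not modelled.

```typescript
// Only the two outcomes named in the description are modelled here.
type ProbeFailureKind = "dns-not-yet-ready" | "unknown";

// Substrings the description lists as DNS-related resolution failures.
const DNS_FAILURE_PATTERN = /EAI_AGAIN|ENOTFOUND|getaddrinfo/;

// Classify the captured output of a failed liveness probe.
function classifyProbeFailure(probeOutput: string): ProbeFailureKind {
  return DNS_FAILURE_PATTERN.test(probeOutput) ? "dns-not-yet-ready" : "unknown";
}
```

For example, an output containing `getaddrinfo EAI_AGAIN` is reported as dns-not-yet-ready, while a connection-refused error still falls through to unknown.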

Copilot AI changed the title from "[WIP] Fix ordering deadlock in cli-proxy health check" to "fix(network-isolation): break topology-attach ordering deadlock starving cli-proxy health gate" Jun 25, 2026
Copilot AI requested a review from lpcox June 25, 2026 21:19
Copilot finished work on behalf of lpcox June 25, 2026 21:19
@lpcox lpcox marked this pull request as ready for review June 25, 2026 21:28
Copilot AI review requested due to automatic review settings June 25, 2026 21:28

Copilot AI left a comment

Pull request overview

Restructures container startup in --network-isolation + --topology-attach runs to eliminate a startup-ordering deadlock where docker compose up -d blocks on cli-proxy health before topology peers are attached to awf-net.

Changes:

  • Add an optional onNetworkReady callback to startContainers() and split startup so squid-proxy (and awf-net) come up before attaching topology peers, then perform the normal health-gated full bring-up.
  • Wire topology peer attachment into the new callback in runMainWorkflow() and remove the post-start attachment step.
  • Harden cli-proxy liveness probe diagnostics by classifying DNS-related resolution failures.
| File | Description |
| --- | --- |
| src/container-start.test.ts | Adds unit tests covering phased startup sequencing and retry behavior under the new callback flow. |
| src/container-lifecycle.ts | Implements phased startup via onNetworkReady and updates the function contract/docs. |
| src/cli-workflow.ts | Passes topology attachment as onNetworkReady during container startup and removes redundant post-start attachment. |
| src/cli-workflow.test.ts | Updates workflow tests to validate the callback-based topology attachment behavior. |
| containers/cli-proxy/entrypoint.sh | Improves liveness probe failure classification for DNS-related errors. |

Copilot's findings

  • Files reviewed: 5/5 changed files
  • Comments generated: 4

Comment thread src/container-lifecycle.ts Outdated
Comment thread src/container-lifecycle.ts Outdated
Comment thread src/container-start.test.ts Outdated
Comment thread containers/cli-proxy/entrypoint.sh Outdated
lpcox and others added 2 commits June 25, 2026 14:37
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
@github-actions

✅ Copilot review passed with no inline comments.

@copilot Add the ready-for-aw label to this PR to trigger agentic CI smoke tests.

lpcox and others added 2 commits June 25, 2026 14:38
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
@github-actions

github-actions Bot commented Jun 25, 2026

Chroot tests passed! Smoke Chroot - All security and functionality tests succeeded.

@github-actions

github-actions Bot commented Jun 25, 2026

Build Test Suite completed successfully!

@github-actions

github-actions Bot commented Jun 25, 2026

Security Guard failed. Please review the logs for details.

@github-actions

github-actions Bot commented Jun 25, 2026

Smoke Copilot BYOK AOAI (api-key) completed. Copilot AOAI BYOK (api-key) mode operational. 🔓

@github-actions

github-actions Bot commented Jun 25, 2026

Smoke Gemini completed. All facets verified. 💎

@github-actions

github-actions Bot commented Jun 25, 2026

📰 VERDICT: Smoke Copilot has concluded. All systems operational. This is a developing story. 🎤

@github-actions

github-actions Bot commented Jun 25, 2026

📡 Smoke OTel Tracing completed. All tracing scenarios validated. ✅

@github-actions

github-actions Bot commented Jun 25, 2026

🔌 Smoke Services — All services reachable! ✅

@github-actions

github-actions Bot commented Jun 25, 2026

Smoke Claude passed

@github-actions

github-actions Bot commented Jun 25, 2026

✨ The prophecy is fulfilled... Smoke Codex has completed its mystical journey. The stars align. 🌟

@github-actions

github-actions Bot commented Jun 25, 2026

Smoke Copilot BYOK completed. Copilot BYOK mode operational. 🔓

@github-actions

github-actions Bot commented Jun 25, 2026

Contribution Check completed successfully!

Contribution guidelines review complete for PR #5544: no important gaps found in the provided CONTRIBUTING.md checklist context.

@github-actions

github-actions Bot commented Jun 25, 2026

Smoke Copilot BYOK AOAI (Entra) completed. Copilot AOAI BYOK (Entra) mode operational. 🔓

@github-actions

github-actions Bot commented Jun 25, 2026

🔑 Smoke Copilot PAT: PAT auth validated. All systems operational. ✅

@github-actions

Smoke Test: Claude Engine Validation

  • API status: ✅ PASS
  • gh check: ✅ PASS
  • File status: ✅ PASS

Overall result: PASS

Generated by Smoke Claude for issue #5544 · 61.4 AIC · ⊞ 3.3K

@github-actions

Smoke Test: Copilot BYOK (Direct) Mode

  • ✅ GitHub MCP: Connected
  • ✅ GitHub.com: HTTP 200
  • ✅ File I/O: Verified
  • ✅ BYOK Inference: Active

Status: PASS — Running in direct BYOK mode (COPILOT_PROVIDER_API_KEY) via api-proxy → api.githubcopilot.com

🔑 BYOK report filed by Smoke Copilot BYOK

@github-actions

🔥 Smoke Test Results

| Test | Status |
| --- | --- |
| GitHub MCP connectivity | |
| GitHub.com HTTP connectivity | |
| File write/read | |

Overall: PASS

PR: fix(network-isolation): break topology-attach ordering deadlock starving cli-proxy health gate
Author: @Copilot | Assignees: @lpcox @Copilot

📰 BREAKING: Report filed by Smoke Copilot

@github-actions

🔬 Smoke Test: Copilot PAT Auth — PASS

| Test | Result |
| --- | --- |
| GitHub MCP connectivity | |
| GitHub.com HTTP (200) | |
| File write/read | |

PR: fix(network-isolation): break topology-attach ordering deadlock starving cli-proxy health gate
Author: @Copilot | Assignees: @lpcox, @Copilot
Auth mode: PAT (COPILOT_GITHUB_TOKEN)

🔑 PAT report filed by Smoke Copilot PAT

@github-actions

Chroot Version Comparison Results

| Runtime | Host Version | Chroot Version | Match? |
| --- | --- | --- | --- |
| Python | Python 3.12.13 | Python 3.12.3 | ❌ NO |
| Node.js | v24.17.0 | v22.23.0 | ❌ NO |
| Go | go1.22.12 | go1.22.12 | ✅ YES |

Overall: ❌ Not all tests passed — Python and Node.js versions differ between host and chroot environments.

Tested by Smoke Chroot

@github-actions

✅ GitHub MCP Testing
✅ GitHub.com Connectivity
✅ File Write/Read Test
✅ BYOK Inference Test

Running in direct BYOK mode (AWF_AUTH_TYPE=github-oidc + AWF_AUTH_AZURE_* + COPILOT_PROVIDER_BASE_URL) via api-proxy → Azure OpenAI (Foundry, o4-mini-aw) authenticated via Microsoft Entra

Overall status: PASS

@lpcox

🪪 BYOK (AOAI Entra) report filed by Smoke Copilot BYOK AOAI (Entra)

@github-actions

🔭 Smoke Test: API Proxy OpenTelemetry Tracing

| Scenario | Result | Detail |
| --- | --- | --- |
| 1. Module Loading | | otel.js loaded; exports: startRequestSpan, setTokenAttributes, setBudgetAttributes, endSpan, endSpanError, shutdown, isEnabled + internal helpers |
| 2. Test Suite | | 39/39 tests passed in otel.test.js (0 failures) |
| 3. Env Var Forwarding | | api-proxy-env-config.ts forwards GH_AW_OTLP_ENDPOINTS, OTEL_EXPORTER_OTLP_ENDPOINT, OTEL_EXPORTER_OTLP_HEADERS, GITHUB_AW_OTEL_TRACE_ID, GITHUB_AW_OTEL_PARENT_SPAN_ID, OTEL_SERVICE_NAME |
| 4. Token Tracker Integration | | onUsage callback present in token-tracker-http.js (line 283 destructured, invoked at line 324) |
| 5. OTEL Diagnostics | | No spans exported (expected: no OTEL endpoint configured in test run; graceful degradation confirmed) |

All scenarios pass. OTEL tracing integration is healthy.

📡 OTel tracing validated by Smoke OTel Tracing

@github-actions

Smoke Test: Gemini Engine Validation

  • PR titles: Unknown (Unable to fetch)
  • GitHub MCP Testing: ❌
  • GitHub.com Connectivity: ❌
  • File Writing Testing: ✅
  • Bash Tool Testing: ✅

Overall status: FAIL

Warning

Firewall blocked 1 domain

The following domain was blocked by the firewall during workflow execution:

  • localhost

To allow these domains, add them to the network.allowed list in your workflow frontmatter:

network:
  allowed:
    - defaults
    - "localhost"

See Network Configuration for more information.

💎 Faceted by Smoke Gemini

@github-actions

Merged PRs reviewed:

  • refactor(api-proxy): extract sliding-window data structure into rate-limiter-window.js
  • refactor: split agent-volumes-mounts.test.ts by feature area

Checks:

  • GitHub title: ✅
  • Smoke file: ✅
  • Discussion: ✅
  • Build: ✅

Overall: PASS

Warning

Firewall blocked 1 domain

The following domain was blocked by the firewall during workflow execution:

  • registry.npmjs.org

To allow these domains, add them to the network.allowed list in your workflow frontmatter:

network:
  allowed:
    - defaults
    - "registry.npmjs.org"

See Network Configuration for more information.

🔮 The oracle has spoken through Smoke Codex

@github-actions

@Copilot @lpcox

  • GitHub MCP connectivity: ✅
  • GitHub.com HTTP connectivity: ✅
  • Sandbox file I/O: ✅
  • BYOK inference path: ✅

Running in direct BYOK mode (COPILOT_PROVIDER_API_KEY + COPILOT_PROVIDER_BASE_URL) via api-proxy → Azure OpenAI (Foundry, o4-mini-aw)

Overall: PASS

🔑 BYOK (AOAI api-key) report filed by Smoke Copilot BYOK AOAI (api-key)

@github-actions

🏗️ Build Test Suite Results

Ecosystem Project Build/Install Tests Status
Bun elysia 1/1 passed ✅ PASS
Bun hono 1/1 passed ✅ PASS
C++ fmt N/A ✅ PASS
C++ json N/A ✅ PASS
Deno oak N/A 1/1 passed ✅ PASS
Deno std N/A 1/1 passed ✅ PASS
.NET hello-world N/A ✅ PASS
.NET json-parse N/A ✅ PASS
Go color passed ✅ PASS
Go env passed ✅ PASS
Go uuid passed ✅ PASS
Java gson 1/1 passed ✅ PASS
Java caffeine 1/1 passed ✅ PASS
Node.js clsx all passed ✅ PASS
Node.js execa all passed ✅ PASS
Node.js p-limit all passed ✅ PASS
Rust fd 1/1 passed ✅ PASS
Rust zoxide 1/1 passed ✅ PASS

Overall: 8/8 ecosystems passed — ✅ PASS

Generated by Build Test Suite for issue #5544 · 69.9 AIC · ⊞ 7.8K

@github-actions

Smoke Test: GitHub Actions Services Connectivity

Check Result
Redis PING ❌ timeout (no response)
PostgreSQL pg_isready ❌ no response
PostgreSQL SELECT 1 ❌ timeout (no response)

Overall: FAIL

host.docker.internal resolves to 172.17.0.1 but ports 6379 (Redis) and 5432 (PostgreSQL) are unreachable from inside the AWF sandbox — blocked by iptables firewall rules (per setup-iptables.sh which explicitly blocks database ports).

🔌 Service connectivity validated by Smoke Services

@lpcox lpcox merged commit 135e87f into main Jun 25, 2026
85 of 88 checks passed
@lpcox lpcox deleted the copilot/network-isolation-fix-deadlock branch June 25, 2026 22:09
3 participants