fix(network-isolation): break topology-attach ordering deadlock starving cli-proxy health gate#5544
Conversation
There was a problem hiding this comment.
Pull request overview
Restructures container startup in --network-isolation + --topology-attach runs to eliminate a startup-ordering deadlock where docker compose up -d blocks on cli-proxy health before topology peers are attached to awf-net.
Changes:
- Add an optional
onNetworkReadycallback tostartContainers()and split startup sosquid-proxy(andawf-net) come up before attaching topology peers, then perform the normal health-gated full bring-up. - Wire topology peer attachment into the new callback in
runMainWorkflow()and remove the post-start attachment step. - Harden
cli-proxyliveness probe diagnostics by classifying DNS-related resolution failures.
Show a summary per file
| File | Description |
|---|---|
| src/container-start.test.ts | Adds unit tests covering phased startup sequencing and retry behavior under the new callback flow. |
| src/container-lifecycle.ts | Implements phased startup via onNetworkReady and updates the function contract/docs. |
| src/cli-workflow.ts | Passes topology attachment as onNetworkReady during container startup and removes redundant post-start attachment. |
| src/cli-workflow.test.ts | Updates workflow tests to validate the callback-based topology attachment behavior. |
| containers/cli-proxy/entrypoint.sh | Improves liveness probe failure classification for DNS-related errors. |
Copilot's findings
Tip
Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
- Files reviewed: 5/5 changed files
- Comments generated: 4
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
|
✅ Copilot review passed with no inline comments. @copilot Add the |
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
|
Chroot tests passed! Smoke Chroot - All security and functionality tests succeeded. |
|
✅ Build Test Suite completed successfully! |
|
❌ Security Guard failed. Please review the logs for details. |
|
✅ Smoke Copilot BYOK AOAI (api-key) completed. Copilot AOAI BYOK (api-key) mode operational. 🔓 |
|
✅ Smoke Gemini completed. All facets verified. 💎 |
|
📰 VERDICT: Smoke Copilot has concluded. All systems operational. This is a developing story. 🎤 |
|
📡 Smoke OTel Tracing completed. All tracing scenarios validated. ✅ |
|
🔌 Smoke Services — All services reachable! ✅ |
|
✅ Smoke Claude passed |
|
✨ The prophecy is fulfilled... Smoke Codex has completed its mystical journey. The stars align. 🌟 |
|
✅ Smoke Copilot BYOK completed. Copilot BYOK mode operational. 🔓 |
|
✅ Contribution Check completed successfully! Contribution guidelines review complete for PR #5544: no important gaps found in the provided CONTRIBUTING.md checklist context. |
|
✅ Smoke Copilot BYOK AOAI (Entra) completed. Copilot AOAI BYOK (Entra) mode operational. 🔓 |
|
🔑 Smoke Copilot PAT PAT auth validated. All systems operational. ✅ |
Smoke Test: Claude Engine Validation
Overall result: PASS
|
|
Smoke Test: Copilot BYOK (Direct) Mode ✅
Status: PASS — Running in direct BYOK mode (COPILOT_PROVIDER_API_KEY) via api-proxy → api.githubcopilot.com
|
🔥 Smoke Test Results
Overall: PASS PR: fix(network-isolation): break topology-attach ordering deadlock starving cli-proxy health gate
|
🔬 Smoke Test: Copilot PAT Auth — PASS
PR: fix(network-isolation): break topology-attach ordering deadlock starving cli-proxy health gate
|
Chroot Version Comparison Results
Overall: ❌ Not all tests passed — Python and Node.js versions differ between host and chroot environments.
|
|
✅ GitHub MCP Testing Running in direct BYOK mode (AWF_AUTH_TYPE=github-oidc + AWF_AUTH_AZURE_* + COPILOT_PROVIDER_BASE_URL) via api-proxy → Azure OpenAI (Foundry, o4-mini-aw) authenticated via Microsoft Entra Overall status: PASS
|
🔭 Smoke Test: API Proxy OpenTelemetry Tracing
All scenarios pass. OTEL tracing integration is healthy.
|
Smoke Test: Gemini Engine Validation
Overall status: FAIL Warning Firewall blocked 1 domainThe following domain was blocked by the firewall during workflow execution:
network:
allowed:
- defaults
- "localhost"See Network Configuration for more information.
|
|
Merged PRs reviewed:
Warning Firewall blocked 1 domainThe following domain was blocked by the firewall during workflow execution:
network:
allowed:
- defaults
- "registry.npmjs.org"See Network Configuration for more information.
|
|
|
🏗️ Build Test Suite Results
Overall: 8/8 ecosystems passed — ✅ PASS
|
Smoke Test: GitHub Actions Services Connectivity
Overall: FAIL
|
In
--network-isolation+--topology-attachmode, the cli-proxy liveness probe fires before topology peers are joined toawf-net, causingEAI_AGAIN→ fail-fast → agent never invoked. The deadlock is structural:docker compose up -dblocks until cli-proxy is healthy, but the peer attach that would satisfy that probe only ran afterstartContainers()returned.Fix A — Phased startup (eliminates the deadlock)
startContainers()now accepts an optionalonNetworkReadycallback. When provided, startup is split into three phases before the existing health-gated bring-up:docker compose up -d --no-deps squid-proxy— createsawf-netwithout waiting on dependentsonNetworkReady()— attaches topology peers toawf-netdocker compose up -d— full bring-up; cli-proxy probe now resolves the peercli-workflow.tswiresconnectTopologyContainersas theonNetworkReadyhook and removes the now-redundant post-startup Step 2.5. Non-topology runs (noonNetworkReady) are completely unaffected; all existing api-proxy/squid one-shot retry logic is preserved in Phase 3.Fix B — DNS failure classification in cli-proxy entrypoint (hardening)
The liveness probe classifier now recognises
EAI_AGAIN/ENOTFOUND/getaddrinfoasdns-not-yet-readyrather thanunknown. With Fix A this path should never be hit, but the classification ensures residual races surface with a clear diagnostic rather than a confusing unknown-failure message.