Clean up stale gh-aw containers on self-hosted runners before binding gateway ports #38979

@zarenner

Summary

A self-hosted runner canary failed before the agent started because the MCP gateway container could not bind the fixed host port 8080:

failed to listen on 0.0.0.0:8080: listen tcp 0.0.0.0:8080: bind: address already in use

This looks like stale gh-aw container/runtime state remaining on a persistent self-hosted runner. The generated workflow should clean up gh-aw-owned containers/processes/ports when needed, or otherwise avoid fixed-port collisions, so subsequent jobs on the same runner do not fail during gateway startup.

Run details

  • Workflow type: self-hosted Ubuntu 24.04 x64 host runner canary
  • Run ID: 27450109858
  • Job ID: 81143711103
  • Event: push
  • Run attempt: 2
  • gh aw CLI: v0.79.6
  • AWF/firewall: v0.27.2
  • MCP gateway image: ghcr.io/github/gh-aw-mcpg:v0.3.25
  • GitHub MCP server image: ghcr.io/github/github-mcp-server:v1.1.2

What happened

I ran the debugger flow from debug.md:

gh aw audit 27450109858 --json

The audit reported:

  • run status: completed
  • conclusion: failure
  • failing job: agent
  • detection and safe_outputs jobs: skipped
  • turns/tool usage: 0, because the agent never got past gateway startup

The generated workflow sets a fixed gateway port:

export MCP_GATEWAY_PORT="8080"

and starts the gateway with Docker host networking:

docker run -i --rm --network host ... ghcr.io/github/gh-aw-mcpg:v0.3.25

The gateway starts initialization and then exits:

[info] Gateway started with PID: 36949
[info] Waiting for gateway to initialize...
[error] ERROR: Gateway process (PID: 36949) exited during initialization
[INFO] Gateway will listen on 0.0.0.0:8080
[INFO] Command: ./awmg --routed --listen 0.0.0.0:8080 --config-stdin --log-dir /tmp/gh-aw/mcp-logs/
failed to listen on 0.0.0.0:8080: listen tcp 0.0.0.0:8080: bind: address already in use
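
A preflight check before docker run could surface the conflict earlier, and with more context, than the gateway's own bind error. The following is a minimal sketch, not gh-aw's actual startup script: port_is_listening and preflight_gateway_port are hypothetical helper names, and it assumes ss (from iproute2) is available on the runner. Because the gateway runs with --network host, Docker records no port mapping for it, so the sketch inspects host listeners instead of the docker ps port column.

```shell
# Hypothetical helper: succeeds if the `ss -H -ltn` output on stdin
# contains a listener whose local port equals $1.
port_is_listening() {
  awk -v port="$1" '
    {
      # Field 4 is Local Address:Port, e.g. 0.0.0.0:8080 or [::]:8080;
      # the port is whatever follows the last colon.
      n = split($4, parts, ":")
      if (parts[n] == port) found = 1
    }
    END { exit (found ? 0 : 1) }
  '
}

# Hypothetical preflight: report the holder of the gateway port, if any.
preflight_gateway_port() {
  port="${MCP_GATEWAY_PORT:-8080}"
  if ss -H -ltn | port_is_listening "$port"; then
    echo "port $port is already in use on this runner:" >&2
    # -p adds the owning process; seeing other users' processes needs root.
    ss -H -ltnp "sport = :$port" >&2
    return 1
  fi
}
```

Calling preflight_gateway_port just before the gateway container starts would turn the late "exited during initialization" error into an early failure that names the listener.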

Expected behavior

On persistent self-hosted runners, gh-aw should be resilient to leftover state from prior runs. Before starting containers/listening on fixed host ports, it should do one or more of the following:

  • remove/stop stale gh-aw-owned containers that may still be running from previous jobs;
  • clean up gh-aw-owned listeners/processes that are still bound to the gateway port;
  • label/name containers with enough run metadata to safely identify stale gh-aw resources;
  • allocate non-conflicting per-run ports instead of always using 8080; or
  • fail with a diagnostic that identifies the container/process holding the port and suggests the cleanup command.
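
The first option above could be sketched as follows. This is an illustration under stated assumptions, not gh-aw's implementation: the helper names and the GATEWAY_IMAGE variable are hypothetical, and containers are matched by the gateway image from the run details because the report does not say whether gh-aw labels or names its containers.

```shell
# Image taken from the run details above; override for other versions.
GATEWAY_IMAGE="${GATEWAY_IMAGE:-ghcr.io/github/gh-aw-mcpg:v0.3.25}"

# Hypothetical helper: print IDs of running containers created from the image.
stale_gateway_containers() {
  docker ps -q --filter "ancestor=${GATEWAY_IMAGE}"
}

# Hypothetical helper: force-remove those containers, if any exist.
cleanup_stale_gateways() {
  ids="$(stale_gateway_containers)"
  if [ -z "$ids" ]; then
    echo "no stale gateway containers found"
    return 0
  fi
  # Word splitting is intended here: one container ID per argument.
  # shellcheck disable=SC2086
  docker rm -f $ids
}
```

Matching on the image alone cannot tell a stale container from one that belongs to a job still running on the same host, which is why labeling containers with run metadata (the third option) would make this kind of cleanup safer.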

Why this matters

Hosted runners are ephemeral, but self-hosted runners persist across jobs. If gh-aw leaves containers or host-port listeners behind, later runs can fail before any agent turn. The detection and safe-output jobs are then skipped, and the canary looks like an agent failure when the real cause is runner hygiene and runtime cleanup.
