Clean up stale gh-aw containers on self-hosted runners before binding gateway ports

## Summary

A self-hosted runner canary failed before the agent started because the MCP gateway container could not bind the fixed host port `8080`:

```text
failed to listen on 0.0.0.0:8080: listen tcp 0.0.0.0:8080: bind: address already in use
```

This looks like stale gh-aw container or runtime state left behind on a persistent self-hosted runner. The generated workflow should clean up gh-aw-owned containers, processes, and port listeners when needed, or otherwise avoid fixed-port collisions, so that later jobs on the same runner do not fail during gateway startup.

## Run details

- Workflow type: self-hosted Ubuntu 24.04 x64 host runner canary
- Run ID: `27450109858`
- Job ID: `81143711103`
- Event: `push`
- Run attempt: `2`
- `gh aw` CLI: `v0.79.6`
- AWF/firewall: `v0.27.2`
- MCP gateway image: `ghcr.io/github/gh-aw-mcpg:v0.3.25`
- GitHub MCP server image: `ghcr.io/github/github-mcp-server:v1.1.2`

## What happened

I ran the debugger flow from `debug.md`:

```bash
gh aw audit 27450109858 --json
```

The audit reported:

- run status: `completed`
- conclusion: `failure`
- failing job: `agent`
- `detection` and `safe_outputs` skipped
- turns/tool usage: `0`, because the agent never got past gateway startup

The generated workflow sets a fixed gateway port:

```bash
export MCP_GATEWAY_PORT="8080"
```

and starts the gateway with Docker host networking:

```bash
docker run -i --rm --network host ... ghcr.io/github/gh-aw-mcpg:v0.3.25
```

The gateway starts initialization and then exits:

```text
[info] Gateway started with PID: 36949
[info] Waiting for gateway to initialize...
[error] ERROR: Gateway process (PID: 36949) exited during initialization
[INFO] Gateway will listen on 0.0.0.0:8080
[INFO] Command: ./awmg --routed --listen 0.0.0.0:8080 --config-stdin --log-dir /tmp/gh-aw/mcp-logs/
failed to listen on 0.0.0.0:8080: listen tcp 0.0.0.0:8080: bind: address already in use
```
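To confirm what is holding the port on the runner, a check along the following lines could run before gateway startup. This is only a sketch and is not taken from the generated workflow: `port_in_use` and `report_port_holder` are hypothetical helper names, the probe relies on bash's `/dev/tcp` redirection, and `ss` is assumed to be available (it normally is on Ubuntu 24.04). `ss -p` may need elevated privileges to show processes owned by other users.

```bash
#!/usr/bin/env bash
# Hypothetical helpers; names are illustrative and not part of gh-aw.

# Succeeds (exit 0) if something accepts TCP connections on 127.0.0.1:<port>.
# A listener bound to 0.0.0.0 is reachable this way; one bound only to a
# non-loopback address would not be detected.
port_in_use() {
  local port="$1"
  # The subshell closes fd 3 on exit and contains the failed-redirect error.
  (exec 3<>"/dev/tcp/127.0.0.1/${port}") 2>/dev/null
}

# Prints a short diagnostic; returns 1 when the port is already taken.
report_port_holder() {
  local port="$1"
  if ! port_in_use "$port"; then
    echo "port ${port} is free"
    return 0
  fi
  echo "port ${port} is in use"
  if command -v ss >/dev/null 2>&1; then
    # List the listening TCP socket and, where permitted, the owning process.
    ss -H -ltnp "sport = :${port}"
  fi
  return 1
}
```

Calling `report_port_holder 8080` on the affected runner would show whether the listener belongs to a leftover gateway process or to something unrelated.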

## Expected behavior

On persistent self-hosted runners, gh-aw should be resilient to leftover state from prior runs. Before starting containers/listening on fixed host ports, it should do one or more of the following:

- remove/stop stale gh-aw-owned containers that may still be running from previous jobs;
- clean up gh-aw-owned listeners/processes that are still bound to the gateway port;
- label/name containers with enough run metadata to safely identify stale gh-aw resources;
- allocate non-conflicting per-run ports instead of always using `8080`; or
- fail with a diagnostic that identifies the container/process holding the port and suggests the cleanup command.
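Two of the options above, label-based cleanup and per-run port selection, can be sketched as follows. This is an illustration under stated assumptions, not gh-aw's implementation: the label `com.example.gh-aw.owned=true` is hypothetical (the issue does not establish that gh-aw labels its containers today), and `cleanup_stale_containers`, `port_in_use`, and `pick_gateway_port` are made-up helper names. The port probe uses bash's `/dev/tcp` and is inherently racy, since another process can grab the port between the check and the bind.

```bash
#!/usr/bin/env bash
# Hypothetical label: assumes gh-aw would tag every container it starts.
GH_AW_LABEL="${GH_AW_LABEL:-com.example.gh-aw.owned=true}"

# Force-remove every container (running or exited) that carries the label.
cleanup_stale_containers() {
  local ids
  ids="$(docker ps -aq --filter "label=${GH_AW_LABEL}")"
  if [ -z "$ids" ]; then
    echo "no stale containers found"
    return 0
  fi
  # Word splitting is intended here: one container ID per argument.
  # shellcheck disable=SC2086
  docker rm -f $ids
}

# Succeeds if something accepts TCP connections on 127.0.0.1:<port>.
port_in_use() {
  local port="$1"
  (exec 3<>"/dev/tcp/127.0.0.1/${port}") 2>/dev/null
}

# Print the first port in [start, end] with no listener; return 1 if none.
pick_gateway_port() {
  local start="${1:-8080}"
  local end="${2:-8180}"
  local port="$start"
  while [ "$port" -le "$end" ]; do
    if ! port_in_use "$port"; then
      echo "$port"
      return 0
    fi
    port=$((port + 1))
  done
  echo "no free port in ${start}-${end}" >&2
  return 1
}
```

A pre-start step could then run `cleanup_stale_containers` and set `MCP_GATEWAY_PORT="$(pick_gateway_port 8080 8180)"` instead of hard-coding `8080`. Filtering by label rather than by image name keeps the cleanup from touching containers that other workloads on the same runner own.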

## Why this matters

Hosted runners are ephemeral, but self-hosted runners persist across jobs. If gh-aw leaves containers or host-port listeners behind, later runs can fail before any agent turn. The `detection` and `safe_outputs` jobs are then skipped, and the canary looks like an agent failure when the real problem is runner hygiene and runtime cleanup.
