[ARC-DinD] auto-detection should also engage --docker-host-path-prefix when DOCKER_HOST is a unix socket from a sibling daemon pod #33777

Description

@boydj

Summary

gh-aw v0.74.4's runtime detection for ARC/DinD only adds
--docker-host-path-prefix /tmp/gh-aw to the AWF invocation when
DOCKER_HOST matches ^tcp:// (PR #31996, broadened from
^tcp://(localhost|127\.0\.0\.1) on 2026-05-13). On Actions Runner
Controller deployments where the runner pod mounts a sibling daemon
pod's docker socket as unix:///var/run/docker.sock, the regex never
fires, so --docker-host-path-prefix is never enabled. The agent
container then bind-mounts paths the daemon cannot see, and the run
fails.

This is the same root condition #30840 / #30838 / #28888 describe — the
runner and daemon filesystems are split — but in our shape DOCKER_HOST
is a unix socket because the docker socket is bind-mounted into the
runner pod from the daemon pod, not exposed over TCP.

Symptom

The agent harness fails immediately on prompt-file read:

[entrypoint] Executing command: /bin/bash -c '... --prompt-file /tmp/gh-aw/aw-prompts/prompt.txt'
[claude-harness] fatal: --prompt-file '/tmp/gh-aw/aw-prompts/prompt.txt'
  is not readable: ENOENT: no such file or directory, stat '/tmp/gh-aw/aw-prompts/prompt.txt'

The activation phase wrote the prompt to /tmp/gh-aw/aw-prompts/prompt.txt
on the runner. AWF was then launched with:

sudo -E awf --config "${RUNNER_TEMP}/gh-aw/awf-config.json" \
  --container-workdir "${GITHUB_WORKSPACE}" \
  --mount "${RUNNER_TEMP}/gh-aw:${RUNNER_TEMP}/gh-aw:ro" \
  --mount "${RUNNER_TEMP}/gh-aw:/host${RUNNER_TEMP}/gh-aw:ro" \
  ${GH_AW_DOCKER_HOST_PATH_PREFIX_ARGS} \
  ...

GH_AW_DOCKER_HOST_PATH_PREFIX_ARGS was empty because:

if [[ "${DOCKER_HOST:-}" =~ ^tcp:// ]]; then
  GH_AW_DOCKER_HOST_PATH_PREFIX_ARGS="--docker-host-path-prefix /tmp/gh-aw"
fi

didn't match — DOCKER_HOST=unix:///var/run/docker.sock. So AWF emitted
bind-mounts of /tmp/gh-aw and ${RUNNER_TEMP}/gh-aw that the daemon
resolves against its own (different) filesystem.
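
The mismatch is easy to reproduce in isolation. The sketch below wraps the quoted check in a function and evaluates it against both DOCKER_HOST shapes. The name path_prefix_args_for is hypothetical, used only for illustration; gh-aw does not define it.

```bash
#!/usr/bin/env bash
# path_prefix_args_for is a hypothetical helper that mirrors the ^tcp:// check
# quoted in this issue. It prints the flag only for TCP-style DOCKER_HOST values.
path_prefix_args_for() {
  local docker_host="${1:-}"
  if [[ "$docker_host" =~ ^tcp:// ]]; then
    printf '%s' "--docker-host-path-prefix /tmp/gh-aw"
  fi
}

# Compare a TCP endpoint with the unix socket seen on this runner.
for host in "tcp://dind:2375" "unix:///var/run/docker.sock"; do
  printf '%s -> [%s]\n' "$host" "$(path_prefix_args_for "$host")"
done
```

The unix-socket line prints empty brackets, which is the empty GH_AW_DOCKER_HOST_PATH_PREFIX_ARGS we observed.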

Probe results

We ran a diagnostic workflow (diag-runner-state.yml) on this runner
that demonstrates both halves of the failure mode:

DOCKER_HOST=unix:///var/run/docker.sock
srw-rw---- 1 root 2375 0 May 21 14:31 /var/run/docker.sock

probe likely TCP daemon endpoints
  unreachable: tcp://localhost:2375
  unreachable: tcp://localhost:2376
  unreachable: tcp://127.0.0.1:2375
  unreachable: tcp://127.0.0.1:2376
  unreachable: tcp://dind:2375
  unreachable: tcp://dind:2376
  unreachable: tcp://docker-dind:2375
  unreachable: tcp://dockerd:2375

cross-check daemon-side filesystem
  -rw-r--r-- 1 runner runner 38 May 21 14:42 /tmp/runner-side-sentinel-900
  --- as seen from a container with /tmp bind-mounted ---
  ls: /tmp/runner-side-sentinel-900: No such file or directory
  MISSING

Two confirmed facts:

  1. The daemon pod does not expose any standard docker TCP port that the
    runner can reach. The only TCP services visible to the runner pod
    are kube-apiserver and a buildkit daemon — not docker.
  2. A sentinel file written from the runner at /tmp/runner-side-sentinel-X
    is invisible inside a container with -v /tmp:/tmp. The daemon
    resolves /tmp against its own pod filesystem.

So the existing detection logic — "TCP DOCKER_HOST implies split
filesystems" — is correct in spirit but incomplete. Unix-socket
DOCKER_HOST also implies split filesystems when the socket is
bind-mounted from a sibling pod, which is a common ARC topology.

Reproduction

ARC RunnerScaleSet configured with a sibling daemon pod and the
docker socket bind-mounted into the runner. No special workflow config
required — any AWF-backed agent workflow will fail on the prompt-file
read inside the agent container.

Suggested fix

A pure regex on DOCKER_HOST is insufficient because unix-socket
sibling-DinD looks the same as host-local docker from the env-var
perspective. The runtime probe needs another signal.

Two options that both work:

  1. Setup-time sentinel probe. Before composing the AWF flags, write a
    sentinel file to /tmp/gh-aw/.dind-probe-$$ on the runner, then run
    docker run --rm -v /tmp/gh-aw:/x alpine ls /x/.dind-probe-$$. If
    the sentinel is missing on the daemon side, the daemon filesystem is
    split, and --docker-host-path-prefix should be engaged regardless
    of DOCKER_HOST scheme. Clean up the sentinel afterwards. This costs
    one extra docker run per agent run, which is cheap.

  2. Stat-based check. stat -c '%d %i' /tmp/gh-aw on the runner vs.
    the same call inside a container with -v /tmp/gh-aw:/x. Different
    (device, inode) ⇒ split filesystem ⇒ engage prefix path. Slightly
    subtler than option 1.
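
The stat-based comparison in option 2 could be sketched roughly as follows. The function name dir_identity_matches and the DOCKER_BIN override are assumptions introduced here for illustration; neither exists in gh-aw or AWF. The sketch also assumes a stat that accepts -c, as GNU coreutils and BusyBox do.

```bash
#!/usr/bin/env bash
# Hypothetical sketch of the stat-based check.
# Returns 0 if runner and daemon report the same (device, inode) for the
# directory, 1 if they differ (split filesystem), 2 if the check is inconclusive.
dir_identity_matches() {
  local dir="${1:-/tmp/gh-aw}"
  local docker_bin="${DOCKER_BIN:-docker}"   # assumption: overridable for testing
  local here there

  # Identity of the directory as the runner sees it.
  here="$(stat -c '%d %i' "$dir" 2>/dev/null)" || return 2

  # Identity of the same path as resolved by the daemon for a bind mount.
  there="$("$docker_bin" run --rm -v "${dir}:/x:ro" alpine \
            stat -c '%d %i' /x 2>/dev/null)" || return 2

  [[ "$here" == "$there" ]]
}
```

A coincidental match of device and inode numbers across two unrelated filesystems is unlikely but not impossible, which is part of why this variant is harder to trust than a sentinel file.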

Option 1 is more direct and harder to misread.
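
A minimal sketch of the sentinel probe, under a few assumptions: detect_split_fs and DOCKER_BIN are hypothetical names (not part of gh-aw or AWF), and the container runs test -e instead of ls so that "file missing" (exit 1) can be told apart from docker run's own failure codes (125 and above).

```bash
#!/usr/bin/env bash
# Hypothetical sketch of the sentinel probe.
# Returns 0 if the daemon cannot see a file written by the runner (split
# filesystem), 1 if it can (shared filesystem), 2 if the probe is inconclusive.
detect_split_fs() {
  local dir="${1:-/tmp/gh-aw}"
  local docker_bin="${DOCKER_BIN:-docker}"   # assumption: overridable for testing
  local name=".dind-probe-$$"
  local sentinel="${dir}/${name}"
  local rc=0

  # Write the sentinel on the runner side.
  mkdir -p "$dir" || return 2
  : > "$sentinel" || return 2

  # The daemon resolves the bind-mount source against its own filesystem.
  "$docker_bin" run --rm -v "${dir}:/x:ro" alpine \
    test -e "/x/${name}" >/dev/null 2>&1 || rc=$?

  # Always clean up the sentinel, whatever the outcome.
  rm -f "$sentinel"

  case "$rc" in
    0) return 1 ;;   # sentinel visible: filesystems are shared
    1) return 0 ;;   # sentinel missing: filesystems are split
    *) return 2 ;;   # docker itself failed (pull error, daemon unreachable, ...)
  esac
}
```

A caller could combine this with the existing check, for example `[[ "${DOCKER_HOST:-}" =~ ^tcp:// ]] || detect_split_fs /tmp/gh-aw`, and decide separately how to treat the inconclusive result.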

A third, less elegant option: an explicit frontmatter override
(sandbox.compatibility: arc-dind per the proposal in #30840) so users
can force the path-prefix path on. We've been working around the issue
locally by tar-piping /tmp/gh-aw and ${RUNNER_TEMP}/gh-aw from the
runner into the daemon filesystem before AWF launches, but that's the
sort of hand-rolled workaround #30840 explicitly calls out as too
brittle for normal users.
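
A tar-pipe workaround of this kind might look roughly like the sketch below. It is an illustration under assumptions, not a transcript of any particular setup: sync_dir_to_daemon and DOCKER_BIN are hypothetical names, and it relies on the daemon creating the bind-mount source path on its side if it does not already exist.

```bash
#!/usr/bin/env bash
# Hypothetical sketch: copy a runner-side directory to the same path on the
# daemon's filesystem by streaming a tar archive through a helper container.
sync_dir_to_daemon() {
  local src="$1"
  local docker_bin="${DOCKER_BIN:-docker}"   # assumption: overridable for testing
  [[ -d "$src" ]] || return 1

  # tar reads from the runner filesystem; the container writes into /dst,
  # which the daemon maps to "$src" on its own filesystem.
  tar -C "$src" -cf - . \
    | "$docker_bin" run --rm -i -v "${src}:/dst" alpine tar -C /dst -xf -

  # Fail if either side of the pipe failed.
  local -a st=("${PIPESTATUS[@]}")
  (( st[0] == 0 && st[1] == 0 ))
}
```

It would have to run for both /tmp/gh-aw and ${RUNNER_TEMP}/gh-aw before every AWF launch, which illustrates why it is brittle.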

Why not just expose TCP on the daemon pod?

Cluster admins controlling the runner image sometimes will, sometimes
won't. On the deployment we hit this on (trusted-environments-gh-aw
running on a private GHES instance), the daemon pod is configured
socket-only and we don't have permission to change it. The unix-socket
sibling-DinD shape is supportable in principle — gh-aw just can't
detect it yet.

Related

Environment

  • gh-aw v0.74.4, AWF v0.25.46
  • Runner: ARC trusted-environments-gh-aw on Kubernetes, Ubuntu 24.04,
    Docker 29.5.1
  • DOCKER_HOST: unix:///var/run/docker.sock (bind-mounted from sibling
    daemon pod)
