
dockerfile: run buildkitd within a cgroup namespace for cgroup v2 #6368

Merged
tonistiigi merged 2 commits into moby:master from marxarelli:review/unshare-cgroupns-entrypoint
Feb 24, 2026

@marxarelli
Contributor

Introduce a new entrypoint script for the Linux image that, if cgroup v2 is in use, uses unshare to create new cgroup and mount namespaces for buildkitd and remounts /sys/fs/cgroup to restrict its view of the unified cgroup hierarchy. This ensures that its init cgroup and all cgroups managed by the OCI worker are kept beneath the root cgroup of the initial entrypoint process.
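As a rough illustration of the approach described above (not the script merged in this PR), an entrypoint along these lines could detect cgroup v2 and start buildkitd inside new cgroup and mount namespaces. The function names, the `DRY_RUN` and `CGROUP_ROOT` variables, and the exact remount commands are assumptions made for this sketch. It also assumes the util-linux version of `unshare`, which provides `--cgroup` and `--mount`.

```sh
#!/bin/sh
# Hypothetical sketch, not the entrypoint merged in this PR.

# cgroup v2 exposes cgroup.controllers at the root of the unified hierarchy.
is_cgroup_v2() {
  [ -f "${1:-/sys/fs/cgroup}/cgroup.controllers" ]
}

# Replace the shell with the given command, or only print it when DRY_RUN=1.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    printf '%s\n' "$*"
  else
    exec "$@"
  fi
}

# Start buildkitd, in new cgroup and mount namespaces when cgroup v2 is in use.
# CGROUP_ROOT is an assumed override that exists only to make this testable.
launch() {
  if is_cgroup_v2 "${CGROUP_ROOT:-/sys/fs/cgroup}"; then
    # Assumes util-linux unshare (--cgroup, --mount). The inner shell remounts
    # cgroupfs so that it is rooted at the new namespace's root cgroup.
    run unshare --cgroup --mount sh -c \
      'umount /sys/fs/cgroup && mount -t cgroup2 cgroup2 /sys/fs/cgroup && exec "$@"' \
      sh buildkitd "$@"
  else
    run buildkitd "$@"
  fi
}
```

A real entrypoint would end with `launch "$@"`. Running with `DRY_RUN=1` prints the command that would be executed, which is a convenient way to inspect the decision without needing privileges.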

When buildkitd is run in a managed environment like Kubernetes without its own cgroup namespace (the default behavior of privileged pods in Kubernetes where cgroup v2 is in use; see cgroup v2 KEP), the OCI worker will spawn processes in cgroups that are outside of the cgroup hierarchy that was created for the buildkitd container, leading to incorrect resource accounting and enforcement which in turn can cause OOM errors and CPU contention on the node.

Example behavior without this change:

root@k8s-node:/# cat /proc/$(pgrep -n buildkitd)/cgroup
0::/init
root@k8s-node:/# cat /proc/$(pgrep -n some-build-process)/cgroup
0::/buildkit/{runc-container-id}

Example behavior with this change:

root@k8s-node:/# cat /proc/$(pgrep -n buildkitd)/cgroup
0::/kubepods/burstable/pod{pod-id}/{container-id}/init
root@k8s-node:/# cat /proc/$(pgrep -n some-build-process)/cgroup
0::/kubepods/burstable/pod{pod-id}/{container-id}/buildkit/{runc-container-id}
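The comparison shown in the two transcripts can also be scripted. The following is a minimal sketch, assuming the cgroup v2 line format `0::<path>` used by `/proc/<pid>/cgroup`; the helper names `cgroup_v2_path` and `is_beneath` are hypothetical and not part of buildkit.

```sh
# Print the cgroup v2 path (the part after "0::") from a file in
# /proc/<pid>/cgroup format. Defaults to the current process.
cgroup_v2_path() {
  sed -n 's/^0:://p' "${1:-/proc/self/cgroup}"
}

# Succeed when cgroup path $1 is $2 itself or lies beneath it.
is_beneath() {
  parent=${2%/}   # strip a trailing slash so "/" is handled as the root
  case "$1" in
    "$parent"|"$parent"/*) return 0 ;;
    *) return 1 ;;
  esac
}
```

On a node, `is_beneath "$(cgroup_v2_path /proc/<build-pid>/cgroup)" "<container cgroup path>"` would be expected to fail for the first transcript and succeed for the second.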

Note: this was developed as an alternative approach to #6343.

@marxarelli
Contributor Author

@tonistiigi this is the alternative approach I mentioned in #6343 (comment).

Note that I first tried to implement the namespace creation and remounting in buildkitd itself using calls to unix.Unshare and unix.Mount, but encountered some strange behavior: the main buildkitd process was placed in a new cgroup namespace, but for some reason buildkit-runc was not. It may be that not all Go threads were moved into the new namespace; I'm not sure.

In any case, using unshare in the entrypoint seems less error prone.
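One way to check the thread hypothesis is to compare the cgroup namespace of every thread of a process. On Linux, unshare(2) applies to the calling thread, and each thread exposes its namespaces under `/proc/<pid>/task/<tid>/ns/`. The helper below is a hypothetical diagnostic sketch, not something from this PR; its optional second argument (an alternate proc root) is an assumption added so it can be exercised against a fake directory tree.

```sh
# Print the distinct cgroup namespaces used by the threads of a process.
#   $1: PID
#   $2: proc mount point (default /proc)
thread_cgroup_namespaces() {
  for link in "${2:-/proc}/$1"/task/*/ns/cgroup; do
    readlink "$link"
  done | sort -u
}
```

More than one line of output for a single PID would indicate that its threads are not all in the same cgroup namespace.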

@marxarelli force-pushed the review/unshare-cgroupns-entrypoint branch 2 times, most recently from 3cf93c3 to 7a50ed7 on November 17, 2025 20:57

@AkihiroSuda
Member

> When buildkitd is run in a managed environment like Kubernetes without its own cgroup namespace (the default behavior of privileged pods in Kubernetes where cgroup v2 is in use; see cgroup v2 KEP), the OCI worker will spawn processes in cgroups that are outside of the cgroup hierarchy that was created for the buildkitd container, leading to incorrect resource accounting and enforcement which in turn can cause OOM errors and CPU contention on the node.

In the long term can we just extend Kubernetes to support unsharing cgroupns?

@marxarelli
Contributor Author

> In the long term can we just extend Kubernetes to support unsharing cgroupns?

That would be ideal if Kubernetes had a field in SecurityContext for controlling that.

FWIW we've been using a custom entrypoint based on this PR for a couple of weeks. No issues so far.

 EOF
 ENV BUILDKIT_SETUP_CGROUPV2_ROOT=1
-ENTRYPOINT ["buildkitd"]
+ENTRYPOINT ["/usr/bin/buildkitd-entrypoint"]
Member

In the case of docker run --privileged this script does not seem needed, as Docker unshares cgroupns even for privileged mode.
So this entrypoint script should be opt-in.
It should be also marked as a workaround until Kubernetes supports unsharing cgroupns.

Contributor Author

Thanks for opening that KEP @AkihiroSuda. I added some words of support in the PR discussion.

I'll work on refactoring this PR to be opt-in as you requested.

Contributor Author

@marxarelli, Jan 21, 2026

> In the case of docker run --privileged this script does not seem needed, as Docker unshares cgroupns even for privileged mode. So this entrypoint script should be opt-in. It should be also marked as a workaround until Kubernetes supports unsharing cgroupns.

The entrypoint should already handle this case heuristically by testing whether the current cgroup path is "/" (which is true when run via docker run --privileged unless --cgroupns host is also given). In that case, it skips unshare and the with-cgroupfs-remount wrapper script and just does exec buildkitd.

I could also introduce an additional environment variable if you think that's best, but I personally quite like the heuristic behavior.
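The heuristic described in this comment can be sketched roughly as follows. This is an illustration only, not the merged script: the function name `launch_mode` and its two path arguments are hypothetical, and the sketch assumes the `0::<path>` line format of `/proc/self/cgroup` on cgroup v2.

```sh
# Decide how buildkitd should be started. Prints "direct" or "unshare".
#   $1: cgroup file to inspect (default /proc/self/cgroup)
#   $2: cgroupfs mount point   (default /sys/fs/cgroup)
launch_mode() {
  cgroup_file=${1:-/proc/self/cgroup}
  cgroup_root=${2:-/sys/fs/cgroup}

  # Without cgroup v2 there is nothing to do.
  if [ ! -f "$cgroup_root/cgroup.controllers" ]; then
    echo direct
    return
  fi

  # A cgroup path of "/" suggests the process already runs in its own
  # cgroup namespace, so unshare can be skipped.
  if [ "$(sed -n 's/^0:://p' "$cgroup_file")" = "/" ]; then
    echo direct
  else
    echo unshare
  fi
}
```

In this sketch, `direct` corresponds to a plain `exec buildkitd`, while `unshare` corresponds to going through `unshare` and the remount wrapper.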

Contributor Author

Just added a comment about it being a workaround.

Member

@AkihiroSuda left a comment

Member

@tonistiigi left a comment

Can't we reuse the BUILDKIT_SETUP_CGROUPV2_ROOT and build this into buildkitd (or a different "UNSHARE" env)? Or if not possible then I think I would prefer these as separate scripts in hack rather than inline heredocs.

@marxarelli
Contributor Author

> Can't we reuse the BUILDKIT_SETUP_CGROUPV2_ROOT and build this into buildkitd (or a different "UNSHARE" env)? Or if not possible then I think I would prefer these as separate scripts in hack rather than inline heredocs.

I actually tried an implementation in buildkitd but had quite a bit of trouble with it. Specifically, I could not reliably affect all threads with the call to syscall.Unshare, and the OCI worker would continue to operate outside of the new cgroup/mount namespace.

I will move the scripts into hack.

@marxarelli force-pushed the review/unshare-cgroupns-entrypoint branch from 4f59d47 to 8faed50 on January 22, 2026 23:29

@github-actions bot added the area/hack (building buildkit itself) label on Jan 22, 2026

Introduce a new entrypoint script for the Linux image that, if cgroup v2
is in use, creates a new cgroup and mount namespace for buildkitd within
a new entrypoint using `unshare` and remounts `/sys/fs/cgroup` to
restrict its view of the unified cgroup hierarchy. This will ensure its
`init` cgroup and all OCI worker managed cgroups are kept beneath the
root cgroup of the initial entrypoint process.

When buildkitd is run in a managed environment like Kubernetes without
its own cgroup namespace (the default behavior of privileged pods in
Kubernetes where cgroup v2 is in use; see [cgroup v2 KEP][kep]), the OCI
worker will spawn processes in cgroups that are outside of the cgroup
hierarchy that was created for the buildkitd container, leading to
incorrect resource accounting and enforcement which in turn can cause
OOM errors and CPU contention on the node.

Example behavior without this change:

```console
root@k8s-node:/# cat /proc/$(pgrep -n buildkitd)/cgroup
0::/init
root@k8s-node:/# cat /proc/$(pgrep -n some-build-process)/cgroup
0::/buildkit/{runc-container-id}
```

Example behavior with this change:

```console
root@k8s-node:/# cat /proc/$(pgrep -n buildkitd)/cgroup
0::/kubepods/burstable/pod{pod-id}/{container-id}/init
root@k8s-node:/# cat /proc/$(pgrep -n some-build-process)/cgroup
0::/kubepods/burstable/pod{pod-id}/{container-id}/buildkit/{runc-container-id}
```

Note this was developed as an alternative approach to moby#6343

[kep]: https://github.com/kubernetes/enhancements/tree/6d3210f7dd5d547c8f7f6a33af6a09eb45193cd7/keps/sig-node/2254-cgroup-v2#cgroup-namespace

Signed-off-by: Dan Duvall <dduvall@wikimedia.org>

@marxarelli force-pushed the review/unshare-cgroupns-entrypoint branch from 8faed50 to 9a1bf2a on January 23, 2026 17:31

@marxarelli
Contributor Author

ping @tonistiigi

@tonistiigi added this to the v0.28.0 milestone on Feb 23, 2026

Member

@tonistiigi left a comment

@marxarelli tiny nit, otherwise we can get this in

Signed-off-by: CrazyMax <1951866+crazy-max@users.noreply.github.com>
@tonistiigi merged commit 4a388e1 into moby:master on Feb 24, 2026
221 of 223 checks passed

Labels

area/hack (building buildkit itself), area/project
