dockerfile: run buildkitd within a cgroup namespace for cgroup v2#6368
Conversation
|
@tonistiigi this is the alternative approach I mentioned in #6343 (comment). Note that I first tried to implement the ns creation and remounting in buildkitd using calls to In any case, using |
3cf93c3 to
7a50ed7
Compare
In the long term can we just extend Kubernetes to support unsharing cgroupns? |
That would be ideal if Kubernetes had a field in FWIW we've been using a custom entrypoint based on this PR for a couple of weeks. No issues so far. |
| EOF | ||
| ENV BUILDKIT_SETUP_CGROUPV2_ROOT=1 | ||
| ENTRYPOINT ["buildkitd"] | ||
| ENTRYPOINT ["/usr/bin/buildkitd-entrypoint"] |
There was a problem hiding this comment.
In the case of docker run --privileged this script does not seem needed, as Docker unshares cgroupns even for privileged mode.
So this entrypoint script should be opt-in.
It should be also marked as a workaround until Kubernetes supports unsharing cgroupns.
There was a problem hiding this comment.
There was a problem hiding this comment.
Thanks for opening that KEP @AkihiroSuda. I added some words of support in the PR discussion.
I'll work on refactoring this PR to be opt-in as you requested.
There was a problem hiding this comment.
In the case of
docker run --privilegedthis script does not seem needed, as Docker unshares cgroupns even for privileged mode. So this entrypoint script should be opt-in. It should be also marked as a workaround until Kubernetes supports unsharing cgroupns.
The entrypoint should already handle this case heuristically by testing whether the current cgroup path is "/" (which is true when run via docker run --privileged unless --cgroupns host is also given). In that case, it skips unshare and the with-cgroupfs-remount wrapper script and just does exec buildkitd.
I could also introduce an additional environment variable if you think that's best, but I personally quite like the heuristic behavior.
There was a problem hiding this comment.
Just added a comment about it being a workaround.
7a50ed7 to
4f59d47
Compare
tonistiigi
left a comment
There was a problem hiding this comment.
Can't we reuse the BUILDKIT_SETUP_CGROUPV2_ROOT and build this into buildkitd (or a different "UNSHARE" env)? Or if not possible then I think I would prefer these as separate scripts in hack rather than inline heredocs.
I actually tried an implementation in I will move the scripts into hack. |
4f59d47 to
8faed50
Compare
Introduce a new entrypoint script for the Linux image that, if cgroup v2
is in use, creates a new cgroup and mount namespace for buildkitd within
a new entrypoint using `unshare` and remounts `/sys/fs/cgroup` to
restrict its view of the unified cgroup hierarchy. This will ensure its
`init` cgroup and all OCI worker managed cgroups are kept beneath the
root cgroup of the initial entrypoint process.
When buildkitd is run in a managed environment like Kubernetes without
its own cgroup namespace (the default behavior of privileged pods in
Kubernetes where cgroup v2 is in use; see [cgroup v2 KEP][kep]), the OCI
worker will spawn processes in cgroups that are outside of the cgroup
hierarchy that was created for the buildkitd container, leading to
incorrect resource accounting and enforcement which in turn can cause
OOM errors and CPU contention on the node.
Example behavior without this change:
```console
root@k8s-node:/# cat /proc/$(pgrep -n buildkitd)/cgroup
0::/init
root@k8s-node:/# cat /proc/$(pgrep -n some-build-process)/cgroup
0::/buildkit/{runc-container-id}
```
Example behavior with this change:
```console
root@k8s-node:/# cat /proc/$(pgrep -n buildkitd)/cgroup
0::/kubepods/burstable/pod{pod-id}/{container-id}/init
root@k8s-node:/# cat /proc/$(pgrep -n some-build-process)/cgroup
0::/kubepods/burstable/pod{pod-id}/{container-id}/buildkit/{runc-container-id}
```
Note this was developed as an alternative approach to moby#6343
[kep]: https://github.com/kubernetes/enhancements/tree/6d3210f7dd5d547c8f7f6a33af6a09eb45193cd7/keps/sig-node/2254-cgroup-v2#cgroup-namespace
Signed-off-by: Dan Duvall <dduvall@wikimedia.org>
8faed50 to
9a1bf2a
Compare
|
ping @tonistiigi |
tonistiigi
left a comment
There was a problem hiding this comment.
@marxarelli tiny nit, otherwise we can get this in
Signed-off-by: CrazyMax <1951866+crazy-max@users.noreply.github.com>
Introduce a new entrypoint script for the Linux image that, if cgroup v2 is in use, creates a new cgroup and mount namespace for buildkitd within a new entrypoint using
unshareand remounts/sys/fs/cgroupto restrict its view of the unified cgroup hierarchy. This will ensure itsinitcgroup and all OCI worker managed cgroups are kept beneath the root cgroup of the initial entrypoint process.When buildkitd is run in a managed environment like Kubernetes without its own cgroup namespace (the default behavior of privileged pods in Kubernetes where cgroup v2 is in use; see cgroup v2 KEP), the OCI worker will spawn processes in cgroups that are outside of the cgroup hierarchy that was created for the buildkitd container, leading to incorrect resource accounting and enforcement which in turn can cause OOM errors and CPU contention on the node.
Example behavior without this change:
Example behavior with this change:
Note this was developed as an alternative approach to #6343