Conversation
To determine whether the container rootfs or a container mount is block-based, urunc retrieves mount information from /proc/self/mountinfo. The corresponding function (`getMountInfo`) currently returns only the mount source, mount point, and filesystem type. Based on this information, urunc checks whether the guest supports the filesystem type, and if so, attaches the mount source as a block device to the sandbox. However, there are cases where multiple mount entries appear to share the same source. An exmaple is bind mounts where the mount source of a bind mount appears to be the underlying device of the source path. To handle this, this commit extends `getMountInfo` with an additional check. It collects the mount sources of all non-special filesystems into a map. The source of the mount corresponding to the target path is intentionally excluded from this map. If at the end of the searching the source of the target mount does not exist in the map, it is considered safe to attach as a block device. If it does exist, it is not safe to use, since the same source is mounted elsewhere or it is not a block-based mount. Signed-off-by: Charalampos Mainas <cmainas@nubificus.co.uk>
As a first step for another refactor of Exec move the chooseRootfs function to be a mthod of unikontainers. This allows us to call chooseRootfs from other parts of the code so we can retrieve the rootfs that will be used. Moreover, it will help us with the refactoring of the Exec monolith and separate the various parts of the Exec. Signed-off-by: Charalampos Mainas <cmainas@nubificus.co.uk>
Some actions in the preparation of the block based rootfs for the sandbox must take place before the preparation of the rootfs for the execution environment of the monitor. In particular, the unmount of the block-based container rootfs in the host must take place before the prepareRoot and prepareMonRootfs functions. This is the reason of the bug we had where the unmount of the block-based contianer's rootfs in the host was not propagating to other mount namespaces. Given the opportunity, define a set of functions related to the handling of the sandbox's rootfs preparation. This commit only focuses on block-based ones. Signed-off-by: Charalampos Mainas <cmainas@nubificus.co.uk>
Following the definition of specific functions for the prearation of the block-based rootfs for the sandbox, implement the same functions for initrd based rootfs. Signed-off-by: Charalampos Mainas <cmainas@nubificus.co.uk>
Following the definition of specific functions for the prearation of the block-based rootfs for the sandbox, implement the same functions for shared-fs based rootfs. Signed-off-by: Charalampos Mainas <cmainas@nubificus.co.uk>
Following the definition of specific functions for the preparation of the block-based rootfs for the sandbox, implement the same functions for the cases where the sandbox does not have a rootfs. Signed-off-by: Charalampos Mainas <cmainas@nubificus.co.uk>
…otfs Define a generic interface for the preparation of the sandbox's rootfs and make all rootfs types to follow this interface. Signed-off-by: Charalampos Mainas <cmainas@nubificus.co.uk>
Move the tmpfs creation in the rootfs for the execution environment of the monitor process to the postSetup step of the rootfs handling for all rootfs types. Signed-off-by: Charalampos Mainas <cmainas@nubificus.co.uk>
Decouple the check for a valid path from findNS and check for a valid anmespace path after the return of findNS. The function findNS is also used to check if a specific namespaces was declared in the container configuration. Furthermore, the check for a valid path had to take place also for the cases where the path was empty and we were checking for a namespace of a specific process ID. Signed-off-by: Charalampos Mainas <cmainas@nubificus.co.uk>
✅ Deploy Preview for urunc canceled.
|
775b50c to
394b124
Compare
Ensure that the container's rootfs has the MS_SHARED propagation flag. THis is necessary in order to propagate any unmounts that might happen later (e.g. in the case of block-based snapshots which are attached to the sandbox) in the mount peer groups (i.e. in the initial mount namespace). THere is no problem to do that for every rootfs, because reexec will later cut off the propagation during the preparation of the monitor;s rootfs. In the future though, we need to move this up in the shim. Signed-off-by: Charalampos Mainas <cmainas@nubificus.co.uk>
394b124 to
da92017
Compare
|
thanks @cmainas. The MS_SHARED fix in InitialSetup is correct — verified that the block rootfs unmount now propagates and #562 is resolved. However, the findNS change is doing the opposite of what the description says. Quick recap of the description:
I built both branches (main & PR) with a probe at the call site and ran rumprun-hvt via Same OCI spec in both runs: The standard containerd shape with empty Path on the mount ns. main was already pivoting in that case; this PR changes it to chroot. Cause: findNS no longer errors on empty Path, but the call site in Exec is unmodified: The comment above this block was already inverted from the code on main, so Would it make sense to split this PR? I'd merge the MS_SHARED + getMountInfo + refactor |
Description
This PR takes care of various bugs we had with the handling of the rootfs, but all related to the related issue below. The main problem was that when the
reexecprocess was unmounting the block-based snapshot which was mounted as the container;s rootfs in the host, the unmounting was not propagating to the other mount namespaces (especially the initial one where the mounting of the snapshot was taking place).The thing is that usually the container;s rootfs in the host is mounted with
MS_SHAREDpropagation flag. Therefore all mount / unmount events taking place in subsequent mount namespaces (or peer mount groups) should also propagate. In simple words the unmount inreexecshould propagate, except if something was cutting of the propagation.Well, indeed something was cutting off the propagation (See https://github.com/urunc-dev/urunc/blob/main/pkg/unikontainers/unikontainers.go#L341). As part of preparing the rootfs for the execution environment of the sandbox monitor, we had to set the propagation flags according to the spec we received for the new mount namespace and make sure that the parent of the monitor's rootfs in the host will be
MS_PRIVATE. This is necessary for pivot to work.All the above are necessary steps, but they cut off the mount propagation of mounts and when we perform the unmount later in https://github.com/urunc-dev/urunc/blob/main/pkg/unikontainers/unikontainers.go#L362 the unmount will never propagate. Subsequently the solution is simple: make sure the rootfs of the monitor process in the host does have
MS_SHAREDas propagation flag and unmount before cutting off the mount propagation.Furthermore, to avoid any further surprises with block based mounts and rootfs, a simple check is added in
getMountInfoto always exclude mount sources which appear more than once in/proc/self/mountinfo.Due to the high cognitive complexity of Exec a quick fix would add extra complexity as shown in the 3rd commit. Since we are already planning to perform a major refactor for Exec and unikontainers, I brought parts of this refactor in this PR, in order to help with the further refactoring later.
At last, there was a bug in checking for a specific namespace since
findNSwas also checking for the path of the namespace in/procand hence failing for new namespaces that the runtime needs to create. As a result, when we had to create a mount namespace we were not pivoting but chroot.Related issues
How was this tested?
With the end-to-end tests and to make sure the unmounting propagates, after running a
urunccontainer which can accept a block-based device as the sandbox's rootfs themountcommand should not show this block image mounted.LLM usage
N/A
Checklist
make lint).make test_ctr,make test_nerdctl,make test_docker,make test_crictl).