Skip to content

Fix block and ref rootfs#572

Open
cmainas wants to merge 10 commits intomainfrom
fix_block_and_ref_rootfs
Open

Fix block and ref rootfs#572
cmainas wants to merge 10 commits intomainfrom
fix_block_and_ref_rootfs

Conversation

@cmainas
Copy link
Copy Markdown
Contributor

@cmainas cmainas commented Apr 23, 2026

Description

This PR takes care of various bugs we had with the handling of the rootfs, but all related to the related issue below. The main problem was that when the reexec process was unmounting the block-based snapshot which was mounted as the container;s rootfs in the host, the unmounting was not propagating to the other mount namespaces (especially the initial one where the mounting of the snapshot was taking place).

The thing is that usually the container;s rootfs in the host is mounted with MS_SHARED propagation flag. Therefore all mount / unmount events taking place in subsequent mount namespaces (or peer mount groups) should also propagate. In simple words the unmount in reexec should propagate, except if something was cutting of the propagation.

Well, indeed something was cutting off the propagation (See https://github.com/urunc-dev/urunc/blob/main/pkg/unikontainers/unikontainers.go#L341). As part of preparing the rootfs for the execution environment of the sandbox monitor, we had to set the propagation flags according to the spec we received for the new mount namespace and make sure that the parent of the monitor's rootfs in the host will be MS_PRIVATE. This is necessary for pivot to work.

All the above are necessary steps, but they cut off the mount propagation of mounts and when we perform the unmount later in https://github.com/urunc-dev/urunc/blob/main/pkg/unikontainers/unikontainers.go#L362 the unmount will never propagate. Subsequently the solution is simple: make sure the rootfs of the monitor process in the host does have MS_SHARED as propagation flag and unmount before cutting off the mount propagation.

Furthermore, to avoid any further surprises with block based mounts and rootfs, a simple check is added in getMountInfo to always exclude mount sources which appear more than once in /proc/self/mountinfo.

Due to the high cognitive complexity of Exec a quick fix would add extra complexity as shown in the 3rd commit. Since we are already planning to perform a major refactor for Exec and unikontainers, I brought parts of this refactor in this PR, in order to help with the further refactoring later.

At last, there was a bug in checking for a specific namespace since findNS was also checking for the path of the namespace in /proc and hence failing for new namespaces that the runtime needs to create. As a result, when we had to create a mount namespace we were not pivoting but chroot.

Related issues

How was this tested?

With the end-to-end tests and to make sure the unmounting propagates, after running a urunc container which can accept a block-based device as the sandbox's rootfs the mount command should not show this block image mounted.

LLM usage

N/A

Checklist

  • I have read the contribution guide.
  • The linter passes locally (make lint).
  • The e2e tests of at least one tool pass locally (make test_ctr, make test_nerdctl, make test_docker, make test_crictl).
  • If LLMs were used: I have read the llm policy.

To determine whether the container rootfs or a container mount is
block-based, urunc retrieves mount information from
/proc/self/mountinfo. The corresponding function (`getMountInfo`)
currently returns only the mount source, mount point, and filesystem
type. Based on this information, urunc checks whether the guest
supports the filesystem type, and if so, attaches the mount source
as a block device to the sandbox.

However, there are cases where multiple mount entries appear to share
the same source. An exmaple is bind mounts where the mount source of a
bind mount appears to be the underlying device of the source path.

To handle this, this commit extends `getMountInfo` with an additional
check. It collects the mount sources of all non-special filesystems into
a map. The source of the mount corresponding to the target path is
intentionally excluded from this map. If at the end of the searching the
source of the target mount does not exist in the map, it is considered
safe to attach as a block device. If it does exist, it is not safe to
use, since the same source is mounted elsewhere or it is not a
block-based mount.

Signed-off-by: Charalampos Mainas <cmainas@nubificus.co.uk>
As a first step for another refactor of Exec move the chooseRootfs
function to be a mthod of unikontainers. This allows us to call
chooseRootfs from other parts of the code so we can retrieve the rootfs
that will be used. Moreover, it will help us with the refactoring of the
Exec monolith and separate the various parts of the Exec.

Signed-off-by: Charalampos Mainas <cmainas@nubificus.co.uk>
Some actions in the preparation of the block based rootfs for the
sandbox must take place before the preparation of the rootfs for the
execution environment of the monitor. In particular, the unmount of the
block-based container rootfs in the host must take place before the
prepareRoot and prepareMonRootfs functions. This is the reason of the
bug we had where the unmount of the block-based contianer's rootfs in
the host was not propagating to other mount namespaces.

Given the opportunity, define a set of functions related to the handling
of the sandbox's rootfs preparation. This commit only focuses on
block-based ones.

Signed-off-by: Charalampos Mainas <cmainas@nubificus.co.uk>
Following the definition of specific functions for the prearation of the
block-based rootfs for the sandbox, implement the same functions for
initrd based rootfs.

Signed-off-by: Charalampos Mainas <cmainas@nubificus.co.uk>
Following the definition of specific functions for the prearation of the
block-based rootfs for the sandbox, implement the same functions for
shared-fs based rootfs.

Signed-off-by: Charalampos Mainas <cmainas@nubificus.co.uk>
Following the definition of specific functions for the preparation of the
block-based rootfs for the sandbox, implement the same functions for
the cases where the sandbox does not have a rootfs.

Signed-off-by: Charalampos Mainas <cmainas@nubificus.co.uk>
…otfs

Define a generic interface for the preparation of the sandbox's rootfs
and make all rootfs types to follow this interface.

Signed-off-by: Charalampos Mainas <cmainas@nubificus.co.uk>
Move the tmpfs creation in the rootfs for the execution environment of
the monitor process to the postSetup step of the rootfs handling
for all rootfs types.

Signed-off-by: Charalampos Mainas <cmainas@nubificus.co.uk>
Decouple the check for a valid path from findNS and check for a valid
anmespace path after the return of findNS. The function findNS is also
used to check if a specific namespaces was declared in the container
configuration. Furthermore, the check for a valid path had to take place
also for the cases where the path was empty and we were checking for a
namespace of a specific process ID.

Signed-off-by: Charalampos Mainas <cmainas@nubificus.co.uk>
@netlify
Copy link
Copy Markdown

netlify Bot commented Apr 23, 2026

Deploy Preview for urunc canceled.

Name Link
🔨 Latest commit da92017
🔍 Latest deploy log https://app.netlify.com/projects/urunc/deploys/69eb3ec2ebd2e3000853eb37

@cmainas cmainas force-pushed the fix_block_and_ref_rootfs branch 7 times, most recently from 775b50c to 394b124 Compare April 24, 2026 09:38
Ensure that the container's rootfs has the MS_SHARED propagation flag.
THis is necessary in order to propagate any unmounts that might happen
later (e.g. in the case of block-based snapshots which are attached to
the sandbox) in the mount peer groups (i.e. in the initial
mount namespace).

THere is no problem to do that for every rootfs, because reexec will
later cut off the propagation during the preparation of the monitor;s
rootfs.

In the future though, we need to move this up in the shim.

Signed-off-by: Charalampos Mainas <cmainas@nubificus.co.uk>
@cmainas cmainas force-pushed the fix_block_and_ref_rootfs branch from 394b124 to da92017 Compare April 24, 2026 09:58
@cmainas cmainas marked this pull request as ready for review April 24, 2026 11:34
@cmainas cmainas requested a review from ananos April 24, 2026 11:34
@ananos
Copy link
Copy Markdown
Contributor

ananos commented May 1, 2026

thanks @cmainas.

The MS_SHARED fix in InitialSetup is correct — verified that the block rootfs unmount now propagates and #562 is resolved.

However, the findNS change is doing the opposite of what the description says. Quick recap of the description:

"when we had to create a mount namespace we were not pivoting but chroot".

I built both branches (main & PR) with a probe at the call site and ran rumprun-hvt via ctr run:

  main:    findNS(mount)=("", err="the namespace does not exist")             
           withPivot=true  -> PIVOT_ROOT                            
  this PR: findNS(mount)=("", err=<nil>)                                      
           withPivot=false -> CHROOT

Same OCI spec in both runs:

namespaces=[{Type:pid} {Type:ipc} {Type:uts} {Type:mount} {Type:network}]

The standard containerd shape with empty Path on the mount ns.

main was already pivoting in that case; this PR changes it to chroot.

Cause: findNS no longer errors on empty Path, but the call site in Exec is unmodified:

    _, err = findNS(u.Spec.Linux.Namespaces, specs.MountNamespace)  
    withPivot := err != nil   // was a side-effect of the old findNS contract

The comment above this block was already inverted from the code on main, so
the original intent isn't fully clear. I think you want:

    nsPath, err := findNS(u.Spec.Linux.Namespaces, specs.MountNamespace)
    // own the ns (in spec, no path) -> pivot; joining an existing ns -> chroot                                                                               
    withPivot := err == nil && nsPath == ""                                   

Would it make sense to split this PR? I'd merge the MS_SHARED + getMountInfo + refactor
parts as-is, and address the findNS / pivot-vs-chroot question in a separate
PR with a test, since it's an isolation-level change rather than a bug fix. Maybe I'm missing something in the process, so feel free to ignore the findNS analysis. Regardless, let's try to expedite the merge!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Block based rootfs is not unmounted from host before attaching to sandbox

2 participants