Skip to content

Files on Detached Mounts#2754

Draft
bsach64 wants to merge 7 commits intocheckpoint-restore:criu-devfrom
bsach64:detached-mounts
Draft

Files on Detached Mounts#2754
bsach64 wants to merge 7 commits intocheckpoint-restore:criu-devfrom
bsach64:detached-mounts

Conversation

@bsach64
Copy link
Member

@bsach64 bsach64 commented Sep 23, 2025

This PR adds initial support for files on detached bind mounts.
Detached mounts are mounts which are unmounted using the MNT_DETACH flag. This commit adds support for files opened on detached mounts which are a bind mount of a existing mount (which is not detached).

This PR requires the kernel patches in this github branch to be applied:
https://github.com/bsach64/linux/tree/umount-mnt-ns-plus-statx
https://github.com/bsach64/linux/tree/statmount-fd-v2
https://github.com/bsach64/linux/tree/statmount-fd-v3
https://github.com/bsach64/linux/tree/statmount-fd-v4
https://github.com/bsach64/linux/tree/statmount-fd-v5

This PR is pretty draft, we are still to support a lot of things. This work is part of Google Summer of Code 2025.

@bsach64 bsach64 force-pushed the detached-mounts branch 2 times, most recently from 7f3bd9d to cca3849 Compare October 27, 2025 08:20
@Snorch Snorch self-requested a review October 29, 2025 04:41
By "detached mount" we refer to a mount, unmounted with the MNT_DETACH
flag.

Signed-off-by: Bhavik Sachdev <b.sachdev1904@gmail.com>
"anonymous" mounts are those created via open_tree with OPEN_TREE_CLONE
and get "umount" when the fd returned by open_tree is closed.

Signed-off-by: Bhavik Sachdev <b.sachdev1904@gmail.com>
We will be using statmount to add support for files on "unmounted"
mounts. "unmounted" mounts refer to a mount that has been unmounted
using umount2(mnt, MNT_DETACH) but we still have access to files on that
mount.

Signed-off-by: Bhavik Sachdev <b.sachdev1904@gmail.com>
For supporting files on "unmounted" mounts (mounts umounted with
MNT_DETACH), we need support for statmount on fds. Add kerndat check to
check support for statmount on fd.

Signed-off-by: Bhavik Sachdev <b.sachdev1904@gmail.com>
Signed-off-by: Bhavik Sachdev <b.sachdev1904@gmail.com>
Detached mounts are mounts which are unmounted using the MNT_DETACH
flag. This commit adds support for files opened on detached mounts which
are a bind mount of a existing mount (which is not detached)

Signed-off-by: Bhavik Sachdev <b.sachdev1904@gmail.com>
brauner added a commit to brauner/linux that referenced this pull request Dec 4, 2025
Bhavik Sachdev <b.sachdev1904@gmail.com> says:

We would like to add support for checkpoint/restoring file descriptors
open on these "unmounted" mounts to CRIU (Checkpoint/Restore in
Userspace) [1].

Currently, we have no way to get mount info for these "unmounted" mounts
since they do appear in /proc/<pid>/mountinfo and statmount does not
work on them, since they do not belong to any mount namespace.

This patch helps us by providing a way to get mountinfo for these
"unmounted" mounts by using a fd on the mount.

Changes from v6 [2] to v7:
* Add kselftests for STATMOUNT_BY_FD flag.

* Instead of renaming mnt_id_req.mnt_ns_fd to mnt_id_req.fd introduce a
union so struct mnt_id_req looks like this:

    struct mnt_id_req {
            __u32 size;
            union {
                    __u32 mnt_ns_fd;
                    __u32 mnt_fd;
            };
            __u64 mnt_id;
            __u64 param;
            __u64 mnt_ns_id;
    };

* In case of STATMOUNT_BY_FD grab mnt_ns inside of do_statmount(),
since we get mnt_ns from mnt, which should happen under namespace lock.

* Remove the modifications made to grab_requested_mnt_ns, those were
never needed.

Changes from v5 [3] to v6:
* Instead of returning "[unmounted]" as the mount point for "unmounted"
mounts, we unset the STATMOUNT_MNT_POINT flag in statmount.mask.

* Instead of returning 0 as the mnt_ns_id for "unmounted" mounts, we
unset the STATMOUNT_MNT_NS_ID flag in statmount.mask.

* Added comment in `do_statmount` clarifying that the caller sets s->mnt
in case of STATMOUNT_BY_FD.

* In `do_statmount` move the mnt_ns_id and mnt_ns_empty() check just
before lookup_mnt_in_ns().

* We took another look at the capability checks for getting information
for "unmounted" mounts using an fd and decided to remove them for the
following reasons:

  - All fs related information is available via fstatfs() without any
    capability check.

  - Mount information is also available via /proc/pid/mountinfo (without
    any capability check).

  - Given that we have access to a fd on the mount which tells us that
    we had access to the mount at some point (or someone that had access
    gave us the fd). So, we should be able to access mount info.

Changes from v4 [4] to v5:
Check only for s->root.mnt to be NULL instead of checking for both
s->root.mnt and s->root.dentry (I did not find a case where only one of
them would be NULL).

* Only allow system root (CAP_SYS_ADMIN in init_user_ns) to call
statmount() on fd's on "unmounted" mounts. We (mostly Pavel) spent some
time thinking about how our previous approach (of checking the opener's
file credentials) caused problems.

Please take a look at the linked pictures they describe everything more
clearly.

Case 1: A fd is on a normal mount (Link to Picture: [5])
Consider, a situation where we have two processes P1 and P2 and a file
F1. F1 is opened on mount ns M1 by P1. P1 is nested inside user
namespace U1 and U2. P2 is also in U1. P2 is also in a pid namespace and
mount namespace separate from M1.

P1 sends F1 to P2 (using a unix socket). But, P2 is unable to call
statmount() on F1 because since it is a separate pid and mount
namespace. This is good and expected.

Case 2: A fd is on a "unmounted" mount (Link to Picture: [6])
Consider a similar situation as Case 1. But now F1 is on a mounted that
has been "unmounted". Now, since we used openers credentials to check
for permissions P2 ends up having the ability call statmount() and get
mount info for this "unmounted" mount.

Hence, It is better to restrict the ability to call statmount() on fds
on "unmounted" mounts to system root only (There could also be other
cases than the one described above).

Changes from v3 [7] to v4:
* Change the string returned when there is no mountpoint to be
"[unmounted]" instead of "[detached]".
* Remove the new DEFINE_FREE put_file and use the one already present in
include/linux/file.h (fput) [8].
* Inside listmount consistently pass 0 in flags to copy_mnt_id_req and
prepare_klistmount()->grab_requested_mnt_ns() and remove flags from the
prepare_klistmount prototype.
* If STATMOUNT_BY_FD is set, check for mnt_ns_id == 0 && mnt_id == 0.

Changes from v2 [9] to v3:
* Rename STATMOUNT_FD flag to STATMOUNT_BY_FD.
* Fixed UAF bug caused by the reference to fd_mount being bound by scope
of CLASS(fd_raw, f)(kreq.fd) by using fget_raw instead.
* Reused @Spare parameter in mnt_id_req instead of adding new fields to
the struct.

Changes from v1 [10] to v2:
v1 of this patchset, took a different approach and introduced a new
umount_mnt_ns, to which "unmounted" mounts would be moved to (instead of
their namespace being NULL) thus allowing them to be still available via
statmount.

Introducing umount_mnt_ns complicated namespace locking and modified
performance sensitive code [11] and it was agreed upon that fd-based
statmount would be better.

This code is also available on github [12].

[1]: checkpoint-restore/criu#2754
[2]: https://lore.kernel.org/all/20251118084836.2114503-1-b.sachdev1904@gmail.com/
[3]: https://lore.kernel.org/criu/20251109053921.1320977-2-b.sachdev1904@gmail.com/T/#u
[4]: https://lore.kernel.org/all/20251029052037.506273-2-b.sachdev1904@gmail.com/
[5]: https://github.com/bsach64/linux/blob/statmount-fd-v5/fd_on_normal_mount.png
[6]: https://github.com/bsach64/linux/blob/statmount-fd-v5/file_on_unmounted_mount.png
[7]: https://lore.kernel.org/all/20251024181443.786363-1-b.sachdev1904@gmail.com/
[8]: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/include/linux/file.h#n97
[9]: https://lore.kernel.org/linux-fsdevel/20251011124753.1820802-1-b.sachdev1904@gmail.com/
[10]: https://lore.kernel.org/linux-fsdevel/20251002125422.203598-1-b.sachdev1904@gmail.com/
[11]: https://lore.kernel.org/linux-fsdevel/7e4d9eb5-6dde-4c59-8ee3-358233f082d0@virtuozzo.com/
[12]: https://github.com/bsach64/linux/tree/statmount-fd-v7

* patches from https://patch.msgid.link/20251129091455.757724-1-b.sachdev1904@gmail.com:
  selftests: statmount: tests for STATMOUNT_BY_FD
  statmount: accept fd as a parameter
  statmount: permission check should return EPERM

Link: https://patch.msgid.link/20251129091455.757724-1-b.sachdev1904@gmail.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
externalmirrors-syncer bot pushed a commit to external-mirrors/linux-next that referenced this pull request Dec 15, 2025
Bhavik Sachdev <b.sachdev1904@gmail.com> says:

We would like to add support for checkpoint/restoring file descriptors
open on these "unmounted" mounts to CRIU (Checkpoint/Restore in
Userspace) [1].

Currently, we have no way to get mount info for these "unmounted" mounts
since they do appear in /proc/<pid>/mountinfo and statmount does not
work on them, since they do not belong to any mount namespace.

This patch helps us by providing a way to get mountinfo for these
"unmounted" mounts by using a fd on the mount.

Changes from v6 [2] to v7:
* Add kselftests for STATMOUNT_BY_FD flag.

* Instead of renaming mnt_id_req.mnt_ns_fd to mnt_id_req.fd introduce a
union so struct mnt_id_req looks like this:

    struct mnt_id_req {
            __u32 size;
            union {
                    __u32 mnt_ns_fd;
                    __u32 mnt_fd;
            };
            __u64 mnt_id;
            __u64 param;
            __u64 mnt_ns_id;
    };

* In case of STATMOUNT_BY_FD grab mnt_ns inside of do_statmount(),
since we get mnt_ns from mnt, which should happen under namespace lock.

* Remove the modifications made to grab_requested_mnt_ns, those were
never needed.

Changes from v5 [3] to v6:
* Instead of returning "[unmounted]" as the mount point for "unmounted"
mounts, we unset the STATMOUNT_MNT_POINT flag in statmount.mask.

* Instead of returning 0 as the mnt_ns_id for "unmounted" mounts, we
unset the STATMOUNT_MNT_NS_ID flag in statmount.mask.

* Added comment in `do_statmount` clarifying that the caller sets s->mnt
in case of STATMOUNT_BY_FD.

* In `do_statmount` move the mnt_ns_id and mnt_ns_empty() check just
before lookup_mnt_in_ns().

* We took another look at the capability checks for getting information
for "unmounted" mounts using an fd and decided to remove them for the
following reasons:

  - All fs related information is available via fstatfs() without any
    capability check.

  - Mount information is also available via /proc/pid/mountinfo (without
    any capability check).

  - Given that we have access to a fd on the mount which tells us that
    we had access to the mount at some point (or someone that had access
    gave us the fd). So, we should be able to access mount info.

Changes from v4 [4] to v5:
Check only for s->root.mnt to be NULL instead of checking for both
s->root.mnt and s->root.dentry (I did not find a case where only one of
them would be NULL).

* Only allow system root (CAP_SYS_ADMIN in init_user_ns) to call
statmount() on fd's on "unmounted" mounts. We (mostly Pavel) spent some
time thinking about how our previous approach (of checking the opener's
file credentials) caused problems.

Please take a look at the linked pictures they describe everything more
clearly.

Case 1: A fd is on a normal mount (Link to Picture: [5])
Consider, a situation where we have two processes P1 and P2 and a file
F1. F1 is opened on mount ns M1 by P1. P1 is nested inside user
namespace U1 and U2. P2 is also in U1. P2 is also in a pid namespace and
mount namespace separate from M1.

P1 sends F1 to P2 (using a unix socket). But, P2 is unable to call
statmount() on F1 because since it is a separate pid and mount
namespace. This is good and expected.

Case 2: A fd is on a "unmounted" mount (Link to Picture: [6])
Consider a similar situation as Case 1. But now F1 is on a mounted that
has been "unmounted". Now, since we used openers credentials to check
for permissions P2 ends up having the ability call statmount() and get
mount info for this "unmounted" mount.

Hence, It is better to restrict the ability to call statmount() on fds
on "unmounted" mounts to system root only (There could also be other
cases than the one described above).

Changes from v3 [7] to v4:
* Change the string returned when there is no mountpoint to be
"[unmounted]" instead of "[detached]".
* Remove the new DEFINE_FREE put_file and use the one already present in
include/linux/file.h (fput) [8].
* Inside listmount consistently pass 0 in flags to copy_mnt_id_req and
prepare_klistmount()->grab_requested_mnt_ns() and remove flags from the
prepare_klistmount prototype.
* If STATMOUNT_BY_FD is set, check for mnt_ns_id == 0 && mnt_id == 0.

Changes from v2 [9] to v3:
* Rename STATMOUNT_FD flag to STATMOUNT_BY_FD.
* Fixed UAF bug caused by the reference to fd_mount being bound by scope
of CLASS(fd_raw, f)(kreq.fd) by using fget_raw instead.
* Reused @Spare parameter in mnt_id_req instead of adding new fields to
the struct.

Changes from v1 [10] to v2:
v1 of this patchset, took a different approach and introduced a new
umount_mnt_ns, to which "unmounted" mounts would be moved to (instead of
their namespace being NULL) thus allowing them to be still available via
statmount.

Introducing umount_mnt_ns complicated namespace locking and modified
performance sensitive code [11] and it was agreed upon that fd-based
statmount would be better.

This code is also available on github [12].

[1]: checkpoint-restore/criu#2754
[2]: https://lore.kernel.org/all/20251118084836.2114503-1-b.sachdev1904@gmail.com/
[3]: https://lore.kernel.org/criu/20251109053921.1320977-2-b.sachdev1904@gmail.com/T/#u
[4]: https://lore.kernel.org/all/20251029052037.506273-2-b.sachdev1904@gmail.com/
[5]: https://github.com/bsach64/linux/blob/statmount-fd-v5/fd_on_normal_mount.png
[6]: https://github.com/bsach64/linux/blob/statmount-fd-v5/file_on_unmounted_mount.png
[7]: https://lore.kernel.org/all/20251024181443.786363-1-b.sachdev1904@gmail.com/
[8]: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/include/linux/file.h#n97
[9]: https://lore.kernel.org/linux-fsdevel/20251011124753.1820802-1-b.sachdev1904@gmail.com/
[10]: https://lore.kernel.org/linux-fsdevel/20251002125422.203598-1-b.sachdev1904@gmail.com/
[11]: https://lore.kernel.org/linux-fsdevel/7e4d9eb5-6dde-4c59-8ee3-358233f082d0@virtuozzo.com/
[12]: https://github.com/bsach64/linux/tree/statmount-fd-v7

* patches from https://patch.msgid.link/20251129091455.757724-1-b.sachdev1904@gmail.com:
  selftests: statmount: tests for STATMOUNT_BY_FD
  statmount: accept fd as a parameter
  statmount: permission check should return EPERM

Link: https://patch.msgid.link/20251129091455.757724-1-b.sachdev1904@gmail.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
@github-actions
Copy link

A friendly reminder that this PR had no activity for 30 days.

@bsach64
Copy link
Member Author

bsach64 commented Feb 23, 2026

FYI: Changes required to the kernel to support this feature namely STATMOUNT_BY_FD have been merged and are a part of the upcoming v7.0 release 🥳!
I will start working on this PR now and turn it into a mergeable state!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant