Re: [PATCH v8 4/5] proc: Relax check of mount visibility

From: Christian Brauner

Date: Tue Feb 17 2026 - 07:00:06 EST

On Fri, Feb 13, 2026 at 11:44:29AM +0100, Alexey Gladkov wrote:
> When /proc is mounted with the subset=pid option, all system files from
> the root of the file system are not accessible in userspace. Only
> dynamic information about processes is available, which cannot be
> hidden with overmount.
>
> For this reason, checking for full visibility is not relevant if
> mounting is performed with the subset=pid option.
>
> Signed-off-by: Alexey Gladkov <legion@xxxxxxxxxx>
> ---
> fs/namespace.c | 29 ++++++++++++++++-------------
> fs/proc/root.c | 17 ++++++++++-------
> include/linux/fs/super_types.h | 2 ++
> 3 files changed, 28 insertions(+), 20 deletions(-)
>
> diff --git a/fs/namespace.c b/fs/namespace.c
> index c58674a20cad..7daa86315c05 100644
> --- a/fs/namespace.c
> +++ b/fs/namespace.c
> @@ -6116,7 +6116,8 @@ static bool mnt_already_visible(struct mnt_namespace *ns,
> /* This mount is not fully visible if it's root directory
> * is not the root directory of the filesystem.
> */
> - if (mnt->mnt.mnt_root != mnt->mnt.mnt_sb->s_root)
> + if (!(sb->s_iflags & SB_I_USERNS_ALLOW_REVEALING) &&
> + mnt->mnt.mnt_root != mnt->mnt.mnt_sb->s_root)
> continue;
>
> /* A local view of the mount flags */
> @@ -6136,18 +6137,20 @@ static bool mnt_already_visible(struct mnt_namespace *ns,
> ((mnt_flags & MNT_ATIME_MASK) != (new_flags & MNT_ATIME_MASK)))
> continue;

There are a few things that I find problematic here.

Even before your change, the mount flags of the first fully visible
procfs mount would be picked up. If the caller was unlucky they could
stumble upon the most restricted procfs mount in the mount namespace
rbtree, leading to weird scenarios where a user cannot write to the
procfs instance they just mounted but could write to another one that
is also in their namespace.

The other thing is that with this change specifically:

	if (!(sb->s_iflags & SB_I_USERNS_ALLOW_REVEALING) &&
	    mnt->mnt.mnt_root != mnt->mnt.mnt_sb->s_root)

we start caring about mount options of even partially exposed procfs
mounts. IOW, if someone had a bind-mount of e.g., /proc/pressure
somewhere that got inherited via CLONE_NEWNS then we suddenly take the
mount options of that into account for a new /proc/<pid>/* only instance.
I think we should continue caring only about procfs mounts that are
visible from their root.
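
To make the effect of that condition concrete, here is a small userspace
model of just the skip test (not kernel code). The function and parameter
names are made up for illustration: allow_revealing stands in for
(sb->s_iflags & SB_I_USERNS_ALLOW_REVEALING) and root_is_sb_root for
(mnt->mnt.mnt_root == mnt->mnt.mnt_sb->s_root).

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Userspace model of the "continue" test in mnt_already_visible().
 * All names here are hypothetical; only the boolean structure is
 * taken from the quoted hunk.
 */

/* Before the patch: any mount not rooted at s_root is skipped. */
static bool skip_before(bool root_is_sb_root)
{
	return !root_is_sb_root;
}

/* After the patch: the skip only applies while the flag is clear. */
static bool skip_after(bool allow_revealing, bool root_is_sb_root)
{
	return !allow_revealing && !root_is_sb_root;
}

/* Walk all four input combinations and compare both versions. */
static void check_model(void)
{
	/* Flag clear: behaviour is unchanged. */
	assert(skip_after(false, false) == skip_before(false));
	assert(skip_after(false, true) == skip_before(true));

	/* Flag set, mount rooted at s_root: still not skipped. */
	assert(skip_after(true, true) == skip_before(true));

	/* Flag set, partially exposed mount: skipped before, considered now. */
	assert(skip_before(false) && !skip_after(true, false));
}
```

The last case is the one I'm worried about: a bind-mount such as
/proc/pressure has root_is_sb_root == false, so once the flag is set it is
no longer skipped and its mount flags reach the checks that follow.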

The other problem is that it is really annoying that we walk all
mounts in a mount namespace just to find procfs and sysfs mounts in
there. Currently a lot of workloads still do the CLONE_NEWNS dance,
meaning they inherit all the crap from the host and then proceed to
set up their new rootfs. For busy container workloads that can be a lot.

So let's just be honest about it and treat procfs and sysfs as the
snowflakes that they have become and record their instances in a
separate per mount namespace hlist as in the (untested) patch below [1].

Also, SB_I_USERNS_ALLOW_REVEALING seems unnecessary. The only time we
care about that flag is when we set up a new superblock. So this could
easily be a struct fs_context bitfield that just exists for the duration
of the creation of the new superblock and mount. So maybe pass that down
to mount_too_revealing() and further down into the actual helper.

[1]: