Re: [PATCH v9 4/5] proc: Skip the visibility check if subset=pid is used

From: Aleksa Sarai

Date: Thu Apr 16 2026 - 11:04:38 EST


On 2026-04-16, Christian Brauner <brauner@xxxxxxxxxx> wrote:
> On Thu, Apr 16, 2026 at 10:46:50PM +1000, Aleksa Sarai wrote:
> > On 2026-04-16, Aleksa Sarai <cyphar@xxxxxxxxxx> wrote:
> > > On 2026-04-13, Alexey Gladkov <legion@xxxxxxxxxx> wrote:
> > > > When procfs is mounted with the subset=pid option, all system files and
> > > > directories from the root of the filesystem are not accessible in
> > > > userspace. Only dynamic information about processes is available, which
> > > > cannot be hidden with overmount.
> > > >
> > > > For this reason, checking for full visibility is not relevant if mounting
> > > > is performed with the subset=pid option.
> > > >
> > > > Signed-off-by: Alexey Gladkov <legion@xxxxxxxxxx>
> > > > ---
> > >
> > > > -static bool mount_too_revealing(const struct super_block *sb, int *new_mnt_flags)
> > > > +static bool mount_too_revealing(struct fs_context *fc, int *new_mnt_flags)
> > > > {
> > > > const unsigned long required_iflags = SB_I_NOEXEC | SB_I_NODEV;
> > > > struct mnt_namespace *ns = current->nsproxy->mnt_ns;
> > > > + const struct super_block *sb = fc->root->d_sb;
> > > > unsigned long s_iflags;
> > > >
> > > > if (ns->user_ns == &init_user_ns)
> > > > @@ -6388,7 +6387,7 @@ static bool mount_too_revealing(const struct super_block *sb, int *new_mnt_flags
> > > > return true;
> > > > }
> > > >
> > > > - return !mnt_already_visible(ns, sb, new_mnt_flags);
> > > > + return (!fc->skip_visibility && !mnt_already_visible(ns, sb, new_mnt_flags));
> > > > }
> > >
> > > Unless I'm missing something (I haven't tested this locally yet, sorry),
> > > this will allow you to bypass mount_too_revealing() even for
> > > non-subset=pid mounts because once you create a subset=pid mount then a
> > > regular procfs mount will see the subset=pid mount and permit it.
> > >
> > > I think the solution is quite simple -- you can also skip super-blocks
> > > that have fc->skip_visibility set in mnt_already_visible().
> >
> > I now see that check was present in v8 but I guess its importance wasn't
> > obvious. I guess this means we will need to reintroduce
> > SB_I_USERNS_ALLOW_REVEALING. :/
>
> I've been playing with something else. So first we should move
> SB_I_USERNS_REVEALING to an fs_type flag. It's not an optional thing and
> always set and never removed. That also means we can simplify
> sysfs_get_tree() to just kernfs_get_tree().

Seems quite reasonable to me.

> And then we raise SB_I_USERNS_RESTRICTED on all procfs mounts with
> pid_only and disallow using them for calculating mount permissions for
> unrestricted procfs mounts.

I think that's fine here and is probably all we need for now (though I
think that the name is a little confusing especially given my next
comment), but I think there is a bit of a deeper problem here that
deserves to be mentioned if only for posterity.

The core issue really is that we actually have two versions of procfs
now, and they are effectively distinguished by fs_context flags instead
of fs_type. Back when subset=pid was merged there was a discussion about
a hypothetical proc2 -- while that got dropped, we are kind of
reinventing this split in an ad-hoc way.

Conceptually considering them as two distinct fs_types would be the more
semantically correct thing to do, though I think it would be overkill to
come up with some framework for that. (I also really hope we don't have
any other filesystems where this is also the case.)

This is why I think the name is a little weird, since it isn't an issue
about one procfs being restricted in a userns, it's that they are
conceptually unrelated filesystem types. A very lightweight version of
this would be making it something like SB_I_RESTRICTED_VARIANT that
would be more a little more strict than what you have now --
SB_I_RESTRICTED_VARIANT would not be treated as compatible with anything
(even another SB_I_RESTRICTED_VARIANT mount) so that if we add more
variants in the future we don't treat them the same either. (Again, I
don't think we need this now?) Of course this would still allow
subset=pid without a procfs mount because of the other special casing
for it in mount_too_revealing().

We are also limited in what semantics we can change here too (and I'm a
little worried that even this bit for SB_I_USERNS_RESTRICTED is at risk
of breaking something) so maybe that isn't needed today.

--
Aleksa Sarai
https://www.cyphar.com/

Attachment: signature.asc
Description: PGP signature