Re: [PATCH v9 4/5] proc: Skip the visibility check if subset=pid is used

From: Christian Brauner

Date: Tue Apr 21 2026 - 07:52:25 EST

On Fri, Apr 17, 2026 at 01:03:46AM +1000, Aleksa Sarai wrote:
> On 2026-04-16, Christian Brauner <brauner@xxxxxxxxxx> wrote:
> > On Thu, Apr 16, 2026 at 10:46:50PM +1000, Aleksa Sarai wrote:
> > > On 2026-04-16, Aleksa Sarai <cyphar@xxxxxxxxxx> wrote:
> > > > On 2026-04-13, Alexey Gladkov <legion@xxxxxxxxxx> wrote:
> > > > > When procfs is mounted with the subset=pid option, all system files and
> > > > > directories from the root of the filesystem are not accessible in
> > > > > userspace. Only dynamic information about processes is available, which
> > > > > cannot be hidden with overmount.
> > > > >
> > > > > For this reason, checking for full visibility is not relevant if mounting
> > > > > is performed with the subset=pid option.
> > > > >
> > > > > Signed-off-by: Alexey Gladkov <legion@xxxxxxxxxx>
> > > > > ---
> > > >
> > > > > -static bool mount_too_revealing(const struct super_block *sb, int *new_mnt_flags)
> > > > > +static bool mount_too_revealing(struct fs_context *fc, int *new_mnt_flags)
> > > > > {
> > > > > const unsigned long required_iflags = SB_I_NOEXEC | SB_I_NODEV;
> > > > > struct mnt_namespace *ns = current->nsproxy->mnt_ns;
> > > > > + const struct super_block *sb = fc->root->d_sb;
> > > > > unsigned long s_iflags;
> > > > >
> > > > > if (ns->user_ns == &init_user_ns)
> > > > > @@ -6388,7 +6387,7 @@ static bool mount_too_revealing(const struct super_block *sb, int *new_mnt_flags
> > > > > return true;
> > > > > }
> > > > >
> > > > > - return !mnt_already_visible(ns, sb, new_mnt_flags);
> > > > > + return (!fc->skip_visibility && !mnt_already_visible(ns, sb, new_mnt_flags));
> > > > > }
> > > >
> > > > Unless I'm missing something (I haven't tested this locally yet, sorry),
> > > > this will allow you to bypass mount_too_revealing() even for
> > > > non-subset=pid mounts because once you create a subset=pid mount then a
> > > > regular procfs mount will see the subset=pid mount and permit it.
> > > >
> > > > I think the solution is quite simple -- you can also skip super-blocks
> > > > that have fc->skip_visibility set in mnt_already_visible().
> > >
> > > I now see that check was present in v8 but I guess its importance wasn't
> > > obvious. I guess this means we will need to reintroduce
> > > SB_I_USERNS_ALLOW_REVEALING. :/
> >
> > I've been playing with something else. So first we should move
> > SB_I_USERNS_REVEALING to an fs_type flag. It's not an optional thing and
> > always set and never removed. That also means we can simplify
> > sysfs_get_tree() to just kernfs_get_tree().
>
> Seems quite reasonable to me.
>
> > And then we raise SB_I_USERNS_RESTRICTED on all procfs mounts with
> > pid_only and disallow using them for calculating mount permissions for
> > unrestricted procfs mounts.
>
> I think that's fine here and is probably all we need for now (though I
> think that the name is a little confusing especially given my next
> comment), but I think there is a bit of a deeper problem here that
> deserves to be mentioned if only for posterity.
>
> The core issue really is that we actually have two versions of procfs
> now, and they are effectively distinguished by fs_context flags instead
> of fs_type. Back when subset=pid was merged there was a discussion about
> a hypothetical proc2 -- while that got dropped, we are kind of
> reinventing this split in an ad-hoc way.
>
> Conceptually considering them as two distinct fs_types would be the more
> semantically correct thing to do, though I think it would be overkill to
> come up with some framework for that. (I also really hope we don't have
> any other filesystems where this is also the case.)

Doing this is almost trivial. The problem really is that we'd expose
"procfs2" or whatever as a new mount option to userspace without
actually having fundamentally revamped what we would like procfs2 to do.
IOW, it would elevate a security hack that got added a while ago to a
new filesystem type. I think that would just be sad. So I'm not so
excited about doing this.

> This is why I think the name is a little weird, since it isn't an issue
> about one procfs being restricted in a userns, it's that they are
> conceptually unrelated filesystem types. A very lightweight version of

I somewhat disagree because it's unclear where you draw the line. If you
use the hidepid= stuff at its most restrictive where it only shows tasks
your user owns you could also argue that it's very close to a separate
filesystem type. It's strictly subtractive and so is pidonly. There's
just nothing that's better or novel.

> this would be making it something like SB_I_RESTRICTED_VARIANT that
> would be more a little more strict than what you have now --
> SB_I_RESTRICTED_VARIANT would not be treated as compatible with anything
> (even another SB_I_RESTRICTED_VARIANT mount) so that if we add more
> variants in the future we don't treat them the same either. (Again, I
> don't think we need this now?) Of course this would still allow
> subset=pid without a procfs mount because of the other special casing
> for it in mount_too_revealing().

It's not a problem today. I think we can simply treat
SB_I_RESTRICTED_VARIANT as not allowing other restricted variants. And
we'd need a more elaborate mechanism anyway once we'd have "compatible"
subsets.

> We are also limited in what semantics we can change here too (and I'm a
> little worried that even this bit for SB_I_USERNS_RESTRICTED is at risk
> of breaking something) so maybe that isn't needed today.

So I think yes, SB_I_RESTRICTED_VARIANT is fine by me for now.