Re: [RFC PATCH v2 2/3] vfs: transitive upgrade restrictions for fds

From: Aleksa Sarai

Date: Wed Apr 01 2026 - 08:27:03 EST

On 2026-03-29, Jori Koolstra <jkoolstra@xxxxxxxxx> wrote:
> > Unfortunately there is a pretty big issue with doing it this way (which
> > I mentioned in one of the changelogs back then[2]): There are lots of
> > VFS operations that imply operations on a file (through a magic-link)
> > that are not blocked. truncate(2) and mount(MS_BIND)/open_tree(2) are
> > the most problematic examples, but this applies to basically any syscall
> > that takes a path argument. If you don't block those then re-opening
> > restrictions are functionally useless.
>
> Ah yes. If I am correct, it would block all the fXXX syscalls from doing
> harm (at least w.r.t. read/write operations) because they use the fmode for
> rights checking on the fd, and this cannot be changed without going through
> an open() variant.
>
> Hence the issue is the case when we pass a magic path to e.g. truncate()
> as it does no upgrade restriction check right now on the struct file. So
> we hade to do this for every relevant syscall. And the question is... where.

They would also be bypassed by AT_EMPTY_PATH, so you don't even need
magic-links to cause the issue. This is why I considered doing it with
"struct path" and doing the blocking in *_permission() -- all of the
standard vfs operations already have .

> > It also would be really nice if you could block more than just trailing
> > component operations -- having a directory file descriptor that blocks
> > lookups could be quite handy for a bunch of reasons.
>
> Yes, so (at the very least) we also want RESTRICT_LOOKUP for directory fds.

Well, definitely not initially. You also don't really need to call it
RESTRICT_LOOKUP, blocking exec would be the "unixy" thing to do (and it
would be nice to have that to block fexecveat(2) too, but that is a
_VERY_ deep rabbithole).

> > I think the only workable solution to block all of these issue entirely
> > and in a comprehensive way is to have something akin to capsicum
> > capabilities[3] tied to file descriptors and have all of the VFS
> > operations check them (though I think that the way this was attempted in
> > the past[4] was far from ideal).
> >
>
> I went through Drysdale's implementation a bit. He links the capability check
> to the translation of an fd to a struct file. I agree this is a bit invasive
> (as he writes himself), and perhaps we can do better. Is this what you mean
> by "far from ideal"?

There were quite a few other issues with it, but that is one of them.

> > I have tried my hand at a few lighter-weight prototypes over the years
> > (mainly trying to add the necessary checks to every generic_permission()
> > call, and adding some more generic_permission() calls as well...). My
> > last prototype was adding the restriction information to "struct path"
> > but that would bloat too many structures to be merge-able. I was
> > planning on looking at this again later this year, but if you can come
> > up with a nice way of getting a minimal version of capsicum working,
> > that'd be amazing. :D
>
> I would really like to try; it is a very nice problem for me to tackle;
> you need to gain experience somehow :)

Sorry, I probably should've couched my reply with a bit more caution --
I really appreciate the enthusiasm, but this is quite a hairy topic for
quite a number of reasons. It will involve touching everything
fd-related in the kernel, it touches on an already-rejected patchset,
and is a topic that can get very heated easily. I would not rate the
chance of getting something merged very highly, and I think there are
more fruitful things for you to tackle and get your hands dirty.

Honestly, I think most people will be more than happy with just getting
the O_EMPTYPATH part merged -- looking back, I really should've just
done that back in 2019. :/

> I wonder how checking all this in generic_permission() would work. The
> access to the fd that the procfs magic link provides is essentially an
> issue of path traversal, and in generic_permission() you just have the
> inode in question. Ah but of course, you can use the mode bits of the
> magic link to encode the information, as you suggest. What downside did
> you encounter using this idea?

It wasn't based on the magic-link mode, I added a capability set to
"struct path" and then went and adjusted all generic_permission()s to
take the new capability mask. For stuff that did file_permission() or
path_permission(), no code change was needed.

Unfortunately this ran into quite a few hairy issues that made it
basically unmergeable (bloating "struct path" is very bad as it is
embedded everywhere, the whole model is kind of questionable because it
makes "struct path" have state about the lookup that produced it, some
VFS APIs deal with inodes or dentries directly and thus would need more
tree-wide fixes, and most importantly it just felt really ugly).

The open_tree(2) / mount(MS_BIND) issue is particularly problematic --
ideally you would want to block a RW bind-mount for a read-only file,
but there isn't really an obvious mechanism for passing down those kinds
of restrictions. Actually, fsconfig() arguments like "upperdir" to
overlayfs (or even "source") are likely even worse -- now you need
per-filesystem-option handling.

I don't think these issues are insurmountable, it's just that the
problem is much harder than it looks at first glance and I would humbly
suggest that working on reviving a very dead patchset is not the best
use of your time. Container runtimes really *really* want this, which is
the main reason I keep coming back to it (and we ended up with it in the
UAPI group proposal page) but I think it's a very niche thing that has a
big risk of being a lot of wasted effort.

--
Aleksa Sarai
https://www.cyphar.com/

Attachment: signature.asc
Description: PGP signature