Re: [PATCH 0/2] proc: protect ptrace_may_access() with exec_update_lock

From: Christian Brauner

Date: Wed Jun 10 2026 - 03:49:51 EST

On Fri, Jun 05, 2026 at 04:34:47PM +0200, Jann Horn wrote:
> On Fri, Jun 5, 2026 at 2:54 PM Eric W. Biederman <ebiederm@xxxxxxxxxxxx> wrote:
> > > On Thu, May 28, 2026 at 3:11 PM Eric W. Biederman <ebiederm@xxxxxxxxxxxx> wrote:
> > >> Jann Horn <jannh@xxxxxxxxxx> writes:
> > >> The issue you point out with memfd's definitely needs to be fixed.
> > >> It should be separated out from the rest of the races simply because
> > >> it is a completely different kind of issue.
> > >>
> > >> I wonder if anyone even anticipated you could open another file handle
> > >> to memfd's through proc. If so leaving everything to path based
> > >> permissions assumed a feature of proc that doesn't exist.
> > >
> > > I don't think memfds are particularly special here, they are just a
> > > nice, clear example of a case where an inode is protected based on
> > > which processes can path-walk to it.
> > >
> > > As another example: Making a directory mode 0700 is also supposed to
> > > prevent other users from accessing things inside it.
> >
> > The simple counter example is that linux has an open by inode
> > facility. That is exposed to nfs and as a syscall.
> >
> > Well strictly speaking the syscalls name_to_handle_at and
> > open_by_handle_at.
> >
> > Most filesystems including shmemfs support those operations.
> > See shmem_export_ops.
> >
> > Which is a long way of saying that if someone can guess
> > the inode and generation number of a memfd inode it can
> > be opened with open_by_handle_at. The usual permission checks
> > are performed but unless I am misreading something the only
> > permission checks that are relevant are the permissions on
> > the inode.
>
> No. open_by_handle_at() is not supposed to let you bypass
> non-executable directories.
>
> Filesystems without a special export_operations::permission handler go through:
>
> do_handle_open -> handle_to_path -> may_decode_fh
>
> which requires that the caller has global CAP_DAC_READ_SEARCH, or is
> capable over the superblock, or is capable over the containing mount.
>
> > >> My gut says the best fix for the entire memfd issue is to simply change
> > >> memfd's and probably everything that calls shmem_file_setup to not have
> > >> an open method. That eliminates any chance anyone will do anything
> > >> clever with proc. But I can't see why it makes any sense to be able to
> > >> open another file handle into memfd's, or anything else that calls
> > >> shmem_file_setup for that matter.
> > >>
> > >> We can first try to remove the open method of memfd's set by
> > >> shmem_file_setup, and if that doesn't work we can look at fixing proc to
> > >> provide the guarantees that were assumed (as a security fix).
> > >>
> > >>
> > >> As a quality of implementation issue I can see fixing the small race
> > >> where when looking up a file descriptor through proc, exec does not
> > >> appear to be an atomic operation. I keep wondering if that is something
> > >> that should be done in get_link or d_revalidate.
> > >
> > > I don't see how d_revalidate would help, that still wouldn't be
> > > atomic.
> >
> > You have to pick the correct one, but in general it is the job of the
> > revalidate methods to find something that is stale and see if it works
> > in the current context. AKA make it look like something that wasn't
> > done atomically behaves semantically as atomically.
>
> d_revalidate refreshes dentries, but it doesn't make anything about
> the underlying inode atomic; and my understanding is that procfs wants
> to avoid tying inodes to things like task_struct or mm_struct to avoid
> keeping those objects alive unnecessarily.
>
> I think d_revalidate would make sense if, for example, we wanted the
> /proc/$pid/maps inode to hold a reference to the corresponding
> mm_struct.
>
> > >> I suspect the answer for proc_pid_get_link is to either cache something
> > >> like a seqcount, or simply to repeat the permission and existence checks
> > >> just before calling nd_jump_link.
> > >
> > > That seems like it results in complicated semantics, while a mutex
> > > would provide clear semantics. Which is already what we use in places
> > > like __pidfd_fget() and /proc/<pid>/syscall.
> >
> > An important point to remember when dealing with proc is that it is for
> > implementation purposes a distributed filesystem. Thinking of it like
> > a syscall is wrong.
> >
> > There are lots of places that for correctness reasons (and to not burden
> > the rest of the kernel) proc does something and then validates what
> > it has done is correct. AKA just like a distributed filesystem.
> >
> > As for semantics I am not proposing anything that will have complicated
> > semantics to userspace. It may have a slightly more complicated
> > implementation in proc, but that can save complications in the
> > rest of the kernel.
> >
> > Anything that needs to block exec to block to operate correctly is a
> > real can of worms review wise, and something we have repeated gotten
> > wrong in the past. Something where exec can just do it's thing and
> > proc can still give a point in time correct answer makes the analysis
> > that things won't break much simpler.
>
> Hmm... the reason we got it wrong in the past is that we were using
> cred_guard_mutex, which was held across stuff that can (indirectly)
> wait for ptrace, right? Whereas nowadays we basically only have to
> ensure that we don't block on userspace actions while holding the
> exec_update_lock? Which still requires some care but should be less
> problematic.
>
> > >> >> Question 2.
> > >> >>
> > >> >> How does this race compare to racing with setresuid?
> > >> >> Do we need to fix the setresuid case as well?
> > >> >
> > >> > Which setresuid case? setresuid clears the dumpable flag and has a
> > >> > memory barrier that is supposed to make that properly ordered against
> > >> > ptrace_may_access(); so setresuid() should normally not cause a task
> > >> > to become traceable, though that could maybe happen in weird
> > >> > scenarios.
> > >>
> > >> The cases where the dumpable flag get set are all part of exec.
> > >>
> > >> I was thinking of cases where we have a daemon that is started by
> > >> root and then it changes it's uid to do something.
> > >
> > > (Normally such a daemon would only change its EUID, which is mainly
> > > considered when the daemon acts as a subject, unless it intends to
> > > permanently drop privileges.)
> > >
> > >> Looking at ptrace_may_access the uid based checks won't allow
> > >> accessing of such a task unless it changes all of it's uids.
> > >>
> > >> At which point arguably it is on the process that calls setuid to make
> > >> certain ptracing it won't be a problem. I am not certain that ever
> > >> actually works in practice but that does seem to be what the current
> > >> code is saying.
> > >
> > > When a daemon changes its EUID for some reason, commit_creds() will
> > > change that daemon's dumpability to suid_dumpable, which will prevent
> > > ptracing it even if the UIDs match.
> > >
> > > The ptrace.2 manpage guarantees that a process which is not
> > > SUID_DUMP_USER can't be accessed without CAP_SYS_PTRACE.
> > >
> > >> Now I am wondering if dumpable should get set if setresuid changes
> > >> a uid like I described above.
> > >
> > > What do you mean by "get set"?
> >
> > I was and still am wondering if dumpable should be set if setresuid
> > completely drops all of it's uids. AKA the case where dumpable is not
> > set today.
>
> You mean the case where the EUID is different from RUID and/or SUID,
> and userspace calls setresuid(current_euid, current_euid,
> current_euid)?
>
> > Unless someone wants to do a bunch of work survey such code
> > we should probably wait until a motivating example presents itself.
> >
> >
> > >> > I think another case we should probably care about is what happens if
> > >> > a process which is only protected against ptrace by being non-dumpable
> > >> > goes through execve() - it shouldn't be possible to access resources
> > >> > associated with the pre-execve state while checking against the
> > >> > post-execve dumpability. It might be important for this that the
> > >> > do_close_on_exec() logic currently happens after committing the
> > >> > dumpable state in exec_mmap()...
> > >> >
> > >> >> Question 3.
> > >> >> Do we care about the case when a privileged process calls a setuid
> > >> >> process and drops privileges?
> > >> >
> > >> > I don't understand the question. Hmm - do you mean a case where a
> > >> > process with ruid=1000, euid=0, suid=1000 does execve() on a setuid
> > >> > 1000 binary? I think we probably don't specifically care about that...
> > >> >
> > >>
> > >> The general case would be a daemon running as root forks and exec's a
> > >> binary running as some unprivileged user fred.
> > >>
> > >> Mostly I bring it up is that it is easy to forget suid exec can drop
> > >> privileges as well as raise them.
> > >
> > > I do not understand the scenario you're describing. Can you give a
> > > specific example you're thinking of - what the ruid/euid/suid would be
> > > before execve(), and which mode the binary would be?
> >
> >
> > A process P1 with uid=euid=ruid=fsuid=0.
> >
> > A user fred with uid=1000.
> >
> > A binary B with uid=1000 (aka fred) gid=1000 (aka fred) that -r-sr-xr-x.
> > AKA everyone can read and execute the binary, and the setuid bit is set.
> >
> > Then P1 execs B.
>
> I think I see your point - in this case, if P1 calls
> setresuid(current_euid, current_euid, current_euid), fred would
> afterwards be allowed to ptrace P1.
>
> That feels to me like it is a bit weird, but probably not very
> problematic because fred owns the file; P1 is already running code
> owned by fred. Though fred can't necessarily actually change the file
> contents, if the file is immutable or the mount is readonly or such.
>
> I... think it is also rare for setuid binaries to call
> setresuid(current_euid, current_euid, current_euid), since normally it
> is desired that they keep the original RUID for stuff like
> signal-sending permissions? But I might be wrong about that.

Yes, I found very few binaries that do actually do a full setresuid().

>
> > p.s. My personal opinion is changing permissions upon exec is a dumb
> > idea. It might be worth doing the work to drop that support and to
> > update userspace to just connect to more privileged daemons when more
> > permissions are needed. AKA ssh instead of su.
>
> I agree that it would be nicer to get rid of that... I remember amluto
> saying the same thing.
>
> Lennart Poettering thinks that, too, and has added the sudo
> replacement "run0" in systemd, which works by talking to systemd and
> authenticating via polkit:
> https://mastodon.social/@pid_eins/112353324518585654

We have been working on a sudo replacement that requests permission from
a more privileged process via ipc for a long time. We've implemented it
and it speaks varlink and run0 can be used as a drop-in replacement for
sudo. We keep adding the bits and pieces to function as a sudo
replacement in most cases:

https://github.com/systemd/systemd/pull/42465

setuid will die (outside of unprivileged containers) and we're getting
very close to making that happen.