Re: [PATCH 0/2] proc: protect ptrace_may_access() with exec_update_lock

From: Jann Horn

Date: Fri Jun 05 2026 - 10:37:46 EST

On Fri, Jun 5, 2026 at 2:54 PM Eric W. Biederman <ebiederm@xxxxxxxxxxxx> wrote:
> > On Thu, May 28, 2026 at 3:11 PM Eric W. Biederman <ebiederm@xxxxxxxxxxxx> wrote:
> >> Jann Horn <jannh@xxxxxxxxxx> writes:
> >> The issue you point out with memfd's definitely needs to be fixed.
> >> It should be separated out from the rest of the races simply because
> >> it is a completely different kind of issue.
> >>
> >> I wonder if anyone even anticipated you could open another file handle
> >> to memfd's through proc. If so leaving everything to path based
> >> permissions assumed a feature of proc that doesn't exist.
> >
> > I don't think memfds are particularly special here, they are just a
> > nice, clear example of a case where an inode is protected based on
> > which processes can path-walk to it.
> >
> > As another example: Making a directory mode 0700 is also supposed to
> > prevent other users from accessing things inside it.
>
> The simple counter example is that linux has an open by inode
> facility. That is exposed to nfs and as a syscall.
>
> Well strictly speaking the syscalls name_to_handle_at and
> open_by_handle_at.
>
> Most filesystems including shmemfs support those operations.
> See shmem_export_ops.
>
> Which is a long way of saying that if someone can guess
> the inode and generation number of a memfd inode it can
> be opened with open_by_handle_at. The usual permission checks
> are performed but unless I am misreading something the only
> permission checks that are relevant are the permissions on
> the inode.

No. open_by_handle_at() is not supposed to let you bypass
non-executable directories.

Filesystems without a special export_operations::permission handler go through:

do_handle_open -> handle_to_path -> may_decode_fh

which requires that the caller has global CAP_DAC_READ_SEARCH, or is
capable over the superblock, or is capable over the containing mount.

> >> My gut says the best fix for the entire memfd issue is to simply change
> >> memfd's and probably everything that calls shmem_file_setup to not have
> >> an open method. That eliminates any chance anyone will do anything
> >> clever with proc. But I can't see why it makes any sense to be able to
> >> open another file handle into memfd's, or anything else that calls
> >> shmem_file_setup for that matter.
> >>
> >> We can first try to remove the open method of memfd's set by
> >> shmem_file_setup, and if that doesn't work we can look at fixing proc to
> >> provide the guarantees that were assumed (as a security fix).
> >>
> >>
> >> As a quality of implementation issue I can see fixing the small race
> >> where when looking up a file descriptor through proc, exec does not
> >> appear to be an atomic operation. I keep wondering if that is something
> >> that should be done in get_link or d_revalidate.
> >
> > I don't see how d_revalidate would help, that still wouldn't be
> > atomic.
>
> You have to pick the correct one, but in general it is the job of the
> revalidate methods to find something that is stale and see if it works
> in the current context. AKA make it look like something that wasn't
> done atomically behaves semantically as atomically.

d_revalidate refreshes dentries, but it doesn't make anything about
the underlying inode atomic; and my understanding is that procfs wants
to avoid tying inodes to things like task_struct or mm_struct to avoid
keeping those objects alive unnecessarily.

I think d_revalidate would make sense if, for example, we wanted the
/proc/$pid/maps inode to hold a reference to the corresponding
mm_struct.

> >> I suspect the answer for proc_pid_get_link is to either cache something
> >> like a seqcount, or simply to repeat the permission and existence checks
> >> just before calling nd_jump_link.
> >
> > That seems like it results in complicated semantics, while a mutex
> > would provide clear semantics. Which is already what we use in places
> > like __pidfd_fget() and /proc/<pid>/syscall.
>
> An important point to remember when dealing with proc is that it is for
> implementation purposes a distributed filesystem. Thinking of it like
> a syscall is wrong.
>
> There are lots of places that for correctness reasons (and to not burden
> the rest of the kernel) proc does something and then validates what
> it has done is correct. AKA just like a distributed filesystem.
>
> As for semantics I am not proposing anything that will have complicated
> semantics to userspace. It may have a slightly more complicated
> implementation in proc, but that can save complications in the
> rest of the kernel.
>
> Anything that needs to block exec to block to operate correctly is a
> real can of worms review wise, and something we have repeated gotten
> wrong in the past. Something where exec can just do it's thing and
> proc can still give a point in time correct answer makes the analysis
> that things won't break much simpler.

Hmm... the reason we got it wrong in the past is that we were using
cred_guard_mutex, which was held across stuff that can (indirectly)
wait for ptrace, right? Whereas nowadays we basically only have to
ensure that we don't block on userspace actions while holding the
exec_update_lock? Which still requires some care but should be less
problematic.

> >> >> Question 2.
> >> >>
> >> >> How does this race compare to racing with setresuid?
> >> >> Do we need to fix the setresuid case as well?
> >> >
> >> > Which setresuid case? setresuid clears the dumpable flag and has a
> >> > memory barrier that is supposed to make that properly ordered against
> >> > ptrace_may_access(); so setresuid() should normally not cause a task
> >> > to become traceable, though that could maybe happen in weird
> >> > scenarios.
> >>
> >> The cases where the dumpable flag get set are all part of exec.
> >>
> >> I was thinking of cases where we have a daemon that is started by
> >> root and then it changes it's uid to do something.
> >
> > (Normally such a daemon would only change its EUID, which is mainly
> > considered when the daemon acts as a subject, unless it intends to
> > permanently drop privileges.)
> >
> >> Looking at ptrace_may_access the uid based checks won't allow
> >> accessing of such a task unless it changes all of it's uids.
> >>
> >> At which point arguably it is on the process that calls setuid to make
> >> certain ptracing it won't be a problem. I am not certain that ever
> >> actually works in practice but that does seem to be what the current
> >> code is saying.
> >
> > When a daemon changes its EUID for some reason, commit_creds() will
> > change that daemon's dumpability to suid_dumpable, which will prevent
> > ptracing it even if the UIDs match.
> >
> > The ptrace.2 manpage guarantees that a process which is not
> > SUID_DUMP_USER can't be accessed without CAP_SYS_PTRACE.
> >
> >> Now I am wondering if dumpable should get set if setresuid changes
> >> a uid like I described above.
> >
> > What do you mean by "get set"?
>
> I was and still am wondering if dumpable should be set if setresuid
> completely drops all of it's uids. AKA the case where dumpable is not
> set today.

You mean the case where the EUID is different from RUID and/or SUID,
and userspace calls setresuid(current_euid, current_euid,
current_euid)?

> Unless someone wants to do a bunch of work survey such code
> we should probably wait until a motivating example presents itself.
>
>
> >> > I think another case we should probably care about is what happens if
> >> > a process which is only protected against ptrace by being non-dumpable
> >> > goes through execve() - it shouldn't be possible to access resources
> >> > associated with the pre-execve state while checking against the
> >> > post-execve dumpability. It might be important for this that the
> >> > do_close_on_exec() logic currently happens after committing the
> >> > dumpable state in exec_mmap()...
> >> >
> >> >> Question 3.
> >> >> Do we care about the case when a privileged process calls a setuid
> >> >> process and drops privileges?
> >> >
> >> > I don't understand the question. Hmm - do you mean a case where a
> >> > process with ruid=1000, euid=0, suid=1000 does execve() on a setuid
> >> > 1000 binary? I think we probably don't specifically care about that...
> >> >
> >>
> >> The general case would be a daemon running as root forks and exec's a
> >> binary running as some unprivileged user fred.
> >>
> >> Mostly I bring it up is that it is easy to forget suid exec can drop
> >> privileges as well as raise them.
> >
> > I do not understand the scenario you're describing. Can you give a
> > specific example you're thinking of - what the ruid/euid/suid would be
> > before execve(), and which mode the binary would be?
>
>
> A process P1 with uid=euid=ruid=fsuid=0.
>
> A user fred with uid=1000.
>
> A binary B with uid=1000 (aka fred) gid=1000 (aka fred) that -r-sr-xr-x.
> AKA everyone can read and execute the binary, and the setuid bit is set.
>
> Then P1 execs B.

I think I see your point - in this case, if P1 calls
setresuid(current_euid, current_euid, current_euid), fred would
afterwards be allowed to ptrace P1.

That feels to me like it is a bit weird, but probably not very
problematic because fred owns the file; P1 is already running code
owned by fred. Though fred can't necessarily actually change the file
contents, if the file is immutable or the mount is readonly or such.

I... think it is also rare for setuid binaries to call
setresuid(current_euid, current_euid, current_euid), since normally it
is desired that they keep the original RUID for stuff like
signal-sending permissions? But I might be wrong about that.

> p.s. My personal opinion is changing permissions upon exec is a dumb
> idea. It might be worth doing the work to drop that support and to
> update userspace to just connect to more privileged daemons when more
> permissions are needed. AKA ssh instead of su.

I agree that it would be nicer to get rid of that... I remember amluto
saying the same thing.

Lennart Poettering thinks that, too, and has added the sudo
replacement "run0" in systemd, which works by talking to systemd and
authenticating via polkit:
https://mastodon.social/@pid_eins/112353324518585654