Re: [PATCH 0/2] proc: protect ptrace_may_access() with exec_update_lock

From: Jann Horn

Date: Tue May 26 2026 - 14:23:26 EST


On Mon, May 25, 2026 at 9:56 PM Eric W. Biederman <ebiederm@xxxxxxxxxxxx> wrote:
> I have added a couple more people who might be interested.
>
> Kees Cook because as you have structured this it is an exec problem.
>
> Oleg Nesterov as he is knowledgable about ptrace.
>
> Jann Horn <jannh@xxxxxxxxxx> writes:
>
> > My understanding is that procfs is effectively maintained by the VFS
> > maintainers (though scripts/get_maintainer.pl claims that there are
> > no maintainers for procfs because the VFS entry only claims files
> > directly in fs/, and the procfs entry has no maintainers listed on
> > it).
> >
> > In procfs, most uses of ptrace_may_access() should use
> > exec_update_lock to avoid TOCTOU issues with concurrent privileged
> > execve() (like setuid binary execution).
> >
> > This series doesn't fix all the remaining issues in procfs, but it fixes
> > the easy cases for now; I will probably follow up with fixes for the
> > gnarlier cases later unless someone else wants to do that.
> >
> > I have checked that procfs files still work with these changes and that
> > CONFIG_PROVE_LOCKING=y doesn't generate any warnings.
> >
> > (checkpatch complains about missing argument names in
> > proc_op::proc_get_link, but that was already the case before my
> > patch.)
>
>
> I think I finally have my context paged back in so I can intelligently
> say something about this series.
>
> The scenario you are worried about is when exec gains privileges,
> and we read through proc and authenticate with the old credentials
> instead of the new credentials.
>
> Question 1.
>
> Assuming the executable is world readable (which they generally are)
> is there anything that becomes accessible in that race that was
> not already accessible?

I believe so - the gnarliest example I am thinking of is:
Memfds are always mode 0777 or 0666 (see __shmem_file_setup, which
sets S_IRWXUGO), so their access control is purely based on being able
to pathwalk to the memfd's inode. If you can race
open(/proc/$pid/fd/$n) with the process $pid going through setuid
execution and calling memfd_create(), you should be able to get
read+write access to the memfd created by the setuid binary that was
supposed to be private.

(But I have not tested that and don't know if there are actually any
setuid binaries that happen to use memfds.)

> Question 2.
>
> How does this race compare to racing with setresuid?
> Do we need to fix the setresuid case as well?

Which setresuid case? setresuid clears the dumpable flag and has a
memory barrier that is supposed to make that properly ordered against
ptrace_may_access(); so setresuid() should normally not cause a task
to become traceable, though that could maybe happen in weird
scenarios.

I think another case we should probably care about is what happens if
a process which is only protected against ptrace by being non-dumpable
goes through execve() - it shouldn't be possible to access resources
associated with the pre-execve state while checking against the
post-execve dumpability. It might be important for this that the
do_close_on_exec() logic currently happens after committing the
dumpable state in exec_mmap()...

> Question 3.
> Do we care about the case when a privileged process calls a setuid
> process and drops privileges?

I don't understand the question. Hmm - do you mean a case where a
process with ruid=1000, euid=0, suid=1000 does execve() on a setuid
1000 binary? I think we probably don't specifically care about that...

I think another scenario that we ideally might want to care about is
what happens if a process which runs with a normal user's UIDs, but is
non-dumpable, goes through execve() of a normal binary while another
process tries to inspect its FDs or address space layout - it probably
shouldn't be possible to get information about the pre-execve MM and
O_CLOEXEC file descriptors.

> Question 4.
> Is it possible to use a seq_lock instead of reader writer semaphore?
> Or is that only for non-sleeping readers?

Linux seqcounts are 32-bit, which means they are always kind of dodgy,
but they are particularly dodgy if a reader can be forced to sleep for
an extended amount of time. I don't see a reason why we couldn't, in
general, use a 64-bit sequence count for readers that may need to
sleep while reading.

> There have been a number of nasty cases lurking in the background
> involving seccomp filters, PTRACE_EVENT_EXIT, de_thread and the like.
>
> Blocking locks, especially ones that get widely used, just scare me in
> this area. Being able to see that something happened between start and
> finish and say -EAGAIN or retrying internally seems like it would be
> much less prone to weirdness.

I guess for do_task_stat() we could just switch to down_read_trylock()
instead of down_read_killable(), and proceed with "permitted = 0" if
the trylock fails - almost all the values shown are related to the MM,
and are therefore not stable across execve() anyway.

I think using seqlocks with a retry loop wouldn't work with the code
as-is, because in the middle of execve, there are points where the
file descriptor table still contains entries that we don't want to be
accessible with the task's current dumpability, or where we have
already switched to a new MM without having updated the credentials
yet.
I think we could make it work - we could add another set of creds to
the task, and let ptrace_may_access() check against both the
pre-execve and post-execve credentials and dumpability, but that feels
overengineered.

> The ugly with PTRACE_EVENT_EXIT as I recall is that if ptrace stops one
> of the threads (not the one calling exec) at PTRACE_EVENT_EXIT it can
> block de_thread, which blocks the rest of exec. But there is something
> in there where the ptracer hangs waiting for the exec to complete. So
> everything just stalls. The ptracer waiting for exec the exec waiting
> for the ptracer. SIGKILL can get you out of that mess last I looked.
> Still it is an ugly mess.
>
> Getting everything away from that mess is why we have exec_update_lock
> instead of just cred_guard_mutex.

And the exec_update_lock avoids that because it is not held in
de_thread(), only across the following part of execve, where not much
stuff happens that could block for a long time, right?

load_elf_binary
begin_new_exec
exec_mmap
down_write_killable(&tsk->signal->exec_update_lock)
mmput [brauner@ has a patch to move this]
flush_thread
do_close_on_exec [notably this can lead to filp->f_op->flush()
calls, which AFAIK can block forever on FUSE/NFS]
commit_creds
setup_new_exec
up_write(&me->signal->exec_update_lock)

I think we might want to do something about the do_close_on_exec()
stuff, like deferring the filp_flush() to a later time, but I don't
really see deadlock potential here.

> I would really appreciate hearing the scenarios you are aiming to fix
> and how this fixes them. There are enough races and special cases
> I don't feel comfortable reading that we just need exec_update_lock
> around ptrace_may_access. It is not clear to me that is sufficient
> to close the small races we are worried about here.

The main thing I'm trying to address here are scenarios of the shape
"process A accesses process B through procfs while process B goes
through a privileged execution (in particular by executing a setuid
binary)". /proc/$pid/fd/$fd (part of patch 2) seems particularly
egregious because it can likely be used to gain access to memfds of
setuid binaries; other files are less egregious, but might lead to
things like userspace ASLR/pointer leaks (in particular do_task_stat()
and proc_map_files_readdir()).

A second scenario I have in mind is "process A accesses process B
through procfs while process B goes through a normal execution that
makes it dumpable".

The overarching logic I have at the back of my mind here is: If an
"incarnation" is the combination of a process and an mm_struct, then
holding exec_update_lock ensures that the credentials/dumpability we
have observed are associated with the same incarnation as the MM and
the file descriptor table whose properties we read afterwards.

> If I could trace through someone else's logic I could be convinced
> and the next people to deal with the code could look at it and see
> ah. That is the detail that was missed when it has to be fixed again.