Re: [PATCH 0/2] proc: protect ptrace_may_access() with exec_update_lock
From: Jann Horn
Date: Wed May 27 2026 - 09:50:36 EST
On Wed, May 27, 2026 at 2:31 PM Christian Brauner <brauner@xxxxxxxxxx> wrote:
> On Wed, May 27, 2026 at 02:01:51PM +0200, Christian Brauner wrote:
> > On Tue, May 26, 2026 at 08:22:38PM +0200, Jann Horn wrote:
> > > On Mon, May 25, 2026 at 9:56 PM Eric W. Biederman <ebiederm@xxxxxxxxxxxx> wrote:
> > > > I have added a couple more people who might be interested.
> > > >
> > > > Kees Cook because as you have structured this it is an exec problem.
> > > >
> > > > Oleg Nesterov as he is knowledgable about ptrace.
> > > >
> > > > Jann Horn <jannh@xxxxxxxxxx> writes:
> > > >
> > > > > My understanding is that procfs is effectively maintained by the VFS
> > > > > maintainers (though scripts/get_maintainer.pl claims that there are
> > > > > no maintainers for procfs because the VFS entry only claims files
> > > > > directly in fs/, and the procfs entry has no maintainers listed on
> > > > > it).
> > > > >
> > > > > In procfs, most uses of ptrace_may_access() should use
> > > > > exec_update_lock to avoid TOCTOU issues with concurrent privileged
> > > > > execve() (like setuid binary execution).
> > > > >
> > > > > This series doesn't fix all the remaining issues in procfs, but it fixes
> > > > > the easy cases for now; I will probably follow up with fixes for the
> > > > > gnarlier cases later unless someone else wants to do that.
> > > > >
> > > > > I have checked that procfs files still work with these changes and that
> > > > > CONFIG_PROVE_LOCKING=y doesn't generate any warnings.
> > > > >
> > > > > (checkpatch complains about missing argument names in
> > > > > proc_op::proc_get_link, but that was already the case before my
> > > > > patch.)
> > > >
> > > >
> > > > I think I finally have my context paged back in so I can intelligently
> > > > say something about this series.
> > > >
> > > > The scenario you are worried about is when exec gains privileges,
> > > > and we read through proc and authenticate with the old credentials
> > > > instead of the new credentials.
> > > >
> > > > Question 1.
> > > >
> > > > Assuming the executable is world readable (which they generally are)
> > > > is there anything that becomes accessible in that race that was
> > > > not already accessible?
> > >
> > > I believe so - the gnarliest example I am thinking of is:
> > > Memfds are always mode 0777 or 0666 (see __shmem_file_setup, which
> > > sets S_IRWXUGO), so their access control is purely based on being able
> > > to pathwalk to the memfd's inode. If you can race
> > > open(/proc/$pid/fd/$n) with the process $pid going through setuid
> > > execution and calling memfd_create(), you should be able to get
> > > read+write access to the memfd created by the setuid binary that was
> > > supposed to be private.
> > >
> > > (But I have not tested that and don't know if there are actually any
> > > setuid binaries that happen to use memfds.)
> > >
> > > > Question 2.
> > > >
> > > > How does this race compare to racing with setresuid?
> > > > Do we need to fix the setresuid case as well?
> > >
> > > Which setresuid case? setresuid clears the dumpable flag and has a
> > > memory barrier that is supposed to make that properly ordered against
> > > ptrace_may_access(); so setresuid() should normally not cause a task
> > > to become traceable, though that could maybe happen in weird
> > > scenarios.
> > >
> > > I think another case we should probably care about is what happens if
> > > a process which is only protected against ptrace by being non-dumpable
> > > goes through execve() - it shouldn't be possible to access resources
> > > associated with the pre-execve state while checking against the
> > > post-execve dumpability. It might be important for this that the
> > > do_close_on_exec() logic currently happens after committing the
> > > dumpable state in exec_mmap()...
> > >
> > > > Question 3.
> > > > Do we care about the case when a privileged process calls a setuid
> > > > process and drops privileges?
> > >
> > > I don't understand the question. Hmm - do you mean a case where a
> > > process with ruid=1000, euid=0, suid=1000 does execve() on a setuid
> > > 1000 binary? I think we probably don't specifically care about that...
> > >
> > > I think another scenario that we ideally might want to care about is
> > > what happens if a process which runs with a normal user's UIDs, but is
> > > non-dumpable, goes through execve() of a normal binary while another
> > > process tries to inspect its FDs or address space layout - it probably
> > > shouldn't be possible to get information about the pre-execve MM and
> > > O_CLOEXEC file descriptors.
> > >
> > > > Question 4.
> > > > Is it possible to use a seq_lock instead of reader writer semaphore?
> > > > Or is that only for non-sleeping readers?
> > >
> > > Linux seqcounts are 32-bit, which means they are always kind of dodgy,
> > > but they are particularly dodgy if a reader can be forced to sleep for
> > > an extended amount of time. I don't see a reason why we couldn't, in
> > > general, use a 64-bit sequence count for readers that may need to
> > > sleep while reading.
> >
> > I have a patch series for this that I started working after merging your
> > series for precisely this reason: performance. It's a few days old now.
> > I've tried various approaches and I started with a simple 32-bit counter
> > as the POC. See appended (untested) patches.
>
> In a bunch of cases we know that the critical section the callers cares
> about just is very small: creds + mm. So in that case it is easy to
> switch the credential computation into a prepare stage and a commit
> stage and then the targeted critical section just becomes:
> task->signal->seq_mm++ + task->cred = new_cred + task->mm = mm +
> task->active_mm = mm + task->signal->seq_mm--. And then the reader
> doesn't need to sleep at all and can just spin on the seqcount for the
> small window they need.
I think it's probably good to avoid creating custom spinning
primitives if possible.
We'd have to also disable preemption around the writer side to ensure
that you can't get latency spikes when such a writer happens to be
preempted by a spinning reader at a bad time, and then we'd still not
have the proper paravirt spinning to deal with vCPU preemption that
qspinlocks provide, which AFAIU could theoretically also cause latency
spikes in virtualized scenarios...