Re: [RFC PATCH] pidfs: ensure consistent ENOENT/ESRCH reporting

From: Oleg Nesterov
Date: Thu Apr 10 2025 - 09:18:39 EST


On 04/10, Christian Brauner wrote:
>
> On Thu, Apr 10, 2025 at 12:18:01PM +0200, Oleg Nesterov wrote:
> > On 04/09, Oleg Nesterov wrote:
> > >
> > > On 04/09, Christian Brauner wrote:
> > > >
> > > > The seqcounter might be
> > > > useful independent of pidfs.
> > >
> > > Are you sure? ;) to me the new pid->pid_seq needs more justification...
>
> Yeah, pretty much. I'd make use of this in other cases where we need to
> detect concurrent changes to struct pid without having to take any
> locks. Multi-threaded exec in de_exec() comes to mind as well.

Perhaps you are right, but so far I am still not sure it makes sense.
And we can always add it later if we have another (more convincing)
use-case.

> > To remind, detach_pid(pid, PIDTYPE_PID) does wake_up_all(&pid->wait_pidfd) and
> > takes pid->wait_pidfd->lock.
> >
> > So if pid_has_task(PIDTYPE_PID) succeeds, __unhash_process() -> detach_pid(TGID)
> > is not possible until we drop pid->wait_pidfd->lock.
> >
> > If detach_pid(PIDTYPE_PID) was already called and have passed wake_up_all(),
> > pid_has_task(PIDTYPE_PID) can't succeed.
>
> I know. I was trying to avoid having to take the lock and just make this
> lockless. But if you think we should use this lock here instead I'm
> willing to do this. I just find the sequence counter more elegant than
> the spin_lock_irq().

This is subjective, and quite possibly I am wrong. But yes, I'd prefer
to (ab)use pid->wait_pidfd->lock in pidfd_prepare() for now and not
penalize __unhash_process(). Simply because this is simpler.

If you really dislike taking wait_pidfd->lock, we can add mb() into
__unhash_process() or even smp_mb__after_spinlock() into __change_pid(),
but this will need a lengthy comment...



As for your patch... it doesn't apply on top of 3/4, but I guess it
is clear what does it do, and (unfortunately ;) it looks correct, so
I won't insist too much. See a couple of nits below.

> this imho and it would give pidfds a reliable way to detect relevant
> concurrent changes locklessly without penalizing other critical paths
> (e.g., under tasklist_lock) in the kernel.

Can't resist... Note that raw_seqcount_begin() in pidfd_prepare() will
take/drop tasklist_lock if it races with __unhash_process() on PREEMPT_RT.
Yes, this is unlikely case, but still...

Now. Unless I misread your patch, pidfd_prepare() does "err = 0" only
once before the main loop. And this is correct. But this means that
we do not need the do/while loop.

If read_seqcount_retry() returns true, we can safely return -ESRCH. So
we can do

seq = raw_seqcount_begin(&pid->pid_seq);

if (!PIDFD_THREAD && !pid_has_task(PIDTYPE_TGID))
err = -ENOENT;

if (!pid_has_task(PIDTYPE_PID))
err = -ESRCH;

if (read_seqcount_retry(pid->pid_seq, seq))
err = -ESRCH;

In fact we don't even need raw_seqcount_begin(), we could use
raw_seqcount_try_begin().

And why seqcount_rwlock_t? A plain seqcount_t can equally work.


And, if we use seqcount_rwlock_t,

lockdep_assert_held_write(&tasklist_lock);
...
raw_write_seqcount_begin(pid->pid_seq);

in __unhash_process() looks a bit strange. I'd suggest to use
write_seqcount_begin() which does seqprop_assert() and kill
lockdep_assert_held_write().

Oleg.