Re: [RFC PATCH] pidfs: ensure consistent ENOENT/ESRCH reporting
From: Christian Brauner
Date: Thu Apr 10 2025 - 16:06:28 EST
On Thu, Apr 10, 2025 at 03:10:09PM +0200, Oleg Nesterov wrote:
> On 04/10, Christian Brauner wrote:
> >
> > On Thu, Apr 10, 2025 at 12:18:01PM +0200, Oleg Nesterov wrote:
> > > On 04/09, Oleg Nesterov wrote:
> > > >
> > > > On 04/09, Christian Brauner wrote:
> > > > >
> > > > > The seqcounter might be
> > > > > useful independent of pidfs.
> > > >
> > > > Are you sure? ;) To me, the new pid->pid_seq needs more justification...
> >
> > Yeah, pretty much. I'd make use of this in other cases where we need to
> > detect concurrent changes to struct pid without having to take any
> > locks. Multi-threaded exec in de_thread() comes to mind as well.
>
> Perhaps you are right, but so far I am still not sure it makes sense.
> And we can always add it later if we have another (more convincing)
> use-case.
>
> > > To remind, detach_pid(pid, PIDTYPE_PID) does wake_up_all(&pid->wait_pidfd) and
> > > takes pid->wait_pidfd->lock.
> > >
> > > So if pid_has_task(PIDTYPE_PID) succeeds, __unhash_process() -> detach_pid(TGID)
> > > is not possible until we drop pid->wait_pidfd->lock.
> > >
> > > If detach_pid(PIDTYPE_PID) was already called and has passed wake_up_all(),
> > > pid_has_task(PIDTYPE_PID) can't succeed.
> >
> > I know. I was trying to avoid having to take the lock and just make this
> > lockless. But if you think we should use this lock here instead I'm
> > willing to do this. I just find the sequence counter more elegant than
> > the spin_lock_irq().
>
> This is subjective, and quite possibly I am wrong. But yes, I'd prefer
> to (ab)use pid->wait_pidfd->lock in pidfd_prepare() for now and not
> penalize __unhash_process(). Simply because this is simpler.
>
> If you really dislike taking wait_pidfd->lock, we can add mb() into
> __unhash_process() or even smp_mb__after_spinlock() into __change_pid(),
> but this will need a lengthy comment...
No, I don't think we should do that.
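FWIW, if we do end up going the wait_pidfd.lock route, I'd expect
pidfd_prepare() to gain something roughly like this (untested sketch; the
flags/err locals and the exact placement are assumptions, not the final
patch):

	/*
	 * Sketch: holding the waitqueue lock excludes __unhash_process(),
	 * since detach_pid(PIDTYPE_PID) takes the same lock via
	 * wake_up_all(&pid->wait_pidfd).
	 */
	spin_lock_irq(&pid->wait_pidfd.lock);
	if (!(flags & PIDFD_THREAD) && !pid_has_task(pid, PIDTYPE_TGID))
		err = -ENOENT;
	else if (!pid_has_task(pid, PIDTYPE_PID))
		err = -ESRCH;
	else
		err = 0;
	spin_unlock_irq(&pid->wait_pidfd.lock);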
> As for your patch... it doesn't apply on top of 3/4, but I guess it
> is clear what it does, and (unfortunately ;) it looks correct, so
> I won't insist too much. See a couple of nits below.
>
> > this imho and it would give pidfds a reliable way to detect relevant
> > concurrent changes locklessly without penalizing other critical paths
> > (e.g., under tasklist_lock) in the kernel.
>
> Can't resist... Note that raw_seqcount_begin() in pidfd_prepare() will
> take/drop tasklist_lock if it races with __unhash_process() on PREEMPT_RT.
Eeeeew,
	if (!IS_ENABLED(CONFIG_PREEMPT_RT))				\
		return seq;						\
									\
	if (preemptible && unlikely(seq & 1)) {				\
		__SEQ_LOCK(lockbase##_lock(s->lock));			\
		__SEQ_LOCK(lockbase##_unlock(s->lock));			\
A priority inversion fix, I take it. That's as ugly as what we had to
do for mnt_get_write_access()...
I actually think what you just pointed out is rather problematic. It's
absolutely wild that raw_seqcount_begin() suddenly implies locking.
How isn't that a huge landmine? On non-rt I can happily do:
	acquire_associated_lock()
	raw_seqcount_begin()
	drop_associated_lock()
But this will immediately turn into a deadlock on preempt-rt, no?
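With this patch's names, the pattern would be, say (hypothetical sketch,
purely for illustration):

	write_lock_irq(&tasklist_lock);
	/*
	 * On PREEMPT_RT, raw_seqcount_begin() may take and drop the
	 * associated lock if it observes an odd count, i.e. it can try
	 * to acquire a lock the caller already holds.
	 */
	seq = raw_seqcount_begin(&pid->pid_seq);
	write_unlock_irq(&tasklist_lock);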
> Yes, this is an unlikely case, but still...
>
> Now. Unless I misread your patch, pidfd_prepare() does "err = 0" only
> once before the main loop. And this is correct. But this means that
> we do not need the do/while loop.
Yes, I know. I simply used the common idiom.
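I.e. roughly this, going by your reading of the patch (sketch, not the
literal diff):

	err = 0;
	do {
		seq = read_seqcount_begin(&pid->pid_seq);
		if (!(flags & PIDFD_THREAD) && !pid_has_task(pid, PIDTYPE_TGID))
			err = -ENOENT;
		if (!pid_has_task(pid, PIDTYPE_PID))
			err = -ESRCH;
	} while (read_seqcount_retry(&pid->pid_seq, seq));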
>
> If read_seqcount_retry() returns true, we can safely return -ESRCH. So
> we can do
>
> 	seq = raw_seqcount_begin(&pid->pid_seq);
>
> 	if (!PIDFD_THREAD && !pid_has_task(PIDTYPE_TGID))
> 		err = -ENOENT;
>
> 	if (!pid_has_task(PIDTYPE_PID))
> 		err = -ESRCH;
>
> 	if (read_seqcount_retry(&pid->pid_seq, seq))
> 		err = -ESRCH;
>
> In fact we don't even need raw_seqcount_begin(), we could use
> raw_seqcount_try_begin().
>
> And why seqcount_rwlock_t? A plain seqcount_t would work equally well.
Yes, but this way its dependence on tasklist_lock is natively integrated
with lockdep afaict:
/*
 * typedef seqcount_LOCKNAME_t - sequence counter with LOCKNAME associated
 * @seqcount: The real sequence counter
 * @lock: Pointer to the associated lock
 *
 * A plain sequence counter with external writer synchronization by
 * LOCKNAME @lock. The lock is associated to the sequence counter in the
 * static initializer or init function. This enables lockdep to validate
 * that the write side critical section is properly serialized.
 *
 * LOCKNAME: raw_spinlock, spinlock, rwlock or mutex
 */

/*
 * seqcount_LOCKNAME_init() - runtime initializer for seqcount_LOCKNAME_t
 * @s: Pointer to the seqcount_LOCKNAME_t instance
 * @lock: Pointer to the associated lock
 */
#define seqcount_LOCKNAME_init(s, _lock, lockname)			\
	do {								\
		seqcount_##lockname##_t *____s = (s);			\
		seqcount_init(&____s->seqcount);			\
		__SEQ_LOCK(____s->lock = (_lock));			\
	} while (0)

#define seqcount_raw_spinlock_init(s, lock)	seqcount_LOCKNAME_init(s, lock, raw_spinlock)
#define seqcount_spinlock_init(s, lock)		seqcount_LOCKNAME_init(s, lock, spinlock)
#define seqcount_rwlock_init(s, lock)		seqcount_LOCKNAME_init(s, lock, rwlock)
#define seqcount_mutex_init(s, lock)		seqcount_LOCKNAME_init(s, lock, mutex)
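IOW, the association boils down to a one-liner (sketch; assuming the
counter gets initialized wherever the rest of struct pid is set up):

	seqcount_rwlock_init(&pid->pid_seq, &tasklist_lock);

and lockdep will then validate that the write side runs under
tasklist_lock.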
> And, if we use seqcount_rwlock_t,
>
> 	lockdep_assert_held_write(&tasklist_lock);
> 	...
> 	raw_write_seqcount_begin(&pid->pid_seq);
>
> in __unhash_process() looks a bit strange. I'd suggest using
> write_seqcount_begin(), which does seqprop_assert(), and killing the
> lockdep_assert_held_write().
>
> Oleg.
>