Re: [PATCH RFC 3/4] pidfd: improve uapi when task isn't found

From: Oleg Nesterov
Date: Fri Apr 04 2025 - 08:38:28 EST


On 04/03, Christian Brauner wrote:
>
> We currently report EINVAL whenever a struct pid has no tasked attached
> anymore thereby conflating two concepts:
>
> (1) The task has already been reaped.
> (2) The caller requested a pidfd for a thread-group leader but the pid
> actually references a struct pid that isn't used as a thread-group
> leader.
>
> This is causing issues for non-threaded workloads as in [1].
>
> This patch tries to allow userspace to distinguish between (1) and (2).
> This is racy of course but that shouldn't matter.
>
> Link: https://github.com/systemd/systemd/pull/36982 [1]
> Signed-off-by: Christian Brauner <brauner@xxxxxxxxxx>

For this series:

Reviewed-by: Oleg Nesterov <oleg@xxxxxxxxxx>


But I have a couple of cosmetic nits...

> int pidfd_prepare(struct pid *pid, unsigned int flags, struct file **ret)
> {
> - bool thread = flags & PIDFD_THREAD;
> + int err = 0;
>
> - if (!pid_has_task(pid, thread ? PIDTYPE_PID : PIDTYPE_TGID))
> - return -EINVAL;
> + if (!(flags & PIDFD_THREAD)) {
> + /*
> + * If this is struct pid isn't used as a thread-group
> + * leader pid but the caller requested to create a
> + * thread-group leader pidfd then report ENOENT to the
> + * caller as a hint.
> + */
> + if (!pid_has_task(pid, PIDTYPE_TGID))
> + err = -ENOENT;
> + }
> +
> + /*
> + * If this wasn't a thread-group leader struct pid or the task
> + * got reaped in the meantime report -ESRCH to userspace.
> + *
> + * This is racy of course. This could've not been a thread-group
> + * leader struct pid and we set ENOENT above but in the meantime
> + * the task got reaped. Or there was a multi-threaded-exec by a
> + * subthread and we were a thread-group leader but now got
> + * killed.

The comment about the multi-threaded-exec looks a bit misleading to me.
If this pid is a group-leader-pid and we race with de_thread() which does

exchange_tids(tsk, leader);
transfer_pid(leader, tsk, PIDTYPE_TGID);

nothing "bad" can happen, both pid_has_task(PIDTYPE_PID) or
pid_has_task(PIDTYPE_TGID) can't return NULL during (or after) this
transition.

hlists_swap_heads_rcu() or hlist_replace_rcu() can't make
hlist_head->first == NULL during this transition...

Or I misunderstood the comment?

And... the code looks a bit overcomplicated to me, why not simply

int pidfd_prepare(struct pid *pid, unsigned int flags, struct file **ret)
{
if (!pid_has_task(pid, PIDTYPE_PID))
return -ESRCH;

if (!(flags & PIDFD_THREAD) && !pid_has_task(pid, PIDTYPE_TGID))
return -ENOENT;

return __pidfd_prepare(pid, flags, ret);
}

? Of course, the comments should stay.

But again, this is cosmetic/subjective, please do what you like more.

Oleg.