Re: [PATCH RFC 3/4] pidfd: improve uapi when task isn't found
From: Oleg Nesterov
Date: Fri Apr 04 2025 - 08:38:28 EST
On 04/03, Christian Brauner wrote:
>
> We currently report EINVAL whenever a struct pid has no tasked attached
> anymore thereby conflating two concepts:
>
> (1) The task has already been reaped.
> (2) The caller requested a pidfd for a thread-group leader but the pid
> actually references a struct pid that isn't used as a thread-group
> leader.
>
> This is causing issues for non-threaded workloads as in [1].
>
> This patch tries to allow userspace to distinguish between (1) and (2).
> This is racy of course but that shouldn't matter.
>
> Link: https://github.com/systemd/systemd/pull/36982 [1]
> Signed-off-by: Christian Brauner <brauner@xxxxxxxxxx>
For this series:
Reviewed-by: Oleg Nesterov <oleg@xxxxxxxxxx>
But I have a couple of cosmetic nits...
> int pidfd_prepare(struct pid *pid, unsigned int flags, struct file **ret)
> {
> - bool thread = flags & PIDFD_THREAD;
> + int err = 0;
>
> - if (!pid_has_task(pid, thread ? PIDTYPE_PID : PIDTYPE_TGID))
> - return -EINVAL;
> + if (!(flags & PIDFD_THREAD)) {
> + /*
> + * If this is struct pid isn't used as a thread-group
> + * leader pid but the caller requested to create a
> + * thread-group leader pidfd then report ENOENT to the
> + * caller as a hint.
> + */
> + if (!pid_has_task(pid, PIDTYPE_TGID))
> + err = -ENOENT;
> + }
> +
> + /*
> + * If this wasn't a thread-group leader struct pid or the task
> + * got reaped in the meantime report -ESRCH to userspace.
> + *
> + * This is racy of course. This could've not been a thread-group
> + * leader struct pid and we set ENOENT above but in the meantime
> + * the task got reaped. Or there was a multi-threaded-exec by a
> + * subthread and we were a thread-group leader but now got
> + * killed.
The comment about the multi-threaded-exec looks a bit misleading to me.
If this pid is a group-leader-pid and we race with de_thread() which does
exchange_tids(tsk, leader);
transfer_pid(leader, tsk, PIDTYPE_TGID);
nothing "bad" can happen, both pid_has_task(PIDTYPE_PID) or
pid_has_task(PIDTYPE_TGID) can't return NULL during (or after) this
transition.
hlists_swap_heads_rcu() or hlist_replace_rcu() can't make
hlist_head->first == NULL during this transition...
Or I misunderstood the comment?
And... the code looks a bit overcomplicated to me, why not simply
int pidfd_prepare(struct pid *pid, unsigned int flags, struct file **ret)
{
if (!pid_has_task(pid, PIDTYPE_PID))
return -ESRCH;
if (!(flags & PIDFD_THREAD) && !pid_has_task(pid, PIDTYPE_TGID))
return -ENOENT;
return __pidfd_prepare(pid, flags, ret);
}
? Of course, the comments should stay.
But again, this is cosmetic/subjective, please do what you like more.
Oleg.