Re: [PATCH RFC v3 1/2] pidfs: add inode ownership and permission checks

From: Christian Brauner

Date: Wed Feb 25 2026 - 10:33:35 EST

> > +static int pidfs_permission(struct mnt_idmap *idmap, struct inode *inode,
> > + int mask)
> > +{
> > + struct pid *pid = inode->i_private;
> > + struct task_struct *task;
> > + const struct cred *cred;
> > + u64 pid_tg_ino;
> > +
> > + scoped_guard(rcu) {
> > + task = pid_task(pid, PIDTYPE_PID);
> > + if (task) {
> > + if (unlikely(task->flags & PF_KTHREAD))
> > + return -EPERM;
> > +
> > + cred = __task_cred(task);
> > + pid_tg_ino = task_tgid(task)->ino;
>
> Can this NULL deref if the task concurrently gets reaped and has
> detach_pid() called on it?

So the thread-group leader task/pid is kept around until all subthreads
have exited. That's what delay_group_leader() does. So two cases:

(1) So if @task wasn't a thread-group leader but a subthread then
task_tgid(task) cannot be NULL.

(2) If @task is itself a thread-group leader then yes, it's possible via
__unhash_process().

So yeah, this needs handling.

> > + } else {
> > + struct pidfs_attr *attr;
> > +
> > + attr = READ_ONCE(pid->attr);
> > + VFS_WARN_ON_ONCE(!attr);
> > + /*
> > + * During copy_process() with CLONE_PIDFD the
> > + * task hasn't been attached to the pid yet so
> > + * pid_task() returns NULL and there's no
> > + * exit_cred as the task obviously hasn't
> > + * exited. Use the parent's credentials.
> > + */
> > + cred = attr->exit_cred;
> > + if (!cred)
> > + cred = current_cred();
> > + pid_tg_ino = attr->exit_tgid_ino;
> > + }
> > +
> > + /*
> > + * If the caller and the target are in the same
> > + * thread-group or the caller can signal the target
> > + * we're good.
> > + */
> > + if (pid_tg_ino != task_tgid(current)->ino &&
> > + !may_signal_creds(current_cred(), cred))
> > + return -EPERM;
> > +
> > + /*
> > + * This is racy but not more racy then what we generally
> > + * do for permission checking.
> > + */
> > + WRITE_ONCE(inode->i_uid, cred->euid);
> > + WRITE_ONCE(inode->i_gid, cred->egid);
>
> I realize that using ->euid here matches what procfs does in
> task_dump_owner(), but it doesn't make sense to me. The EUID is kinda
> inherently "subjective" and doesn't really make sense in a context
> like this where we're treating the process as an object. See also how,
> when sending a signal to a process, kill_ok_by_cred() doesn't care
> about the EUID of the target process, because it would be silly to
> allow a user to send signals to a fileserver that happens to briefly
> call seteuid() to access the filesystem in the name of the user, or
> something like that.
>
> The thing that IMO most expresses the objective identity of a process
> is the RUID (notably that includes that it stays the same across
> setuid execution unless the process explicitly changes RUID).
>
> I think this should be using cred->uid and cred->gid so that the
> permission check below makes more sense. That said, if you want to
> instead follow the precedent of procfs and rely on the more explicit
> permission checks above to actually provide security, I guess that
> works too...
>
> > + }
> > + return generic_permission(&nop_mnt_idmap, inode, mask);
> > +}
> [...]
> > {
> > diff --git a/kernel/signal.c b/kernel/signal.c
> > index d65d0fe24bfb..e20dabf143c2 100644
> > --- a/kernel/signal.c
> > +++ b/kernel/signal.c
> > @@ -777,19 +777,22 @@ static inline bool si_fromuser(const struct kernel_siginfo *info)
> > (!is_si_special(info) && SI_FROMUSER(info));
> > }
> >
> > +bool may_signal_creds(const struct cred *signaler_cred,
> > + const struct cred *signalee_cred)
> > +{
> > + return uid_eq(signaler_cred->euid, signalee_cred->suid) ||
> > + uid_eq(signaler_cred->euid, signalee_cred->uid) ||
> > + uid_eq(signaler_cred->uid, signalee_cred->suid) ||
> > + uid_eq(signaler_cred->uid, signalee_cred->uid) ||
> > + ns_capable(signalee_cred->user_ns, CAP_KILL);
> > +}
>
> I don't like reusing the signal sending permission checks here - in my
> opinion, filesystem operations shouldn't be allowed based on the
> caller's *RUID*. They should ideally be using the caller's FSUID.

Ok, a combined model sounds great to me based on ruid/rgid of the target
task and the fsuid/fsgid of the caller.