Re: [PATCH RFC v3 1/2] pidfs: add inode ownership and permission checks

From: Jann Horn

Date: Tue Feb 24 2026 - 16:23:32 EST

On Mon, Feb 23, 2026 at 2:20 PM Christian Brauner <brauner@xxxxxxxxxx> wrote:
> Right now we only support trusted.* xattrs which require CAP_SYS_ADMIN
> which doesn't really require any meaningful permission checking. But in
> order to support user.* xattrs and custom pidfs.* xattrs in the future
> we need permission checking for pidfs inodes. Add baseline permission
> checking that can later be extended with additional write-time checks
> for specific pidfs.* xattrs.
>
> Make the effective {u,g}id of the task the owner of the pidfs inode
> (like procfs does). The ownership is set when the dentry is first
> stashed and reported dynamically via getattr since credentials may
> change due to setuid() and similar operations. For kernel threads use
> root, for exited tasks use the credentials saved at exit time.
>
> The inode's ownership is updated via WRITE_ONCE() from the getattr()
> and permission() callbacks. This doesn't serialize against
> inode->i_op->setattr() but since pidfs rejects setattr() this isn't
> currently an issue. A seqcount-based approach can be used if setattr()
> support is added in the future [1].
>
> Save the task's credentials and thread group pid inode number at exit
> time so that ownership and permission checks remain functional after
> the task has been reaped.
>
> Add a permission callback that checks access in two steps:
>
> (1) Verify the caller is either in the same thread group as the target
> or has equivalent signal permissions. This reuses the same
> uid-based logic as kill() by extracting may_signal_creds() from
> kill_ok_by_cred() so it can operate on credential pointers
> directly. For exited tasks the check uses the saved exit
> credentials and compares thread group identity.
>
> (2) Perform standard POSIX permission checking via generic_permission()
> against the inode's ownership and mode bits.
>
> This is intentionally less strict than ptrace_may_access() because pidfs
> currently does not allow operating on data that is completely private to
> the process such as its mm or file descriptors. Additional checks will
> be needed once that changes.
>
> Link: https://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git/log/?h=work.inode.seqcount [1]
> Signed-off-by: Christian Brauner <brauner@xxxxxxxxxx>
> ---
> fs/pidfs.c | 133 +++++++++++++++++++++++++++++++++++++++++++++++----
> include/linux/cred.h | 2 +
> kernel/signal.c | 19 ++++----
> 3 files changed, 136 insertions(+), 18 deletions(-)
>
> diff --git a/fs/pidfs.c b/fs/pidfs.c
> index 318253344b5c..16a3cfa84af4 100644
> --- a/fs/pidfs.c
> +++ b/fs/pidfs.c
[...]
> +static int pidfs_permission(struct mnt_idmap *idmap, struct inode *inode,
> + int mask)
> +{
> + struct pid *pid = inode->i_private;
> + struct task_struct *task;
> + const struct cred *cred;
> + u64 pid_tg_ino;
> +
> + scoped_guard(rcu) {
> + task = pid_task(pid, PIDTYPE_PID);
> + if (task) {
> + if (unlikely(task->flags & PF_KTHREAD))
> + return -EPERM;
> +
> + cred = __task_cred(task);
> + pid_tg_ino = task_tgid(task)->ino;

Can this NULL deref if the task concurrently gets reaped and has
detach_pid() called on it?

> + } else {
> + struct pidfs_attr *attr;
> +
> + attr = READ_ONCE(pid->attr);
> + VFS_WARN_ON_ONCE(!attr);
> + /*
> + * During copy_process() with CLONE_PIDFD the
> + * task hasn't been attached to the pid yet so
> + * pid_task() returns NULL and there's no
> + * exit_cred as the task obviously hasn't
> + * exited. Use the parent's credentials.
> + */
> + cred = attr->exit_cred;
> + if (!cred)
> + cred = current_cred();
> + pid_tg_ino = attr->exit_tgid_ino;
> + }
> +
> + /*
> + * If the caller and the target are in the same
> + * thread-group or the caller can signal the target
> + * we're good.
> + */
> + if (pid_tg_ino != task_tgid(current)->ino &&
> + !may_signal_creds(current_cred(), cred))
> + return -EPERM;
> +
> + /*
> + * This is racy but not more racy then what we generally
> + * do for permission checking.
> + */
> + WRITE_ONCE(inode->i_uid, cred->euid);
> + WRITE_ONCE(inode->i_gid, cred->egid);

I realize that using ->euid here matches what procfs does in
task_dump_owner(), but it doesn't make sense to me. The EUID is kinda
inherently "subjective" and doesn't really make sense in a context
like this where we're treating the process as an object. See also how,
when sending a signal to a process, kill_ok_by_cred() doesn't care
about the EUID of the target process, because it would be silly to
allow a user to send signals to a fileserver that happens to briefly
call seteuid() to access the filesystem in the name of the user, or
something like that.

The thing that IMO most expresses the objective identity of a process
is the RUID (notably that includes that it stays the same across
setuid execution unless the process explicitly changes RUID).

I think this should be using cred->uid and cred->gid so that the
permission check below makes more sense. That said, if you want to
instead follow the precedent of procfs and rely on the more explicit
permission checks above to actually provide security, I guess that
works too...

> + }
> + return generic_permission(&nop_mnt_idmap, inode, mask);
> +}
[...]
> {
> diff --git a/kernel/signal.c b/kernel/signal.c
> index d65d0fe24bfb..e20dabf143c2 100644
> --- a/kernel/signal.c
> +++ b/kernel/signal.c
> @@ -777,19 +777,22 @@ static inline bool si_fromuser(const struct kernel_siginfo *info)
> (!is_si_special(info) && SI_FROMUSER(info));
> }
>
> +bool may_signal_creds(const struct cred *signaler_cred,
> + const struct cred *signalee_cred)
> +{
> + return uid_eq(signaler_cred->euid, signalee_cred->suid) ||
> + uid_eq(signaler_cred->euid, signalee_cred->uid) ||
> + uid_eq(signaler_cred->uid, signalee_cred->suid) ||
> + uid_eq(signaler_cred->uid, signalee_cred->uid) ||
> + ns_capable(signalee_cred->user_ns, CAP_KILL);
> +}

I don't like reusing the signal sending permission checks here - in my
opinion, filesystem operations shouldn't be allowed based on the
caller's *RUID*. They should ideally be using the caller's FSUID.

> +
> /*
> * called with RCU read lock from check_kill_permission()
> */
> static bool kill_ok_by_cred(struct task_struct *t)
> {
> - const struct cred *cred = current_cred();
> - const struct cred *tcred = __task_cred(t);
> -
> - return uid_eq(cred->euid, tcred->suid) ||
> - uid_eq(cred->euid, tcred->uid) ||
> - uid_eq(cred->uid, tcred->suid) ||
> - uid_eq(cred->uid, tcred->uid) ||
> - ns_capable(tcred->user_ns, CAP_KILL);
> + return may_signal_creds(current_cred(), __task_cred(t));
> }
>
> /*
>
> --
> 2.47.3
>