Re: [RFC PATCH 1/3] seccomp: add SECCOMP_IOCTL_NOTIF_PIN_ARGS to close the unotify TOCTOU race
From: Cong Wang
Date: Wed May 06 2026 - 01:01:01 EST
On Mon, May 4, 2026 at 8:51 PM Christian Brauner <brauner@xxxxxxxxxx> wrote:
>
> On Sun, 03 May 2026 18:12:05 -0700, Cong Wang <xiyou.wangcong@xxxxxxxxx> wrote:
> > diff --git a/fs/namei.c b/fs/namei.c
> > index c7fac83c9a85..ee86f4c91cae 100644
> > --- a/fs/namei.c
> > +++ b/fs/namei.c
> > @@ -222,6 +223,24 @@ do_getname(const char __user *filename, int flags, bool incomplete)
> > [ ... skip 15 lines ... ]
> > + pin = seccomp_pin_lookup_current((u64)(uintptr_t)filename);
> > + if (pin && pin->kind == SECCOMP_PIN_CSTRING) {
> > + if (pin->size <= 1 && !(flags & LOOKUP_EMPTY))
> > + return ERR_PTR(-ENOENT);
> > + return getname_kernel(pin->data);
> > + }
>
> Sorry, no. That's just not acceptable at all. We're not spraying
> "continue from snapshotted state" across the vfs and the kernel in
> general. This is just screaming for security issues. Anything that wants
> to do something remotely like this needs to come as generic abstraction
> where the syscall layer itself doesn't have to care at all about this.
> There are just so many corners where you run into issues with this.
You're right. Having every fetch site consult a per-task pin pointer is
exactly the kind of cross-cutting awareness that doesn't scale.

How about the following direction instead?

Reshape the mechanism as a PTRACE_SYSCALL-style redirect, applied at
the notification reply path. The supervisor describes:
	struct seccomp_notif_inject {
		__u64 id;
		__u64 nr;
		__u64 args[6];
		__u64 buf;		/* __user, kernel-input bytes */
		__u32 buf_size;
		__u32 args_in_buf_mask;	/* bit i: args[i] is offset into buf */
	};
NOTIF_SEND with a new FLAG_INJECTED applies the redirect. The trapped
task's nr/args registers are set via syscall_set_nr() and
syscall_set_arguments() (the same primitives ptrace uses for syscall
substitution today), and any arg flagged in args_in_buf_mask is
satisfied from a kernel-side buffer rather than from the trapped
task's mm. fs/, net/, mm/, lib/ get zero changes. The whole feature
lives in a new kernel/seccomp_inject.c plus a small dispatcher in
kernel/seccomp.c.
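
To make the args_in_buf_mask semantics concrete, here is a rough
userspace sketch of the checks I have in mind. Only the struct layout
comes from the proposal above; the helper names (inject_args_valid,
inject_arg_in_buf, inject_cstring_terminated) and the exact validation
rules are assumptions for illustration, not code from the series. kbuf
stands for the kernel-side copy of the supervisor's buffer.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define INJECT_NR_ARGS 6

/* Userspace mirror of the proposed struct; uintN_t stands in for __uN. */
struct seccomp_notif_inject {
	uint64_t id;
	uint64_t nr;
	uint64_t args[INJECT_NR_ARGS];
	uint64_t buf;			/* supervisor pointer, unused in this sketch */
	uint32_t buf_size;
	uint32_t args_in_buf_mask;	/* bit i: args[i] is offset into buf */
};

/*
 * Hypothetical validation: reject mask bits beyond the six syscall
 * arguments, and require every flagged argument to be an offset that
 * lands inside the buffer.
 */
static bool inject_args_valid(const struct seccomp_notif_inject *inj)
{
	if (inj->args_in_buf_mask >> INJECT_NR_ARGS)
		return false;

	for (unsigned int i = 0; i < INJECT_NR_ARGS; i++) {
		if (!(inj->args_in_buf_mask & (1u << i)))
			continue;
		if (inj->args[i] >= inj->buf_size)
			return false;
	}
	return true;
}

/*
 * Hypothetical lookup: return where argument i lives in the kernel-side
 * copy of the buffer, or NULL if the argument is a plain register value.
 * The caller is expected to have run inject_args_valid() first.
 */
static const void *inject_arg_in_buf(const struct seccomp_notif_inject *inj,
				     const uint8_t *kbuf, unsigned int i)
{
	assert(i < INJECT_NR_ARGS);

	if (!(inj->args_in_buf_mask & (1u << i)))
		return NULL;
	return kbuf + inj->args[i];
}

/*
 * Hypothetical extra check for string arguments: the NUL terminator
 * must fall inside the buffer, so a consumer never reads past the end.
 */
static bool inject_cstring_terminated(const struct seccomp_notif_inject *inj,
				      const uint8_t *kbuf, unsigned int i)
{
	uint64_t off = inj->args[i];

	if (off >= inj->buf_size)
		return false;
	return memchr(kbuf + off, '\0', inj->buf_size - off) != NULL;
}
```

The point of the sketch is that all of the validation happens once,
against the supervisor-provided buffer, before the redirected syscall
runs; nothing in it depends on the trapped task's mm.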
This is intentionally a strict subset of what ptrace can already do
via PTRACE_POKEDATA + PTRACE_SETREGSET. It does not add a new
kernel capability; it provides a narrower, listener-fd-gated,
syscall-whitelisted interface to an existing one for unprivileged
seccomp_unotify supervisors, for whom ptrace's privilege model and
per-syscall overhead are not viable. SECCOMP_IOCTL_NOTIF_ADDFD
set the precedent for this kind of narrow listener-fd interface to a
ptrace-overlapping capability.
Thanks,
Cong