Re: [RFC PATCH v2 0/3] seccomp: SECCOMP_IOCTL_NOTIF_INJECT for race-free unotify

From: Andy Lutomirski

Date: Thu May 28 2026 - 14:19:46 EST

On Thu, May 28, 2026 at 10:42 AM Cong Wang <xiyou.wangcong@xxxxxxxxx> wrote:
>
> Hi Andy,
>
> On Tue, May 26, 2026 at 12:03 PM Andy Lutomirski <luto@xxxxxxxxxxxxxx> wrote:
> >
> > On Thu, May 14, 2026 at 9:27 PM Cong Wang <xiyou.wangcong@xxxxxxxxx> wrote:
> > >
> > > From: Cong Wang <cwang@xxxxxxxxxxxxxx>
> > >
> > > This is a complete rework of v1 (PIN_ARGS), reshaped to address the
> > > review feedback that having every syscall-arg fetch site consult a
> > > per-task pin pointer is cross-cutting awareness that does not scale.
> > >
> > > v1 thread:
> > > https://lore.kernel.org/lkml/20260504011207.539408-1-xiyou.wangcong@xxxxxxxxx
> > >
> > > ## Changes since v1
> >
> > Here are some thoughts:
> >
> > >
> > > v2 inverts the model. The supervisor no longer pins args for a
> > > resumed syscall body to consume; it describes a substitute syscall
> > > (nr + args[6]) whose pointer-shaped args are encoded as byte offsets
> > > into a kernel-side buffer. On SECCOMP_USER_NOTIF_FLAG_INJECTED, the
> > > trapped task wakes inside seccomp_do_user_notification(), dispatches
> > > into a kernel-mode syscall helper (filp_open / kernel_bind /
> > > kernel_write for v1), and the helper's return value becomes the
> > > trapped syscall's return value. The trapped task's user mm is never
> > > re-read for the substituted syscall.
> >
> > This sounds like it could be done well or it could be done poorly.
> > Doing it poorly sounds like it would resemble set_fs(), and set_fs()
> > was awful. Please don't reintroduce it or anything like it.
>
> Noted.
>
> For v2, the injectors call filp_open(), kernel_bind(), kernel_write(),
> kernel-pointer entrypoints that exist specifically so kernel callers
> don't have to spoof user-space.
>
> >
> > Doing it well sounds like introducing a bunch of new entrypoints. In
> > some sense this seems like a nice plan, except that essentially every
> > syscall doesn't work like that. So getting any sort of decent
> > coverage could involve extensive kernel changes and might involve
> > adding lots of new entrypoints that are only used for this new system,
> > which isn't great.
>
> This is a valid concern.
>
> Currently, we only have 3 entrypoints which all have pre-existing
> kernel-side APIs. In the future, we may need to extend it, for example,
> for execve().
>
> However, the number of syscalls we inject is still very small, compared
> with the total number of syscalls on Linux. I'd never anticipate this list
> to grow beyond 10, since most of the syscall injections don't have
> TOCTOU issues at all.
>
> >
> > > The full motivation, including the threat model (adversarial AI
> > > agents in the same address space) and the concrete user (Sandlock,
> > > https://github.com/multikernel/sandlock), is in the v1 cover letter
> > > above.
> >
> > Whoa there. "adversarial AI agents" aren't a threat model that makes
> > sense in this context. I think you mean "multiple tasks, all running
> > untrusted code, potentially sharing an address space".
>
> Right, I will update the wording.
>
> >
> >
> > But here are some other thoughts:
> >
> > > v1 injectable-syscall whitelist:
> > >
> > > - openat (filp_open + fd_install)
> > > - bind (sockfd_lookup + kernel_bind)
> > > - write (kernel_write)
> >
> > How gnarly would an actual API for this be? By "actual API" I mean an
> > fd that represents complete control over a target task (which the
> > existing seccomp fd sort of is) and syscalls issued against that fd
> > that do openat, bind, read, write, etc.
>
> Excellent suggestion! How about the following API?
>
>     ioctl(lfd, SECCOMP_IOCTL_NOTIF_RECV, &req);
>     /* req.data.nr == __NR_openat; args[1] is target's pointer. */
>
>     read_target_string(req.pid, req.data.args[1], path, sizeof(path));
>
>     if (policy_allows(path)) {
>         struct seccomp_notif_target_call call = {
>             .id = req.id,
>             .nr = __NR_openat,
>             .args = { AT_FDCWD, (uintptr_t)path, O_RDONLY, 0, 0, 0 },
>         };
>         ioctl(lfd, SECCOMP_IOCTL_NOTIF_TARGET_CALL, &call);
>
>         struct seccomp_notif_resp resp = {
>             .id = req.id,
>             .val = call.ret >= 0 ? call.ret : 0,
>             .error = call.ret >= 0 ? 0 : (int)call.ret,
>         };
>         ioctl(lfd, SECCOMP_IOCTL_NOTIF_SEND, &resp);
>     } else {
>         struct seccomp_notif_resp resp = {
>             .id = req.id, .error = -EACCES,
>         };
>         ioctl(lfd, SECCOMP_IOCTL_NOTIF_SEND, &resp);
>     }
>
>
> >
> > Or... what if there was a nice way to create a pinned mapping (and
> > verify it via seccompfd), MAP_SHARED, PROT_READ, of a memfd that the
> > supervisor owns. Then the supervisor could write syscall args into it
> > and re-point pointers into it.
>
> One concrete deployment constraint worth surfacing: Sandlock (and
> similar wrappers: Firejail, Bubblewrap-style sandboxes) work by
> fork+execve of arbitrary target binaries. The pinned-memfd
> approach needs the seal installed in a trusted window, but
> execve() replaces the address space, so anything mapped pre-exec
> is lost. The window between execve and the first instruction of
> the untrusted binary belongs to the dynamic loader (or to nothing
> at all for static binaries), not to the supervisor.

I don't think this matters. A good implementation would have the
seccomp ioctl interface (or a new syscall or whatever) be able to set
up the new pinned mapping without any particular cooperation from the
target process. So the process would start, and it would run freely
until its first syscall, and then you would install the pinned region.
And you would get notified on execve (ideally via a new notification
telling you, specifically, that the address space got cleared) so that
you know that the pinned region is gone (and that no other threads are
running concurrently in that address space!).
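
For concreteness, the supervisor-side half of that can already be written
with existing interfaces. This is only a rough sketch: the names
(arg_region, arg_region_init, arg_region_put_str), the one-page size and the
memfd name are all made up for illustration, and the step that would
actually install and verify the read-only pinned mapping in the target is
the part that does not exist yet, so it is absent here.

```c
#include <linux/memfd.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

#define ARG_REGION_SIZE 4096 /* illustrative: a single page */

struct arg_region {
    int fd;   /* memfd owned by the supervisor */
    char *rw; /* supervisor's writable view of it */
};

/* Create the supervisor-owned backing object and map it writable. */
static int arg_region_init(struct arg_region *r)
{
    r->fd = (int)syscall(SYS_memfd_create, "seccomp-args", MFD_CLOEXEC);
    if (r->fd < 0)
        return -1;
    if (ftruncate(r->fd, ARG_REGION_SIZE) < 0) {
        close(r->fd);
        return -1;
    }
    r->rw = mmap(NULL, ARG_REGION_SIZE, PROT_READ | PROT_WRITE,
                 MAP_SHARED, r->fd, 0);
    if (r->rw == MAP_FAILED) {
        close(r->fd);
        return -1;
    }
    return 0;
}

/*
 * Copy an already-vetted string argument into the region at offset off.
 * Returns off on success, or -1 if the string does not fit.  The
 * supervisor would then point the syscall argument at the target's
 * read-only view of this offset (the hypothetical kernel-side step).
 */
static long arg_region_put_str(struct arg_region *r, size_t off,
                               const char *s)
{
    size_t len = strlen(s) + 1; /* include the terminating NUL */

    if (off >= ARG_REGION_SIZE || len > ARG_REGION_SIZE - off)
        return -1;
    memcpy(r->rw + off, s, len);
    return (long)off;
}
```

Because the target's view would be PROT_READ and backed by a memfd that only
the supervisor can write, nothing running in the target's address space
could change the bytes between the policy check and the kernel's fetch.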

And (see below) you still want a way to redirect syscall args such
that they get un-redirected on return.

>
> >
> > Also, for actual correct ABI compatibility, by the time the syscall's
> > caller is resumed, the original arguments except the return value
> > should be restored, because those registers are caller-saved at least
> > on x86.
>
> The current implementation already complies.

Of course it compiles.

But someone out there probably has code that does something like:

    void *param = foo;
    some_syscall(param);
    something_else(param);

On x86_64, param goes in RDI. Now psABI does *not* say that RDI is
preserved on return from some_syscall, so you *think* that the
compiler will reload RDI prior to calling something_else. But
syscalls don't obey psABI, and people love to inline them, and I bet
there are programs out there that used inline asm or a compiler that
doesn't target psABI (or a compiler that does but that, with
increasing use of LTO and such, can analyze some_syscall and determine
that it's inline asm inside) and they've set their constraints such
that RDI is *not* clobbered, and the generated code resembles:

    SYSCALL
    CALL something_else

or

    CALL some_syscall_wrapper
    CALL something_else

and it works! Or at least it works as long as syscall restart isn't
hitting one of its excessively weird cases here, which it usually
isn't. And then they run it under seccomp and it fails because now
RDI really is clobbered. And it's absolutely miserable to debug. And
it needs a kernel patch to fix because you don't have a clean,
performant fix in your seccomp user code.
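
To make the failure mode concrete, here is a minimal sketch of the kind of
wrapper I have in mind, assuming x86_64 Linux and GCC or Clang.
raw_syscall1(), something_else() and demo() are made-up names for
illustration, and the non-x86 branch exists only so the snippet builds
elsewhere.

```c
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical inlined one-argument syscall wrapper. */
static inline long raw_syscall1(long nr, long a0)
{
#if defined(__x86_64__)
    long ret;

    /*
     * The declared clobbers are RCX and R11 (which the SYSCALL
     * instruction itself overwrites) plus memory.  RDI is an
     * input-only operand, so the compiler is entitled to assume it
     * still holds a0 after the asm statement.
     */
    __asm__ volatile("syscall"
                     : "=a"(ret)
                     : "a"(nr), "D"(a0)
                     : "rcx", "r11", "memory");
    return ret;
#else
    /* Portability fallback only; returns -1 and sets errno on error. */
    return syscall(nr, a0);
#endif
}

/* Stand-in for whatever the program does with param next. */
static long something_else(long param)
{
    return param;
}

/*
 * After inlining, the compiler may leave param in RDI across the
 * SYSCALL and hand it straight to something_else() without a reload.
 */
static long demo(long param)
{
    long ret = raw_syscall1(SYS_close, param);

    return ret == 0 ? something_else(param) : -1;
}
```

Nothing in that code is wrong against a kernel that leaves RDI alone on
syscall return, which is exactly why a supervisor that changes it would be
so painful to track down.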

--Andy