Re: [RFC PATCH v2 0/3] seccomp: SECCOMP_IOCTL_NOTIF_INJECT for race-free unotify

From: Andy Lutomirski

Date: Tue May 26 2026 - 15:03:50 EST

On Thu, May 14, 2026 at 9:27 PM Cong Wang <xiyou.wangcong@xxxxxxxxx> wrote:
>
> From: Cong Wang <cwang@xxxxxxxxxxxxxx>
>
> This is a complete rework of v1 (PIN_ARGS), reshaped to address the
> review feedback that having every syscall-arg fetch site consult a
> per-task pin pointer is cross-cutting awareness that does not scale.
>
> v1 thread:
> https://lore.kernel.org/lkml/20260504011207.539408-1-xiyou.wangcong@xxxxxxxxx
>
> ## Changes since v1

Here are some thoughts:

>
> v2 inverts the model. The supervisor no longer pins args for a
> resumed syscall body to consume; it describes a substitute syscall
> (nr + args[6]) whose pointer-shaped args are encoded as byte offsets
> into a kernel-side buffer. On SECCOMP_USER_NOTIF_FLAG_INJECTED, the
> trapped task wakes inside seccomp_do_user_notification(), dispatches
> into a kernel-mode syscall helper (filp_open / kernel_bind /
> kernel_write for v1), and the helper's return value becomes the
> trapped syscall's return value. The trapped task's user mm is never
> re-read for the substituted syscall.

This sounds like it could be done well or it could be done poorly.
Doing it poorly sounds like it would resemble set_fs(), and set_fs()
was awful. Please don't reintroduce it or anything like it.

Doing it well sounds like introducing a bunch of new entrypoints. In
some sense this seems like a nice plan, except that essentially no
existing syscall works like that. So getting any sort of decent
coverage could involve extensive kernel changes and might involve
adding lots of new entrypoints that are only used for this new system,
which isn't great.
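
To make the quoted "byte offsets into a kernel-side buffer" encoding
concrete, here is a rough userspace sketch of what such a descriptor
and its bounds check might look like. The struct layout, the field
names (ptr_mask, buf_len), the 256-byte buffer, and inject_arg_ptr()
are all made up for illustration; none of them are taken from the
patches.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout; NOT the structure used by the patch series. */
struct inject_req {
	int32_t  nr;        /* substitute syscall number */
	uint64_t args[6];   /* scalar value, or byte offset into buf */
	uint8_t  ptr_mask;  /* bit i set: args[i] is an offset into buf */
	uint32_t buf_len;   /* number of valid bytes in buf */
	uint8_t  buf[256];  /* kernel-side copy of the pointed-to data */
};

/*
 * Resolve argument i to a pointer inside req->buf, requiring at least
 * `need` bytes to be available at that offset.  Returns NULL if the
 * argument is not pointer-shaped or the range leaves the buffer.
 */
static const void *inject_arg_ptr(const struct inject_req *req,
				  unsigned int i, size_t need)
{
	uint64_t off;

	if (i >= 6 || !(req->ptr_mask & (1u << i)))
		return NULL;
	if (req->buf_len > sizeof(req->buf))
		return NULL;

	off = req->args[i];
	/* Written to avoid overflow: off <= buf_len before subtracting. */
	if (off > req->buf_len || need > req->buf_len - off)
		return NULL;

	return req->buf + off;
}
```

The point of the sketch is only that every pointer-shaped argument
resolves into memory the supervisor filled in, so nothing is re-read
from the trapped task's mm. It says nothing about how each syscall
body would consume such a pointer, which is the hard part.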

> The full motivation, including the threat model (adversarial AI
> agents in the same address space) and the concrete user (Sandlock,
> https://github.com/multikernel/sandlock), is in the v1 cover letter
> above.

Whoa there. "adversarial AI agents" aren't a threat model that makes
sense in this context. I think you mean "multiple tasks, all running
untrusted code, potentially sharing an address space".

But here are some other thoughts:

> v1 injectable-syscall whitelist:
>
> - openat (filp_open + fd_install)
> - bind (sockfd_lookup + kernel_bind)
> - write (kernel_write)

How gnarly would an actual API for this be? By "actual API" I mean an
fd that represents complete control over a target task (which the
existing seccomp fd sort of is) and syscalls issued against that fd
that do openat, bind, read, write, etc.

Or... what if there was a nice way to create a pinned mapping (and
verify it via seccompfd), MAP_SHARED, PROT_READ, of a memfd that the
supervisor owns. Then the supervisor could write syscall args into it
and re-point pointers into it.
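
A rough sketch of the supervisor side of that idea, using only
memfd_create(), ftruncate() and mmap(). The names arg_window,
arg_window_open() and arg_window_put() are invented for this example.
The read-only view is mapped in the same process purely as a stand-in
for the mapping that would live in the target; the pinning and the
seccompfd verification step do not exist today and are not modelled.

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper type, for illustration only. */
struct arg_window {
	int         fd;   /* memfd owned by the supervisor */
	char       *rw;   /* supervisor's writable view */
	const char *ro;   /* PROT_READ view; stand-in for the target's mapping */
	size_t      len;
};

static int arg_window_open(struct arg_window *w, size_t len)
{
	void *rw, *ro;

	w->fd = memfd_create("unotify-args", MFD_CLOEXEC);
	if (w->fd < 0)
		return -1;
	if (ftruncate(w->fd, (off_t)len) < 0)
		goto fail;

	rw = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, w->fd, 0);
	if (rw == MAP_FAILED)
		goto fail;
	ro = mmap(NULL, len, PROT_READ, MAP_SHARED, w->fd, 0);
	if (ro == MAP_FAILED) {
		munmap(rw, len);
		goto fail;
	}

	w->rw = rw;
	w->ro = ro;
	w->len = len;
	return 0;
fail:
	close(w->fd);
	return -1;
}

/*
 * Copy a NUL-terminated string into the window at byte offset `off`
 * and return the address it has in the read-only view, i.e. the value
 * a re-pointed syscall argument would carry.  NULL if it does not fit.
 */
static const char *arg_window_put(struct arg_window *w, size_t off,
				  const char *s)
{
	size_t n = strlen(s) + 1;

	if (off > w->len || n > w->len - off)
		return NULL;
	memcpy(w->rw + off, s, n);
	return w->ro + off;
}
```

In a real design the read-only mapping would sit in the target's
address space, and the kernel would have to guarantee that the target
cannot unmap, remap, or mprotect it before the syscall consumes the
arguments.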

Also, for actual correct ABI compatibility, by the time the syscall's
caller is resumed, the original argument registers should be restored
and only the return value should differ, because the syscall ABI
preserves those registers across the call, at least on x86.
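
In toy form, the save/substitute/restore ordering would be something
like the following. struct sc_regs is a made-up stand-in for the
syscall-relevant part of x86-64 pt_regs (rax plus the six argument
registers rdi, rsi, rdx, r10, r8, r9); run_substitute() and
sum_handler() are likewise hypothetical and exist only for this
example.

```c
#include <stdint.h>

/* Hypothetical stand-in for the relevant part of x86-64 pt_regs. */
struct sc_regs {
	uint64_t ax;                      /* syscall nr on entry, return value on exit */
	uint64_t di, si, dx, r10, r8, r9; /* the six argument registers */
};

typedef int64_t (*sc_handler)(const struct sc_regs *);

/*
 * Run `fn` against the substitute register set, then put the original
 * registers back so that only the return value differs from what the
 * caller had before the syscall.
 */
static void run_substitute(struct sc_regs *regs,
			   const struct sc_regs *subst, sc_handler fn)
{
	struct sc_regs saved = *regs;  /* snapshot the caller-visible state */
	int64_t ret;

	*regs = *subst;                /* handler sees the substituted args */
	ret = fn(regs);

	*regs = saved;                 /* restore every argument register */
	regs->ax = (uint64_t)ret;      /* only the return value changes */
}

/* Toy handler used to exercise the function above. */
static int64_t sum_handler(const struct sc_regs *r)
{
	return (int64_t)(r->di + r->si);
}
```

With regs holding di=1, si=2 and a substitute holding di=10, si=20,
calling run_substitute(&regs, &subst, sum_handler) leaves di and si at
1 and 2 and sets ax to 30.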

--Andy