Re: [RFC PATCH v2 0/3] seccomp: SECCOMP_IOCTL_NOTIF_INJECT for race-free unotify

From: Cong Wang

Date: Thu May 28 2026 - 13:48:43 EST

Hi Andy,

On Tue, May 26, 2026 at 12:03 PM Andy Lutomirski <luto@xxxxxxxxxxxxxx> wrote:
>
> On Thu, May 14, 2026 at 9:27 PM Cong Wang <xiyou.wangcong@xxxxxxxxx> wrote:
> >
> > From: Cong Wang <cwang@xxxxxxxxxxxxxx>
> >
> > This is a complete rework of v1 (PIN_ARGS), reshaped to address the
> > review feedback that having every syscall-arg fetch site consult a
> > per-task pin pointer is cross-cutting awareness that does not scale.
> >
> > v1 thread:
> > https://lore.kernel.org/lkml/20260504011207.539408-1-xiyou.wangcong@xxxxxxxxx
> >
> > ## Changes since v1
>
> Here are some thoughts:
>
> >
> > v2 inverts the model. The supervisor no longer pins args for a
> > resumed syscall body to consume; it describes a substitute syscall
> > (nr + args[6]) whose pointer-shaped args are encoded as byte offsets
> > into a kernel-side buffer. On SECCOMP_USER_NOTIF_FLAG_INJECTED, the
> > trapped task wakes inside seccomp_do_user_notification(), dispatches
> > into a kernel-mode syscall helper (filp_open / kernel_bind /
> > kernel_write for v1), and the helper's return value becomes the
> > trapped syscall's return value. The trapped task's user mm is never
> > re-read for the substituted syscall.
>
> This sounds like it could be done well or it could be done poorly.
> Doing it poorly sounds like it would resemble set_fs(), and set_fs()
> was awful. Please don't reintroduce it or anything like it.

Noted.

For v2, the injectors call filp_open(), kernel_bind() and
kernel_write(): kernel-pointer entrypoints that exist specifically so
that kernel callers don't have to spoof user space.
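
To make the offset encoding from the cover letter concrete, here is a
rough userspace sketch of how a pointer-shaped argument could be
resolved against the kernel-side buffer before it is handed to one of
those helpers. The struct layout, the ptr_mask field and the name
inject_resolve_str() are made up for illustration and are not the uAPI
from the patch; only the idea (nr + args[6], pointers expressed as
bounds-checked byte offsets) comes from the series.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define INJECT_NARGS 6

/* Hypothetical layout, for illustration only; not the ABI from the patch. */
struct inject_desc {
    int         nr;                  /* substitute syscall number */
    uint64_t    args[INJECT_NARGS];  /* scalar value, or byte offset into buf */
    uint32_t    ptr_mask;            /* bit i set: args[i] is an offset */
    uint32_t    buf_len;             /* number of valid bytes in buf */
    const char *buf;                 /* copy that the target cannot modify */
};

static_assert(INJECT_NARGS <= 32, "ptr_mask holds one bit per argument");

/*
 * Resolve argument idx as a NUL-terminated string inside the buffer.
 * Returns 0 and sets *out on success, -EINVAL if the argument is not
 * marked as an offset, -EFAULT if it does not fit inside the buffer.
 */
static int inject_resolve_str(const struct inject_desc *d, unsigned int idx,
                              const char **out)
{
    uint64_t off;

    if (idx >= INJECT_NARGS || !(d->ptr_mask & (1u << idx)))
        return -EINVAL;

    off = d->args[idx];
    if (off >= d->buf_len)
        return -EFAULT;

    /* The string must terminate before the end of the buffer. */
    if (!memchr(d->buf + off, '\0', d->buf_len - off))
        return -EFAULT;

    *out = d->buf + off;
    return 0;
}
```

The point of the sketch is that the resolved pointer always lands
inside a buffer the trapped task cannot touch, so there is nothing
left for a sibling thread to race against.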

>
> Doing it well sounds like introducing a bunch of new entrypoints. In
> some sense this seems like a nice plan, except that essentially every
> syscall doesn't work like that. So getting any sort of decent
> coverage could involve extensive kernel changes and might involve
> adding lots of new entrypoints that are only used for this new system,
> which isn't great.

This is a valid concern.

Currently, we only have 3 entrypoints, all of which have pre-existing
kernel-side APIs. In the future we may need to extend the list, for
example for execve().

However, the number of syscalls we inject is still very small compared
with the total number of syscalls on Linux. I don't anticipate this list
growing beyond 10, since most syscalls a supervisor intercepts don't
have TOCTOU issues at all.

>
> > The full motivation, including the threat model (adversarial AI
> > agents in the same address space) and the concrete user (Sandlock,
> > https://github.com/multikernel/sandlock), is in the v1 cover letter
> > above.
>
> Whoa there. "adversarial AI agents" aren't a threat model that makes
> sense in this context. I think you mean "multiple tasks, all running
> untrusted code, potentially sharing an address space".

Right, I will update the wording.

>
>
> But here are some other thoughts:
>
> > v1 injectable-syscall whitelist:
> >
> > - openat (filp_open + fd_install)
> > - bind (sockfd_lookup + kernel_bind)
> > - write (kernel_write)
>
> How gnarly would an actual API for this be? By "actual API" I mean an
> fd that represents complete control over a target task (which the
> existing seccomp fd sort of is) and syscalls issued against that fd
> that do openat, bind, read, write, etc.

Excellent suggestion! How about the following API?

ioctl(lfd, SECCOMP_IOCTL_NOTIF_RECV, &req);
/* req.data.nr == __NR_openat; args[1] is target's pointer. */

read_target_string(req.pid, req.data.args[1], path, sizeof(path));

if (policy_allows(path)) {
    struct seccomp_notif_target_call call = {
        .id = req.id,
        .nr = __NR_openat,
        .args = { AT_FDCWD, (uintptr_t)path, O_RDONLY, 0, 0, 0 },
    };
    ioctl(lfd, SECCOMP_IOCTL_NOTIF_TARGET_CALL, &call);

    struct seccomp_notif_resp resp = {
        .id = req.id,
        .val = call.ret >= 0 ? call.ret : 0,
        .error = call.ret >= 0 ? 0 : (int)call.ret,
    };
    ioctl(lfd, SECCOMP_IOCTL_NOTIF_SEND, &resp);
} else {
    struct seccomp_notif_resp resp = {
        .id = req.id, .error = -EACCES,
    };
    ioctl(lfd, SECCOMP_IOCTL_NOTIF_SEND, &resp);
}
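
read_target_string() above is a supervisor-side helper, not an existing
API. One possible shape for it, reading through /proc/<pid>/mem the way
the seccomp_unotify(2) example does, is sketched below. The signature
and the error convention are my assumptions. The important property is
that the target's memory is read exactly once, into the supervisor's
own buffer, and that copy is both what the policy checks and what
TARGET_CALL would consume. A real supervisor should also re-check the
notification with SECCOMP_IOCTL_NOTIF_ID_VALID after the read, since
the pid may have been recycled.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: copy a NUL-terminated string from the target's
 * address space into buf. Returns the string length on success or a
 * negative errno value. Assumes a 64-bit off_t.
 */
static ssize_t read_target_string(pid_t pid, uint64_t addr,
                                  char *buf, size_t size)
{
    char mempath[64];
    ssize_t n;
    int fd;

    if (size == 0)
        return -EINVAL;

    snprintf(mempath, sizeof(mempath), "/proc/%d/mem", (int)pid);
    fd = open(mempath, O_RDONLY | O_CLOEXEC);
    if (fd < 0)
        return -errno;

    /* A short read is fine: the string may end right before an unmapped page. */
    n = pread(fd, buf, size - 1, (off_t)addr);
    if (n < 0) {
        int err = errno;

        close(fd);
        return -err;
    }
    close(fd);

    buf[n] = '\0';
    /* No NUL among the bytes actually read: the string does not fit in buf. */
    if (memchr(buf, '\0', (size_t)n) == NULL)
        return -ENAMETOOLONG;

    return (ssize_t)strlen(buf);
}
```

A negative return would simply be turned into a SECCOMP_IOCTL_NOTIF_SEND
response carrying that error, without ever issuing the target call.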

>
> Or... what if there was a nice way to create a pinned mapping (and
> verify it via seccompfd), MAP_SHARED, PROT_READ, of a memfd that the
> supervisor owns. Then the supervisor could write syscall args into it
> and re-point pointers into it.

One concrete deployment constraint worth surfacing: Sandlock (like
similar wrappers: Firejail, Bubblewrap-style sandboxes) works by
fork+execve of arbitrary target binaries. The pinned-memfd
approach needs the seal installed in a trusted window, but
execve() replaces the address space, so anything mapped pre-exec
is lost. The window between execve() and the first instruction of
the untrusted binary belongs to the dynamic loader (or to nothing
at all for static binaries), not to the supervisor.

>
> Also, for actual correct ABI compatibility, by the time the syscall's
> caller is resumed, the original arguments except the return value
> should be restored, because those registers are caller-saved at least
> on x86.

The current implementation already complies.

Thanks!