Re: [RFC PATCH v2 0/3] seccomp: SECCOMP_IOCTL_NOTIF_INJECT for race-free unotify
From: Cong Wang
Date: Fri May 29 2026 - 01:07:55 EST
On Thu, May 28, 2026 at 11:15 AM Andy Lutomirski <luto@xxxxxxxxxxxxxx> wrote:
>
> On Thu, May 28, 2026 at 10:42 AM Cong Wang <xiyou.wangcong@xxxxxxxxx> wrote:
> >
> > Hi Andy,
> >
> > On Tue, May 26, 2026 at 12:03 PM Andy Lutomirski <luto@xxxxxxxxxxxxxx> wrote:
> > >
> > > Or... what if there was a nice way to create a pinned mapping (and
> > > verify it via seccompfd), MAP_SHARED, PROT_READ, of a memfd that the
> > > supervisor owns. Then the supervisor could write syscall args into it
> > > and re-point pointers into it.
> >
> > One concrete deployment constraint worth surfacing: Sandlock (and
> > similar wrappers: Firejail, Bubblewrap-style sandboxes) work by
> > fork+execve of arbitrary target binaries. The pinned-memfd
> > approach needs the seal installed in a trusted window, but
> > execve() replaces the address space, so anything mapped pre-exec
> > is lost. The window between execve and the first instruction of
> > the untrusted binary belongs to the dynamic loader (or to nothing
> > at all for static binaries), not to the supervisor.
>
> I don't think this matters. A good implementation would have the
> seccomp ioctl interface (or a new syscall or whatever) be able to set
> up the new pinned mapping without any particular cooperation from the
> target process. So the process would start, and it would run freely
> until its first syscall, and then you would install the pinned region.
> And you would get notified on execve (ideally via a new notification
> telling you, specifically, that the address space got cleared) so that
> you know that the pinned region is gone (and that no other threads are
> running concurrently in that address space!).
You are right. I thought this non-cooperative pinned memfd would be hard
to implement, but it turns out to be much easier than I expected.

Please let me know your thoughts on the following design:

    /* 1. Supervisor receives a trap. */
    ioctl(listener_fd, SECCOMP_IOCTL_NOTIF_RECV, &req);

    /* 2. Install a sealed pin in the trapped task's mm. */
    struct seccomp_notif_pin_install pin = {
        .id          = req.id,
        .memfd       = my_memfd,
        .target_addr = PIN_ADDR,
        .size        = PIN_SIZE,
    };
    ioctl(listener_fd, SECCOMP_IOCTL_NOTIF_PIN_INSTALL, &pin);

    /* 3. Write the substitute arg into the pin via our own memfd view. */
    strcpy(sup_view, "/dev/null");

    /* 4. Redirect args[1] into the pin and resume the syscall. */
    struct seccomp_notif_resp_redirect redir = {
        .id        = req.id,
        .flags     = SECCOMP_REDIRECT_FLAG_CONTINUE,
        .args_mask = 1U << 1,
        .ptr_mask  = 1U << 1,
        .args      = { 0, PIN_ADDR, 0, 0, 0, 0 },
    };
    ioctl(listener_fd, SECCOMP_IOCTL_NOTIF_SEND_REDIRECT, &redir);

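
To illustrate step 3, here is a rough userspace-only sketch of two views
of one memfd. It does not use the proposed ioctls: the second, read-only
mapping merely stands in for the pin that PIN_INSTALL would place in the
target's mm. struct pin_views, pin_views_create(), pin_views_destroy() and
the PIN_SIZE value are names made up for this example, not part of the
patch set.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define PIN_SIZE 4096 /* hypothetical pin size for this sketch: one page */

/* Two views of one memfd (illustrative layout, not from the patch set). */
struct pin_views {
	int memfd;
	char *sup_view;          /* read-write: the supervisor's own view */
	const char *target_view; /* read-only: models the pinned mapping */
};

/* Create the memfd and map it twice. Returns 0 on success, -1 on failure. */
static int pin_views_create(struct pin_views *pv)
{
	void *rw, *ro;

	assert(pv != NULL);

	pv->memfd = memfd_create("pin", MFD_CLOEXEC);
	if (pv->memfd < 0)
		return -1;
	if (ftruncate(pv->memfd, PIN_SIZE) < 0)
		goto err_close;

	rw = mmap(NULL, PIN_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED,
		  pv->memfd, 0);
	if (rw == MAP_FAILED)
		goto err_close;

	/* In the real design this mapping would live in the target's mm. */
	ro = mmap(NULL, PIN_SIZE, PROT_READ, MAP_SHARED, pv->memfd, 0);
	if (ro == MAP_FAILED) {
		munmap(rw, PIN_SIZE);
		goto err_close;
	}

	pv->sup_view = rw;
	pv->target_view = ro;
	return 0;

err_close:
	close(pv->memfd);
	return -1;
}

/* Unmap both views and close the memfd. */
static void pin_views_destroy(struct pin_views *pv)
{
	munmap(pv->sup_view, PIN_SIZE);
	munmap((void *)pv->target_view, PIN_SIZE);
	close(pv->memfd);
}
```

A string written through sup_view is readable through target_view right
away, because both are MAP_SHARED mappings of the same memfd, while the
read-only side cannot modify it.
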
>
> And (see below) you still want a way to redirect syscall args such
> that they get un-redirected on return.
>
> >
> > >
> > > Also, for actual correct ABI compatibility, by the time the syscall's
> > > caller is resumed, the original arguments except the return value
> > > should be restored, because those registers are caller-saved at least
> > > on x86.
> >
> > The current implementation already complies.
>
> Of course it compiles.
>
> But someone out there probably has code that does something like:
>
> void *param = foo;
> some_syscall(param);
> something_else(param);
>
> On x86_64, param goes in RDI. Now psABI does *not* say that RDI is
> preserved on return from some_syscall, so you *think* that the
> compiler will reload RDI prior to calling something_else. But
> syscalls don't obey psABI, and people love to inline them, and I bet
> there are programs out there that used inline asm or a compiler that
> doesn't target psABI (or a compiler that does but that, with
> increasing use of LTO and such, can analyze some_syscall and determine
> that it's inline asm inside) and they've set their constraints such
> that RDI is *not* clobbered, and the generated code resembles:
The pinned-memfd design above handles this directly. SEND_REDIRECT
saves the trapped task's original arg registers into the knotif before
calling syscall_set_arguments() with the supervisor's substituted values,
and queues a task_work via task_work_add(TWA_RESUME). The callback
fires at the user-mode boundary in syscall_exit_to_user_mode_work,
before control returns to userspace, and rewrites the masked positions
back to the saved originals via syscall_set_arguments(). The caller
therefore observes its original register contents on resume.
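
The save/substitute/restore sequence can be modelled in plain userspace C
as below. This is only a sketch of the logic under my own assumptions,
not the kernel code: struct redirect_state, redirect_apply() and
redirect_restore() are hypothetical names, and a plain array stands in
for the task's argument registers.

```c
#include <assert.h>
#include <stdint.h>

#define NR_SYSCALL_ARGS 6

/* Per-notification state (hypothetical layout for this model). */
struct redirect_state {
	uint64_t saved[NR_SYSCALL_ARGS]; /* original argument registers */
	uint32_t args_mask;              /* positions that were substituted */
};

/* Models SEND_REDIRECT: save all originals, then substitute masked args. */
static void redirect_apply(struct redirect_state *st,
			   uint64_t regs[NR_SYSCALL_ARGS],
			   const uint64_t subst[NR_SYSCALL_ARGS],
			   uint32_t args_mask)
{
	/* Only bits 0..5 name valid argument positions. */
	assert((args_mask >> NR_SYSCALL_ARGS) == 0);

	st->args_mask = args_mask;
	for (int i = 0; i < NR_SYSCALL_ARGS; i++) {
		st->saved[i] = regs[i];
		if (args_mask & (1U << i))
			regs[i] = subst[i];
	}
}

/* Models the task_work callback: put back the masked positions only. */
static void redirect_restore(struct redirect_state *st,
			     uint64_t regs[NR_SYSCALL_ARGS])
{
	for (int i = 0; i < NR_SYSCALL_ARGS; i++) {
		if (st->args_mask & (1U << i))
			regs[i] = st->saved[i];
	}
	st->args_mask = 0;
}
```

With args_mask = 1U << 1, only regs[1] is swapped for the duration of the
syscall and then returned to its original value; every other position is
left untouched in both steps.
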
Thanks!