Re: [RFC PATCH v3 2/3] seccomp: add kernel-installed pinned-memfd redirect

From: Andy Lutomirski

Date: Tue Jun 23 2026 - 15:04:48 EST

On Sat, Jun 20, 2026 at 2:12 PM Cong Wang <xiyou.wangcong@xxxxxxxxx> wrote:
>
> On Fri, Jun 12, 2026 at 9:03 PM Andy Lutomirski <luto@xxxxxxxxxxxxxx> wrote:
> >
> > On Fri, Jun 12, 2026 at 5:16 PM Cong Wang <xiyou.wangcong@xxxxxxxxx> wrote:
> > >
> > > Two new ioctls are introduced:
> > >
> > > SECCOMP_IOCTL_NOTIF_PIN_INSTALL
> > >
> > > Supervisor names an active notification id, a memfd it owns,
> > > and a target address+size. Kernel grabs the trapped task's
> > > mm via get_task_mm(), calls vm_mmap_pgoff_to_mm() with
> > > MAP_FIXED | MAP_SHARED, PROT_READ, and VM_SEALED in
> > > extra_vm_flags. On success the VMA is installed in the
> > > target's mm, immediately sealed against munmap/mremap/
> > > mprotect/MAP_FIXED-stomp from the target itself and any
> > > CLONE_VM peer. The range is recorded on the listener filter
> > > for SEND_REDIRECT validation.
> > >
> >
> > I haven't read the code, but I think this at least conceptually makes
> > a decent amount of sense. But...
> >
> > > SECCOMP_IOCTL_NOTIF_SEND_REDIRECT
> > >
> > > Resumes the trapped syscall (like FLAG_CONTINUE) with
> > > arg-register substitution. The supervisor supplies an
> > > args_mask (which arg registers to replace), a ptr_mask
> > > (which of those are pointers, validated to fall inside an
> > > installed pin) and replacement values. The kernel saves
> > > the trapped task's original arg registers into a small
> > > heap record, writes substituted values via
> > > syscall_set_arguments(), and queues a task_work callback
> > > that fires at user-mode return after the syscall completes
> > > to restore the original registers. This preserves the
> > > caller-saved arg-register ABI invariant for callers that
> > > expected register contents to survive across the syscall
> > > (compilers under LTO, inline-asm syscall wrappers, anything
> > > that doesn't strictly follow psABI).
> >
> > Here there be dragons, and I kind of alluded to some of those dragons
> > in my recent message about STRICT, but let's be more thorough.
> >
> > I'm going to totally ignore the implementation for now (which I think
> > has a memory leak, but whatever -- this is solvable, at least in
>
> Yes, let's get the design correct before digging into any detail.
>
> > principle). Conceptually, SEND_REDIRECT is handling a seccomp action
> > by doing a syscall that may be different from the originally requested
> > syscall. And we have a whole host of potential issues, some related
> > to security and some related to functionality.
> >
> > Let's do the functionality ones first: what happens if a signal
> > happens? In the simplest cases (signal completely ignored, task
> > killed (there's the memory leak), or -EINTR), I think we're mostly
> > okay. But in the case where the syscall needs to restart or, worse,
> > use one of the fancy restart techniques, what should happen? I think
> > that even defining semantics is somewhat nontrivial, and I'm a bit
> > concerned that the user notifier would need to actually be aware of
> > signals. Yuck.
>
> Good catch! I completely missed the signal case, how about restoring
> at syscall-exit, before signal/restart processing?

I think that handling this nicely is extremely complex.

One could imagine a fictional universe where Linux works like this:

void user_asked_for_a_syscall(nr, args, etc)
{
	do preprocessing;

	handle seccomp;

	ret = actually do the syscall;

	if (ret says a signal happened) {
		do the horrid magical signal fixup;
	}
}

But Linux doesn't, and cannot quite, work this way. On a syscall
entry, first there's seccomp. Then, after seccomp *returns* (and even
the function that called it returns!), we make it to the actual
syscall. Then we make it to later code that notices a signal and
deals with restart, and x86 even has two copies of this (handle_signal
and arch_do_signal_or_restart, both in the same file), and they work
differently.

Now, on the bright side, the actual semantics are sort of all in the
syscall itself and its return value, along with (x86-specific) orig_ax
(see the orig_ax = -1 in restore_sigcontext). So maybe one could
rearrange the syscall code so that seccomp can actually see the return
value. And this might actually be an excellent idea, although it
would need to be done with quite a bit of care.

And this whole mess is kind of necessary: it's possible for user code
to do a syscall, get a signal, and have a handler for that signal. So
the kernel will rewrite the return state so it returns to the handler.
Then the handler returns, via sigreturn, and sigreturn needs to be
able to *resume a potentially interrupted syscall*. So the kernel
needs to set up the stack so that this happens. There is no actual
guarantee that any of this matches up correctly -- what if the user
does usermode threading and resumes on a different thread?
Regrettably, the kernel has the restartblock mechanism and does keep
some limited state, and this sucks and has nasty corner cases, and I
really don't think we want to expose this to seccomp.

One solution is to declare that, for now, we will only allow one user
notifier in the stack, or at least only one that declares its intention
to use the redirect feature. Even then, having to fix up registers
after a redirected syscall is a mess, but at least *that* mess can be
handled in a one-deep sense by making sure that we do the fixup after
precisely the one syscall we issued (which is roughly what your patch
does).
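
To be concrete about what I mean by "one-deep", here is a toy userspace
model of the save / substitute / restore dance. This is emphatically not
the patch's code: toy_regs, toy_saved, toy_redirect and toy_restore are
names I made up, and only the args_mask idea comes from the patch
description. The point is just that there is exactly one saved record
and it is consumed exactly once.

```c
#include <stdint.h>
#include <string.h>

#define TOY_NARGS 6

/* Hypothetical stand-in for the six syscall argument registers. */
struct toy_regs {
	uint64_t args[TOY_NARGS];
};

/* Hypothetical one-deep save slot: at most one redirect in flight. */
struct toy_saved {
	uint64_t args[TOY_NARGS];
	int active;
};

/*
 * Save every original argument, then overwrite only the ones whose
 * bit is set in args_mask with the supervisor's replacement values.
 */
static void toy_redirect(struct toy_regs *regs, struct toy_saved *saved,
			 uint8_t args_mask, const uint64_t *repl)
{
	memcpy(saved->args, regs->args, sizeof(saved->args));
	saved->active = 1;

	for (int i = 0; i < TOY_NARGS; i++) {
		if (args_mask & (1u << i))
			regs->args[i] = repl[i];
	}
}

/*
 * Put the original arguments back after the one redirected syscall.
 * A second call is a no-op, so the fixup cannot be applied twice.
 */
static void toy_restore(struct toy_regs *regs, struct toy_saved *saved)
{
	if (!saved->active)
		return;

	memcpy(regs->args, saved->args, sizeof(regs->args));
	saved->active = 0;
}
```

With args_mask = 0x5, toy_redirect() replaces args 0 and 2 and leaves
the rest alone, and toy_restore() brings all six back. None of this
says anything about when the restore should run relative to signal
delivery, which is the hard part.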

>
> >
> > Now security: right now we have this rule:
> >
> > /*
> > * All BPF programs must return a 32-bit value.
> > * The bottom 16-bits are for optional return data.
> > * The upper 16-bits are ordered from least permissive values to most,
> > * as a signed value (so 0x8000000 is negative).
> > *
> > * The ordering ensures that a min_t() over composed return values always
> > * selects the least permissive choice.
> > */
> > #define SECCOMP_RET_KILL_PROCESS 0x80000000U /* kill the process */
> > #define SECCOMP_RET_KILL_THREAD 0x00000000U /* kill the thread */
> > #define SECCOMP_RET_KILL SECCOMP_RET_KILL_THREAD
> > #define SECCOMP_RET_TRAP 0x00030000U /* disallow and force a SIGSYS */
> > #define SECCOMP_RET_ERRNO 0x00050000U /* returns an errno */
> > #define SECCOMP_RET_USER_NOTIF 0x7fc00000U /* notifies userspace */
> > #define SECCOMP_RET_TRACE 0x7ff00000U /* pass to a tracer or disallow */
> > #define SECCOMP_RET_LOG 0x7ffc0000U /* allow after logging */
> > #define SECCOMP_RET_ALLOW 0x7fff0000U /* allow */
> >
> > This has always bothered me. In the absence of USER_NOTIF and TRACE,
> > fine, I guess -- we're choosing the least permissive, and this doesn't
> > seem too crazy. But if we do anything fancy (like this patch series),
> > I think this becomes wrong. (And I kind of think I said something
> > along these lines many years ago.)
> >
> > Before this series (in current kernels), one can do syscall emulation
> > using SECCOMP_RET_TRACE, and it kind of works as long as no filter in
> > the stack tries to block the original syscall *or* the rewritten
> > syscall, because syscalls issued by using ptrace to redirect the
> > traced process go through seccomp again. It's a total mess, it can't
> > handle complex cases, but it's at least approximately secure.
> >
> > With this series, I think it's all busted. Suppose I make a container
> > and block everyone's favorite unshare, generating SECCOMP_RET_ERRNO.
> > (This is the default "docker" (actually moby I think) policy.) Then,
> > inside the container, I write a program that installs a filter that
> > sends syscall 12345 to USER_NOTIF. Then I fork and my child does
> > syscall 12345. I handle USER_NOTIF by using the new redirect feature
> > to redirect to unshare(). And unshare() gets called.
> >
> > IMHO what *should* happen is that we actually keep track of where we
> > are in the seccomp filter stack. We start from the innermost filter
> > (most recently applied) and start running the filters. And then we do
> > something that actually makes sense based on the result. For example:
> >
> > KILL: Kill it. Do not run more filters. (I suppose we could see if
> > an outer filter promotes from KILL_THREAD to KILL_PROCESS, but this
> > doesn't seem helpful.)
> > TRAP: Generate the signal. Do *not* run more filters. Sure, this
> > can allow a contained program to generate a SIGSYS instead of getting
> > killed if it tries some blocked syscall that the outer filter wants to
> > KILL. So what? I actually think this is better behavior -- the
> > combination of the program and inner filter is not actually doing the
> > syscall.
> >
> > ERRNO: Same deal -- replace the syscall with a return of the specified
> > value. Don't call more filters.
> >
> > TRACE: Similar.
> >
> > ALLOW: Call the next filter.
> >
> > USER_NOTIF: Stop calling filters and remember where we are in the
> > filter chain. Call out to the user notifier *associated with this
> > filter*. When the user notifier responds, if the notifier asks for a
> > redirect or to resume the syscall, then continue calling filters *on
> > the new syscall*.
> >
> > Looking at my example above, the effect would be that the inner filter
> > gets a notifier event for syscall 12345 and redirects to unshare.
> > Then the outer filter sees unshare. It can ERRNO to cause unshare to
> > return an error, or it can do its own USER_NOTIF to do something fancy
> > with unshare, or it can KILL, etc.
> >
> >
> > This may be enough of a scary departure that we will want each filter
> > to opt in to the new behavior for filters applied later. Or maybe
> > everyone can get comfortable enough with it to just switch over. Or
> > maybe there's another solution. Or maybe someone can try to convince
> > me that the existing behavior makes sense if syscalls can be
> > redirected (maybe call the whole chain on the redirected syscall? Even
> > defining that gets a little messy.)
>
> Thanks for the detailed analysis!
>
> How about keeping min and re-run only the outer suffix after a redirect?
>
> I think this is the safest option. I agree your suggestion of removing min
> is more elegant, but it also brings risks of breaking existing filtering logic.

I'm really not convinced that the min is needed to preserve any useful
behavior. But Kees is very conservative about these things, with good
reason.

I'm also not sure what happens if we do a redirect and then discover
that the outer rules trigger a user notifier. We need *some*
semantics in this case.
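
For the sake of discussion, here is a toy userspace model of "walk the
stack from the innermost filter, and after a redirect run only the outer
suffix on the new syscall". It is a sketch under assumptions, not a
proposal for the actual code: every toy_* name and the TOY_NR_* numbers
are made up, and I arbitrarily picked one answer to the question above
(an outer filter that has its own notifier simply gets to run it on the
redirected syscall).

```c
#include <assert.h>
#include <stddef.h>

/* Toy actions; only the names echo the real SECCOMP_RET_* values. */
enum toy_action { TOY_ALLOW, TOY_ERRNO, TOY_USER_NOTIF, TOY_KILL };

/* Arbitrary stand-in syscall numbers, not real ones. */
#define TOY_NR_UNSHARE 1
#define TOY_NR_MAGIC   12345

typedef enum toy_action (*toy_filter_fn)(int nr);
/* Hypothetical notifier: returns the (possibly redirected) syscall nr. */
typedef int (*toy_notifier_fn)(int nr);

struct toy_result {
	enum toy_action action;	/* final decision */
	size_t index;		/* deciding filter; nfilters if all allowed */
	int nr;			/* syscall number judged at that point */
};

/*
 * filters[0] is the innermost (most recently installed) filter.
 * ALLOW moves on to the next outer filter.  USER_NOTIF from a filter
 * that has a notifier lets that notifier pick a new nr, and then only
 * filters i + 1 onward see it.  Anything else ends the walk.
 */
static struct toy_result toy_run(const toy_filter_fn *filters,
				 const toy_notifier_fn *notifiers,
				 size_t nfilters, size_t start, int nr)
{
	assert(start <= nfilters);

	for (size_t i = start; i < nfilters; i++) {
		enum toy_action act = filters[i](nr);

		if (act == TOY_ALLOW)
			continue;
		if (act == TOY_USER_NOTIF && notifiers[i]) {
			nr = notifiers[i](nr);	/* redirect or resume */
			continue;		/* outer suffix only */
		}
		return (struct toy_result){ act, i, nr };
	}
	return (struct toy_result){ TOY_ALLOW, nfilters, nr };
}

/* Inner filter: hand the magic syscall to its notifier. */
static enum toy_action toy_inner_filter(int nr)
{
	return nr == TOY_NR_MAGIC ? TOY_USER_NOTIF : TOY_ALLOW;
}

/* Outer (container) filter: refuse unshare with an errno. */
static enum toy_action toy_outer_filter(int nr)
{
	return nr == TOY_NR_UNSHARE ? TOY_ERRNO : TOY_ALLOW;
}

/* Inner notifier: redirect whatever it gets to unshare. */
static int toy_inner_notifier(int nr)
{
	(void)nr;
	return TOY_NR_UNSHARE;
}
```

Running toy_run() over { toy_inner_filter, toy_outer_filter } with
notifiers { toy_inner_notifier, NULL } on TOY_NR_MAGIC ends with the
outer filter returning TOY_ERRNO for the redirected unshare, which is
the outcome I want for the docker example. If the outer filter has no
notifier slot in this model, a USER_NOTIF from it is just reported
back, so the model does not answer the nested-notifier question; it
only makes it visible.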

--Andy