Re: [RFC PATCH v3 2/3] seccomp: add kernel-installed pinned-memfd redirect

From: Cong Wang

Date: Tue Jun 23 2026 - 19:26:36 EST

On Tue, Jun 23, 2026 at 12:02 PM Andy Lutomirski <luto@xxxxxxxxxxxxxx> wrote:
>
> On Sat, Jun 20, 2026 at 2:12 PM Cong Wang <xiyou.wangcong@xxxxxxxxx> wrote:
> >
> > On Fri, Jun 12, 2026 at 9:03 PM Andy Lutomirski <luto@xxxxxxxxxxxxxx> wrote:
> > >
> > > On Fri, Jun 12, 2026 at 5:16 PM Cong Wang <xiyou.wangcong@xxxxxxxxx> wrote:
> > > >
> > > > Two new ioctls are introduced:
> > > >
> > > > SECCOMP_IOCTL_NOTIF_PIN_INSTALL
> > > >
> > > > Supervisor names an active notification id, a memfd it owns,
> > > > and a target address+size. Kernel grabs the trapped task's
> > > > mm via get_task_mm(), calls vm_mmap_pgoff_to_mm() with
> > > > MAP_FIXED | MAP_SHARED, PROT_READ, and VM_SEALED in
> > > > extra_vm_flags. On success the VMA is installed in the
> > > > target's mm, immediately sealed against munmap/mremap/
> > > > mprotect/MAP_FIXED-stomp from the target itself and any
> > > > CLONE_VM peer. The range is recorded on the listener filter
> > > > for SEND_REDIRECT validation.
> > > >
> > >
> > > I haven't read the code, but I think this at least conceptually makes
> > > a decent amount of sense. But...
> > >
> > > > SECCOMP_IOCTL_NOTIF_SEND_REDIRECT
> > > >
> > > > Resumes the trapped syscall (like FLAG_CONTINUE) with
> > > > arg-register substitution. The supervisor supplies an
> > > > args_mask (which arg registers to replace), a ptr_mask
> > > > (which of those are pointers, validated to fall inside an
> > > > installed pin) and replacement values. The kernel saves
> > > > the trapped task's original arg registers into a small
> > > > heap record, writes substituted values via
> > > > syscall_set_arguments(), and queues a task_work callback
> > > > that fires at user-mode return after the syscall completes
> > > > to restore the original registers. This preserves the
> > > > caller-saved arg-register ABI invariant for callers that
> > > > expected register contents to survive across the syscall
> > > > (compilers under LTO, inline-asm syscall wrappers, anything
> > > > that doesn't strictly follow psABI).
> > >
> > > Here there be dragons, and I kind of alluded to some of those dragons
> > > in my recent message about STRICT, but let's be more thorough.
> > >
> > > I'm going to totally ignore the implementation for now (which I think
> > > has a memory leak, but whatever -- this is solvable, at least in
> >
> > Yes, let's get the design correct before digging into any detail.
> >
> > > principle). Conceptually, SEND_REDIRECT is handling a seccomp action
> > > by doing a syscall that may be different from the originally requested
> > > syscall. And we have a whole host of potential issues, some related
> > > to security and some related to functionality.
> > >
> > > Let's do the functionality ones first: what happens if a signal
> > > happens? In the simplest cases (signal completely ignored, task
> > > killed (there's the memory leak), or -EINTR), I think we're mostly
> > > okay. But in the case where the syscall needs to restart or, worse,
> > > use one of the fancy restart techniques, what should happen? I think
> > > that even defining semantics is somewhat nontrivial, and I'm a bit
> > > concerned that the user notifier would need to actually be aware of
> > > signals. Yuck.
> >
> > Good catch! I completely missed the signal case. How about restoring
> > at syscall-exit, before signal/restart processing?
>
> I think that handling this nicely is extremely complex.
>
> One could imagine a fictional universe where Linux works like this:
>
> void user_asked_for_a_syscall(nr, args, etc)
> {
>         do preprocessing;
>
>         handle seccomp;
>
>         ret = actually do the syscall;
>
>         if (ret says a signal happened) {
>                 do the horrid magical signal fixup;
>         }
> }
>
> But Linux doesn't, and cannot quite, work this way. On a syscall
> entry, first there's seccomp. Then, after seccomp *returns* (and even
> the function that called it returns!), we make it to the actual
> syscall. Then we make it to later code that notices a signal and
> deals with restart, and x86 even has two copies of this (handle_signal
> and arch_do_signal_or_restart, both in the same file), and they work
> differently.
>
> Now, on the bright side, the actual semantics are sort of all in the
> syscall itself and its return value, along with (x86-specific) orig_ax
> (see the orig_ax = -1 in restore_sigcontext). So maybe one could
> rearrange the syscall code so that seccomp can actually see the return
> value. And this might actually be an excellent idea, although it
> would need to be done with quite a bit of care.

I think it is a very good idea for the long term. Since it touches all
the architectures, it also requires much more effort, so I prefer to defer
it to the future. (Fortunately, it does not require an ABI change.)

For the short term, I think doing the fixup via task_work_add(TWA_SIGNAL)
is still the best option on the table.

>
> And this whole mess is kind of necessary: it's possible for user code
> to do a syscall, get a signal, and have a handler for that signal. So
> the kernel will rewrite the return state so it returns to the handler.
> Then the handler returns, via sigreturn, and sigreturn needs to be
> able to *resume a potentially interrupted syscall*. So the kernel
> needs to set up the stack so that this happens. There is no actual
> guarantee that any of this matches up correctly -- what if the user
> does usermode threading and resumes on a different thread?
> Regrettably, the kernel has the restartblock mechanism and does keep
> some limited state, and this sucks and has nasty corner cases, and I
> really don't think we want to expose this to seccomp.
>
>
> One solution is to declare that, for now, we will only allow one user
> notifier in the stack or at least only one that declares its intention
> to use the redirect feature. Even with this, the fact that we kind of
> need to fix up registers after a redirected syscall is a mess, but at
> least *that* mess can be fixed in a sort of one-deep sense by making
> sure that we fix it after precisely the one syscall we issued (which
> is roughly what your patch does).

I think the latter one is better, since it only constrains redirect-capable
notifiers and leaves ordinary ones alone. I will incorporate this into the
next update.
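
To make the constraint concrete, here is a rough userspace sketch of the
attach-time check I have in mind. It is not the patch code: the struct, the
redirect_capable flag and the function names are all made up for
illustration, and -EBUSY is only an assumption, picked by analogy with the
existing duplicate-listener check.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct seccomp_filter; only what the check needs. */
struct sketch_attach_filter {
	struct sketch_attach_filter *prev;	/* next filter toward the root */
	bool redirect_capable;			/* assumed new per-filter flag */
};

/*
 * Refuse to attach a second redirect-capable notifier to a stack that
 * already has one. Ordinary filters and notifiers are never constrained.
 * Returns 0 if the attach may proceed, -EBUSY otherwise.
 */
static int sketch_check_attach(const struct sketch_attach_filter *stack_top,
			       bool wants_redirect)
{
	const struct sketch_attach_filter *f;

	if (!wants_redirect)
		return 0;

	for (f = stack_top; f; f = f->prev) {
		if (f->redirect_capable)
			return -EBUSY;
	}
	return 0;
}

/* Usage: a two-deep stack whose inner filter is already redirect-capable. */
void sketch_attach_demo(void)
{
	struct sketch_attach_filter root = { NULL, false };
	struct sketch_attach_filter inner = { &root, true };

	assert(sketch_check_attach(&inner, true) == -EBUSY);	/* second one refused */
	assert(sketch_check_attach(&inner, false) == 0);	/* ordinary filter is fine */
	assert(sketch_check_attach(&root, true) == 0);		/* first one is fine */
}
```

The walk covers the whole stack, so the limit is per stack rather than per
task, and a filter that does not ask for redirect never pays for it.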

>
> >
> > >
> > > Now security: right now we have this rule:
> > >
> > > /*
> > > * All BPF programs must return a 32-bit value.
> > > * The bottom 16-bits are for optional return data.
> > > * The upper 16-bits are ordered from least permissive values to most,
> > > * as a signed value (so 0x8000000 is negative).
> > > *
> > > * The ordering ensures that a min_t() over composed return values always
> > > * selects the least permissive choice.
> > > */
> > > #define SECCOMP_RET_KILL_PROCESS 0x80000000U /* kill the process */
> > > #define SECCOMP_RET_KILL_THREAD 0x00000000U /* kill the thread */
> > > #define SECCOMP_RET_KILL SECCOMP_RET_KILL_THREAD
> > > #define SECCOMP_RET_TRAP 0x00030000U /* disallow and force a SIGSYS */
> > > #define SECCOMP_RET_ERRNO 0x00050000U /* returns an errno */
> > > #define SECCOMP_RET_USER_NOTIF 0x7fc00000U /* notifies userspace */
> > > #define SECCOMP_RET_TRACE 0x7ff00000U /* pass to a tracer or disallow */
> > > #define SECCOMP_RET_LOG 0x7ffc0000U /* allow after logging */
> > > #define SECCOMP_RET_ALLOW 0x7fff0000U /* allow */
> > >
> > > This has always bothered me. In the absence of USER_NOTIF and TRACE,
> > > fine, I guess -- we're choosing the least permissive, and this doesn't
> > > seem too crazy. But if we do anything fancy (like this patch series),
> > > I think this becomes wrong. (And I kind of think I said something
> > > along these lines many years ago.)
> > >
> > > Before this series (in current kernels), one can do syscall emulation
> > > using SECCOMP_RET_TRACE, and it kind of works as long as no filter in
> > > the stack tries to block the original syscall *or* the rewritten
> > > syscall, because syscalls issued by using ptrace to redirect the
> > > traced process go through seccomp again. It's a total mess, it can't
> > > handle complex cases, but it's at least approximately secure.
> > >
> > > With this series, I think it's all busted. Suppose I make a container
> > > and block everyone's favorite unshare, generating SECCOMP_RET_ERRNO.
> > > (This is the default "docker" (actually moby I think) policy.) Then,
> > > inside the container, I write a program that installs a filter that
> > > sends syscall 12345 to USER_NOTIF. Then I fork and my child does
> > > syscall 12345. I handle USER_NOTIF by using the new redirect feature
> > > to redirect to unshare(). And unshare() gets called.
> > >
> > > IMHO what *should* happen is that we actually keep track of where we
> > > are in the seccomp filter stack. We start from the innermost filter
> > > (most recently applied) and start running the filters. And then we do
> > > something that actually makes sense based on the result. For example:
> > >
> > > KILL: Kill it. Do not run more filters. (I suppose we could see if
> > > an outer filter promotes from KILL_THREAD to KILL_PROCESS, but this
> > > doesn't seem helpful.)
> > > TRAP: Generate the signal. Do *not* run more filters. Sure, this
> > > can allow a contained program to generate a SIGSYS instead of getting
> > > killed if it tries some blocked syscall that the outer filter wants to
> > > KILL. So what? I actually think this is better behavior -- the
> > > combination of the program and inner filter is not actually doing the
> > > syscall.
> > >
> > > ERRNO: Same deal -- replace the syscall with a return of the specified
> > > value. Don't call more filters.
> > >
> > > TRACE: Similar.
> > >
> > > ALLOW: Call the next filter.
> > >
> > > USER_NOTIF: Stop calling filters and remember where we are in the
> > > filter chain. Call out to the user notifier *associated with this
> > > filter*. When the user notifier responds, if the notifier asks for a
> > > redirect or to resume the syscall, then continue calling filters *on
> > > the new syscall*.
> > >
> > > Looking at my example above, the effect would be that the inner filter
> > > gets a notifier event for syscall 12345 and redirects to unshare.
> > > Then the outer filter sees unshare. It can ERRNO to cause unshare to
> > > return an error, or it can do its own USER_NOTIF to do something fancy
> > > with unshare, or it can KILL, etc.
> > >
> > >
> > > This may be enough of a scary departure that we will want each filter
> > > to opt in to the new behavior for filters applied later. Or maybe
> > > everyone can get comfortable enough with it to just switch over. Or
> > > maybe there's another solution. Or maybe someone can try to convince
> > > me that the existing behavior makes sense if syscalls can be
> > > redirected (maybe call the whole chain on the redirected syscall? Even
> > > defining that gets a little messy.)
> >
> > Thanks for the detailed analysis!
> >
> > How about keeping min and re-run only the outer suffix after a redirect?
> >
> > I think this is the safest option. I agree your suggestion of removing min
> > is more elegant, but it also brings risks of breaking existing filtering logic.
>
> I'm really not convinced that the min is needed to preserve any useful
> behavior. But Kees is very conservative about these things, with good
> reason.
>
> I'm also not sure what happens if we do a redirect and then discover
> that the outer rules trigger a user notifier. We need *some*
> semantics in this case.

Agreed, and you're right that anchoring the patch on "preserve the min"
was the wrong framing. I've reworked it: there's no min re-run anymore.

The first pass over the full stack is unchanged (still the min, still the
allow-cache, so nothing changes for existing non-redirect users), but the
redirect path is now a sequential continuation rather than a second min
evaluation.

After a redirect rewrites the registers, evaluation resumes at the filter
immediately outside the one that notified (match->prev) and walks strictly
toward the root, one filter at a time, stopping at the first filter that
doesn't allow the substituted syscall (ALLOW and LOG fall through).
seccomp_run_filters() is reverted to its original signature; the continuation
lives in a small separate helper, so the common path is untouched again.
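
The continuation is roughly the following, written as a userspace sketch
rather than the actual helper. The struct, the run() callback (standing in
for running the BPF program) and every sketch_* name are made up for
illustration; the SECCOMP_RET_* values are the uapi ones, and 272 is
unshare on x86-64.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Values as in include/uapi/linux/seccomp.h. */
#define SECCOMP_RET_ERRNO	0x00050000U
#define SECCOMP_RET_LOG		0x7ffc0000U
#define SECCOMP_RET_ALLOW	0x7fff0000U
#define SECCOMP_RET_ACTION_FULL	0xffff0000U

#define SKETCH_NR_UNSHARE	272	/* x86-64 syscall number */

/* Hypothetical stand-in for struct seccomp_filter. */
struct sketch_filter {
	struct sketch_filter *prev;	/* next filter toward the root */
	uint32_t (*run)(int nr);	/* stands in for the BPF program */
};

/*
 * Resume after a redirect: start just outside the filter that notified
 * and walk toward the root. ALLOW and LOG fall through; the first other
 * verdict is returned as-is. No min is taken over the results.
 */
static uint32_t sketch_continue_after_redirect(const struct sketch_filter *match,
					       int nr)
{
	const struct sketch_filter *f;

	assert(match != NULL);

	for (f = match->prev; f; f = f->prev) {
		uint32_t ret = f->run(nr);
		uint32_t action = ret & SECCOMP_RET_ACTION_FULL;

		if (action == SECCOMP_RET_ALLOW || action == SECCOMP_RET_LOG)
			continue;
		return ret;
	}
	return SECCOMP_RET_ALLOW;
}

/* Example outer policy: unshare fails with EPERM, everything else is allowed. */
static uint32_t sketch_deny_unshare(int nr)
{
	if (nr == SKETCH_NR_UNSHARE)
		return SECCOMP_RET_ERRNO | EPERM;
	return SECCOMP_RET_ALLOW;
}

static uint32_t sketch_allow_all(int nr)
{
	(void)nr;
	return SECCOMP_RET_ALLOW;
}

/*
 * Usage: root (denies unshare) <- middle (allows all) <- inner (notifier).
 * Returns the verdict for a redirect from the inner notifier to nr.
 */
uint32_t sketch_demo_verdict(int nr)
{
	struct sketch_filter root = { NULL, sketch_deny_unshare };
	struct sketch_filter middle = { &root, sketch_allow_all };
	/* The notifier's own program is not re-run, so its callback is unused here. */
	struct sketch_filter inner = { &middle, sketch_allow_all };

	return sketch_continue_after_redirect(&inner, nr);
}
```

With that stack, a redirect from the inner notifier to unshare gets the
root filter's ERRNO verdict, which is the container example from earlier
in the thread. What should happen when an outer filter answers USER_NOTIF
during this walk is not covered by the sketch.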

Please let me know if you agree this is a better direction.

Thanks,
Cong