Re: [RFC PATCH v3 2/3] seccomp: add kernel-installed pinned-memfd redirect

From: Andy Lutomirski

Date: Sat Jun 13 2026 - 00:03:50 EST

On Fri, Jun 12, 2026 at 5:16 PM Cong Wang <xiyou.wangcong@xxxxxxxxx> wrote:
>
> Two new ioctls are introduced:
>
> SECCOMP_IOCTL_NOTIF_PIN_INSTALL
>
> Supervisor names an active notification id, a memfd it owns,
> and a target address+size. Kernel grabs the trapped task's
> mm via get_task_mm(), calls vm_mmap_pgoff_to_mm() with
> MAP_FIXED | MAP_SHARED, PROT_READ, and VM_SEALED in
> extra_vm_flags. On success the VMA is installed in the
> target's mm, immediately sealed against munmap/mremap/
> mprotect/MAP_FIXED-stomp from the target itself and any
> CLONE_VM peer. The range is recorded on the listener filter
> for SEND_REDIRECT validation.
>

I haven't read the code, but I think this at least conceptually makes
a decent amount of sense. But...

> SECCOMP_IOCTL_NOTIF_SEND_REDIRECT
>
> Resumes the trapped syscall (like FLAG_CONTINUE) with
> arg-register substitution. The supervisor supplies an
> args_mask (which arg registers to replace), a ptr_mask
> (which of those are pointers, validated to fall inside an
> installed pin) and replacement values. The kernel saves
> the trapped task's original arg registers into a small
> heap record, writes substituted values via
> syscall_set_arguments(), and queues a task_work callback
> that fires at user-mode return after the syscall completes
> to restore the original registers. This preserves the
> caller-saved arg-register ABI invariant for callers that
> expected register contents to survive across the syscall
> (compilers under LTO, inline-asm syscall wrappers, anything
> that doesn't strictly follow psABI).

Here there be dragons, and I kind of alluded to some of those dragons
in my recent message about STRICT, but let's be more thorough.

I'm going to totally ignore the implementation for now (which I think
has a memory leak, but whatever -- this is solvable, at least in
principle). Conceptually, SEND_REDIRECT is handling a seccomp action
by doing a syscall that may be different from the originally requested
syscall. And we have a whole host of potential issues, some related
to security and some related to functionality.

Let's do the functionality ones first: what happens if a signal
arrives? In the simplest cases (signal completely ignored, task
killed (there's the memory leak), or -EINTR), I think we're mostly
okay. But in the case where the syscall needs to restart or, worse,
use one of the fancy restart techniques, what should happen? I think
that even defining semantics is somewhat nontrivial, and I'm a bit
concerned that the user notifier would need to actually be aware of
signals. Yuck.

Now security: right now we have this rule:

/*
 * All BPF programs must return a 32-bit value.
 * The bottom 16-bits are for optional return data.
 * The upper 16-bits are ordered from least permissive values to most,
 * as a signed value (so 0x8000000 is negative).
 *
 * The ordering ensures that a min_t() over composed return values always
 * selects the least permissive choice.
 */
#define SECCOMP_RET_KILL_PROCESS 0x80000000U /* kill the process */
#define SECCOMP_RET_KILL_THREAD 0x00000000U /* kill the thread */
#define SECCOMP_RET_KILL SECCOMP_RET_KILL_THREAD
#define SECCOMP_RET_TRAP 0x00030000U /* disallow and force a SIGSYS */
#define SECCOMP_RET_ERRNO 0x00050000U /* returns an errno */
#define SECCOMP_RET_USER_NOTIF 0x7fc00000U /* notifies userspace */
#define SECCOMP_RET_TRACE 0x7ff00000U /* pass to a tracer or disallow */
#define SECCOMP_RET_LOG 0x7ffc0000U /* allow after logging */
#define SECCOMP_RET_ALLOW 0x7fff0000U /* allow */

This has always bothered me. In the absence of USER_NOTIF and TRACE,
fine, I guess -- we're choosing the least permissive, and this doesn't
seem too crazy. But if we do anything fancy (like this patch series),
I think this becomes wrong. (And I kind of think I said something
along these lines many years ago.)
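
As a concrete illustration of the rule being discussed, here is a small
userspace model of the "signed minimum over all filters" composition. It
is only a sketch: the RET_* names mirror the values quoted above but are
local to this example, and compose() is a hypothetical helper, not a
kernel function.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Local copies of the action values quoted above (names are illustrative). */
#define RET_KILL_PROCESS 0x80000000U
#define RET_KILL_THREAD  0x00000000U
#define RET_TRAP         0x00030000U
#define RET_ERRNO        0x00050000U
#define RET_USER_NOTIF   0x7fc00000U
#define RET_TRACE        0x7ff00000U
#define RET_LOG          0x7ffc0000U
#define RET_ALLOW        0x7fff0000U
#define RET_ACTION_FULL  0xffff0000U

/* Keep only the action bits and read them as signed, so that
 * KILL_PROCESS (top bit set) sorts below everything else. */
static int32_t action_only(uint32_t ret)
{
    return (int32_t)(ret & RET_ACTION_FULL);
}

/* Hypothetical helper: combine the verdicts of n stacked filters by
 * keeping the one whose action is smallest, i.e. least permissive. */
static uint32_t compose(const uint32_t *verdicts, size_t n)
{
    assert(verdicts != NULL || n == 0);
    uint32_t ret = RET_ALLOW;
    for (size_t i = 0; i < n; i++)
        if (action_only(verdicts[i]) < action_only(ret))
            ret = verdicts[i];
    return ret;
}
```

With verdicts { RET_USER_NOTIF, RET_ERRNO | 1 } the result is the ERRNO
verdict, and USER_NOTIF only wins when every other filter says LOG or
ALLOW. Note that every filter is judging the same, original syscall.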

Before this series (in current kernels), one can do syscall emulation
using SECCOMP_RET_TRACE, and it kind of works as long as no filter in
the stack tries to block the original syscall *or* the rewritten
syscall, because syscalls issued by using ptrace to redirect the
traced process go through seccomp again. It's a total mess, it can't
handle complex cases, but it's at least approximately secure.

With this series, I think it's all busted. Suppose I make a container
and block everyone's favorite unshare, generating SECCOMP_RET_ERRNO.
(This is the default "docker" (actually moby I think) policy.) Then,
inside the container, I write a program that installs a filter that
sends syscall 12345 to USER_NOTIF. Then I fork and my child does
syscall 12345. I handle USER_NOTIF by using the new redirect feature
to redirect to unshare(). And unshare() gets called.
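
The scenario above can be modelled in a few lines of userspace C. This is
a sketch under a stated assumption: that the redirected syscall is not
run through the filter stack again, which is the behavior described in
the example rather than something verified against the patch. All names
(container_filter, inner_filter, dispatch_with_redirect) are
hypothetical, and 272 is just the x86-64 unshare number used as a label.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define RET_ERRNO       0x00050000U
#define RET_USER_NOTIF  0x7fc00000U
#define RET_ALLOW       0x7fff0000U
#define RET_ACTION_FULL 0xffff0000U

#define NR_UNSHARE 272   /* x86-64 number; only a label in this model */
#define NR_MAGIC   12345 /* the made-up syscall from the example */

typedef uint32_t (*filter_fn)(int nr);

/* Outer (container) policy: unshare fails with an errno. */
static uint32_t container_filter(int nr)
{
    return nr == NR_UNSHARE ? (RET_ERRNO | 1 /* EPERM */) : RET_ALLOW;
}

/* Inner filter installed by the contained program. */
static uint32_t inner_filter(int nr)
{
    return nr == NR_MAGIC ? RET_USER_NOTIF : RET_ALLOW;
}

/* Current rule: run every filter on the same nr, keep the signed minimum. */
static uint32_t run_all(const filter_fn *f, size_t n, int nr)
{
    assert(n > 0);
    uint32_t ret = RET_ALLOW;
    for (size_t i = 0; i < n; i++) {
        uint32_t cur = f[i](nr);
        if ((int32_t)(cur & RET_ACTION_FULL) < (int32_t)(ret & RET_ACTION_FULL))
            ret = cur;
    }
    return ret;
}

/* Returns the syscall number that finally executes, or -1 if none does.
 * ASSUMPTION: after a redirect, the filters are not consulted again. */
static int dispatch_with_redirect(const filter_fn *f, size_t n,
                                  int nr, int redirect_to)
{
    uint32_t action = run_all(f, n, nr) & RET_ACTION_FULL;
    if (action == RET_ALLOW)
        return nr;
    if (action == RET_USER_NOTIF)
        return redirect_to;
    return -1;
}
```

Calling unshare directly is refused by the container filter, but syscall
12345 redirected to unshare gets through, because the container filter
only ever saw 12345.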

IMHO what *should* happen is that we actually keep track of where we
are in the seccomp filter stack. We start from the innermost filter
(most recently applied) and start running the filters. And then we do
something that actually makes sense based on the result. For example:

KILL: Kill it. Do not run more filters. (I suppose we could see if
an outer filter promotes from KILL_THREAD to KILL_PROCESS, but this
doesn't seem helpful.)

TRAP: Generate the signal. Do *not* run more filters. Sure, this
can allow a contained program to generate a SIGSYS instead of getting
killed if it tries some blocked syscall that the outer filter wants to
KILL. So what? I actually think this is better behavior -- the
combination of the program and inner filter is not actually doing the
syscall.

ERRNO: Same deal -- replace the syscall with a return of the specified
value. Don't call more filters.

TRACE: Similar.

ALLOW: Call the next filter.

USER_NOTIF: Stop calling filters and remember where we are in the
filter chain. Call out to the user notifier *associated with this
filter*. When the user notifier responds, if the notifier asks for a
redirect or to resume the syscall, then continue calling filters *on
the new syscall*.

Looking at my example above, the effect would be that the inner filter
gets a notifier event for syscall 12345 and redirects to unshare.
Then the outer filter sees unshare. It can ERRNO to cause unshare to
return an error, or it can do its own USER_NOTIF to do something fancy
with unshare, or it can KILL, etc.
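
A rough userspace model of the chain-walking idea, applied to the same
example, might look like the following. It is a sketch only: struct
filter, its notify callback, and walk_chain() are hypothetical names
invented for illustration, KILL/TRAP/ERRNO/TRACE are collapsed into "the
syscall does not run", and 272 is again just the x86-64 unshare number
used as a label.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define RET_ERRNO       0x00050000U
#define RET_USER_NOTIF  0x7fc00000U
#define RET_ALLOW       0x7fff0000U
#define RET_ACTION_FULL 0xffff0000U

#define NR_UNSHARE 272   /* x86-64 number; only a label in this model */
#define NR_MAGIC   12345 /* the made-up syscall from the example */

struct filter {
    uint32_t (*run)(int nr);  /* the filter program */
    int (*notify)(int nr);    /* this filter's notifier: returns the syscall
                               * to resume with; NULL if no listener */
};

static uint32_t container_run(int nr)
{
    return nr == NR_UNSHARE ? (RET_ERRNO | 1 /* EPERM */) : RET_ALLOW;
}

static uint32_t inner_run(int nr)
{
    return nr == NR_MAGIC ? RET_USER_NOTIF : RET_ALLOW;
}

/* Supervisor for the inner filter: redirect 12345 to unshare. */
static int inner_notify(int nr)
{
    return nr == NR_MAGIC ? NR_UNSHARE : nr;
}

/* f[0] is the innermost (most recently applied) filter.
 * Returns the syscall number that finally executes, or -1 if none does. */
static int walk_chain(const struct filter *f, size_t n, int nr)
{
    assert(f != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        uint32_t action = f[i].run(nr) & RET_ACTION_FULL;
        if (action == RET_ALLOW)
            continue;                 /* ask the next outer filter */
        if (action == RET_USER_NOTIF && f[i].notify != NULL) {
            nr = f[i].notify(nr);     /* outer filters see the new syscall */
            continue;
        }
        return -1;                    /* stop: no more filters are run */
    }
    return nr;
}
```

With the inner filter stacked on the container filter, syscall 12345 is
redirected to unshare and then refused by the container filter. With the
inner filter alone, the redirected unshare runs.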

This may be enough of a scary departure that we will want each filter
to opt in to the new behavior for filters applied later. Or maybe
everyone can get comfortable enough with it to just switch over. Or
maybe there's another solution. Or maybe someone can try to convince
me that the existing behavior makes sense if syscalls can be
redirected (maybe call the whole chain on the redirected syscall? Even
defining that gets a little messy).

--Andy