Re: [RFC PATCH v3 2/3] seccomp: add kernel-installed pinned-memfd redirect

From: Cong Wang

Date: Sat Jun 20 2026 - 17:12:22 EST

On Fri, Jun 12, 2026 at 9:03 PM Andy Lutomirski <luto@xxxxxxxxxxxxxx> wrote:
>
> On Fri, Jun 12, 2026 at 5:16 PM Cong Wang <xiyou.wangcong@xxxxxxxxx> wrote:
> >
> > Two new ioctls are introduced:
> >
> > SECCOMP_IOCTL_NOTIF_PIN_INSTALL
> >
> > Supervisor names an active notification id, a memfd it owns,
> > and a target address+size. Kernel grabs the trapped task's
> > mm via get_task_mm(), calls vm_mmap_pgoff_to_mm() with
> > MAP_FIXED | MAP_SHARED, PROT_READ, and VM_SEALED in
> > extra_vm_flags. On success the VMA is installed in the
> > target's mm, immediately sealed against munmap/mremap/
> > mprotect/MAP_FIXED-stomp from the target itself and any
> > CLONE_VM peer. The range is recorded on the listener filter
> > for SEND_REDIRECT validation.
> >
>
> I haven't read the code, but I think this at least conceptually makes
> a decent amount of sense. But...
>
> > SECCOMP_IOCTL_NOTIF_SEND_REDIRECT
> >
> > Resumes the trapped syscall (like FLAG_CONTINUE) with
> > arg-register substitution. The supervisor supplies an
> > args_mask (which arg registers to replace), a ptr_mask
> > (which of those are pointers, validated to fall inside an
> > installed pin) and replacement values. The kernel saves
> > the trapped task's original arg registers into a small
> > heap record, writes substituted values via
> > syscall_set_arguments(), and queues a task_work callback
> > that fires at user-mode return after the syscall completes
> > to restore the original registers. This preserves the
> > caller-saved arg-register ABI invariant for callers that
> > expected register contents to survive across the syscall
> > (compilers under LTO, inline-asm syscall wrappers, anything
> > that doesn't strictly follow psABI).
>
> Here there be dragons, and I kind of alluded to some of those dragons
> in my recent message about STRICT, but let's be more thorough.
>
> I'm going to totally ignore the implementation for now (which I think
> has a memory leak, but whatever -- this is solvable, at least in

Yes, let's get the design right before digging into the details.

> principle). Conceptually, SEND_REDIRECT is handling a seccomp action
> by doing a syscall that may be different from the originally requested
> syscall. And we have a whole host of potential issues, some related
> to security and some related to functionality.
>
> Let's do the functionality ones first: what happens if a signal
> happens? In the simplest cases (signal completely ignored, task
> killed (there's the memory leak), or -EINTR), I think we're mostly
> okay. But in the case where the syscall needs to restart or, worse,
> use one of the fancy restart techniques, what should happen? I think
> that even defining semantics is somewhat nontrivial, and I'm a bit
> concerned that the user notifier would need to actually be aware of
> signals. Yuck.

Good catch! I completely missed the signal case. How about restoring the
original registers at syscall exit, before signal/restart processing?

>
> Now security: right now we have this rule:
>
> /*
> * All BPF programs must return a 32-bit value.
> * The bottom 16-bits are for optional return data.
> * The upper 16-bits are ordered from least permissive values to most,
> * as a signed value (so 0x8000000 is negative).
> *
> * The ordering ensures that a min_t() over composed return values always
> * selects the least permissive choice.
> */
> #define SECCOMP_RET_KILL_PROCESS 0x80000000U /* kill the process */
> #define SECCOMP_RET_KILL_THREAD 0x00000000U /* kill the thread */
> #define SECCOMP_RET_KILL SECCOMP_RET_KILL_THREAD
> #define SECCOMP_RET_TRAP 0x00030000U /* disallow and force a SIGSYS */
> #define SECCOMP_RET_ERRNO 0x00050000U /* returns an errno */
> #define SECCOMP_RET_USER_NOTIF 0x7fc00000U /* notifies userspace */
> #define SECCOMP_RET_TRACE 0x7ff00000U /* pass to a tracer or disallow */
> #define SECCOMP_RET_LOG 0x7ffc0000U /* allow after logging */
> #define SECCOMP_RET_ALLOW 0x7fff0000U /* allow */
>
> This has always bothered me. In the absence of USER_NOTIF and TRACE,
> fine, I guess -- we're choosing the least permissive, and this doesn't
> seem too crazy. But if we do anything fancy (like this patch series),
> I think this becomes wrong. (And I kind of think I said something
> along these lines many years ago.)
>
> Before this series (in current kernels), one can do syscall emulation
> using SECCOMP_RET_TRACE, and it kind of works as long as no filter in
> the stack tries to block the original syscall *or* the rewritten
> syscall, because syscalls issued by using ptrace to redirect the
> traced process go through seccomp again. It's a total mess, it can't
> handle complex cases, but it's at least approximately secure.
>
> With this series, I think it's all busted. Suppose I make a container
> and block everyone's favorite unshare, generating SECCOMP_RET_ERRNO.
> (This is the default "docker" (actually moby I think) policy.) Then,
> inside the container, I write a program that installs a filter that
> sends syscall 12345 to USER_NOTIF. Then I fork and my child does
> syscall 12345. I handle USER_NOTIF by using the new redirect feature
> to redirect to unshare(). And unshare() gets called.
>
> IMHO what *should* happen is that we actually keep track of where we
> are in the seccomp filter stack. We start from the innermost filter
> (most recently applied) and start running the filters. And then we do
> something that actually makes sense based on the result. For example:
>
> KILL: Kill it. Do not run more filters. (I suppose we could see if
> an outer filter promotes from KILL_THREAD to KILL_PROCESS, but this
> doesn't seem helpful.)
> TRAP: Generate the signal. Do *not* run more filters. Sure, this
> can allow a contained program to generate a SIGSYS instead of getting
> killed if it tries some blocked syscall that the outer filter wants to
> KILL. So what? I actually think this is better behavior -- the
> combination of the program and inner filter is not actually doing the
> syscall.
>
> ERRNO: Same deal -- replace the syscall with a return of the specified
> value. Don't call more filters.
>
> TRACE: Similar.
>
> ALLOW: Call the next filter.
>
> USER_NOTIF: Stop calling filters and remember where we are in the
> filter chain. Call out to the user notifier *associated with this
> filter*. When the user notifier responds, if the notifier asks for a
> redirect or to resume the syscall, then continue calling filters *on
> the new syscall*.
>
> Looking at my example above, the effect would be that the inner filter
> gets a notifier event for syscall 12345 and redirects to unshare.
> Then the outer filter sees unshare. It can ERROR to cause unshare to
> return an error, or it can do its own USER_NOTIF to do something fancy
> with unshare, or it can KILL, etc.
>
>
> This may be enough of a scary departure that we will want each filter
> to opt in to the new behavior for filters applied later. Or maybe
> everyone can get comfortable enough with it to just switch over. Or
> maybe there's another solution. Or maybe someone can try to convince
> me that the existing behavior makes sense if syscalls can be
> redirected (maybe call the whole chain on the redirected syscall? Even
> defining that gets a little messy.)

Thanks for the detailed analysis!

How about keeping the min semantics and re-running only the outer suffix
(the filters outward of the one that issued the notification) after a
redirect?

I think this is the safest option. I agree that your suggestion of dropping
min is more elegant, but it risks breaking existing filtering logic.

Below is a code sketch of what I propose:

static u32 seccomp_run_filters_from(const struct seccomp_data *sd,
				    struct seccomp_filter *start,
				    struct seccomp_filter **match)
{
	u32 ret = SECCOMP_RET_ALLOW;

	/* Same min vote as today, but starting at `start` and walking outward. */
	for (struct seccomp_filter *f = start; f; f = f->prev) {
		u32 cur_ret = bpf_prog_run_pin_on_cpu(f->prog, sd);

		if (ACTION_ONLY(cur_ret) < ACTION_ONLY(ret)) {
			ret = cur_ret;
			*match = f;
		}
	}
	return ret;
}

/* __seccomp_filter takes a `start` (innermost on the first call) */
case SECCOMP_RET_USER_NOTIF:
	if (seccomp_do_user_notification(this_syscall, match, &sd))
		goto skip;

	/* syscall may be rewritten now; re-vote with the OUTER filters only */
	this_syscall = syscall_get_nr(current, current_pt_regs());
	if (this_syscall < 0)
		return 0;
	return __seccomp_filter(this_syscall, match->prev /* start outward */);
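
To make the intended semantics concrete, here is a small userspace model of
the same idea, applied to your unshare example. It is only a sketch, not
kernel code: the toy_* names, the function-pointer "programs" and the fixed
redirect target are all made up for illustration, and the supervisor is
assumed to always answer with a redirect. The action values and the signed
ACTION_ONLY() comparison are taken from the header quoted above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Action values as in the seccomp UAPI header quoted above. */
#define RET_ERRNO       0x00050000U
#define RET_USER_NOTIF  0x7fc00000U
#define RET_ALLOW       0x7fff0000U
#define RET_ACTION_FULL 0xffff0000U
/* Signed compare, so lower means less permissive. */
#define ACTION_ONLY(ret) ((int32_t)((ret) & RET_ACTION_FULL))

#define TOY_NR_MAGIC   12345	/* syscall the inner filter sends to USER_NOTIF */
#define TOY_NR_UNSHARE 272	/* any distinct number works for the model */

/* Hypothetical stand-in for struct seccomp_filter. */
struct toy_filter {
	struct toy_filter *prev;	/* next filter outward */
	uint32_t (*prog)(int nr);	/* stands in for the BPF program */
};

/* Outer (container) policy: unshare fails with errno 1. */
static uint32_t toy_outer_prog(int nr)
{
	return nr == TOY_NR_UNSHARE ? (RET_ERRNO | 1) : RET_ALLOW;
}

/* Inner policy installed inside the container. */
static uint32_t toy_inner_prog(int nr)
{
	return nr == TOY_NR_MAGIC ? RET_USER_NOTIF : RET_ALLOW;
}

struct toy_filter toy_outer = { NULL, toy_outer_prog };
struct toy_filter toy_inner = { &toy_outer, toy_inner_prog };

/* Min vote over the chain, starting at `start` and walking outward. */
static uint32_t toy_run_filters_from(struct toy_filter *start, int nr,
				     struct toy_filter **match)
{
	uint32_t ret = RET_ALLOW;

	for (struct toy_filter *f = start; f; f = f->prev) {
		uint32_t cur = f->prog(nr);

		if (ACTION_ONLY(cur) < ACTION_ONLY(ret)) {
			ret = cur;
			*match = f;
		}
	}
	return ret;
}

/*
 * Returns the final verdict for syscall `nr`. On USER_NOTIF the modeled
 * supervisor always redirects to `redirect_nr`, and only the filters
 * outward of the notifying one vote on the rewritten syscall.
 */
uint32_t toy_seccomp(struct toy_filter *start, int nr, int redirect_nr)
{
	struct toy_filter *match = NULL;
	uint32_t ret = toy_run_filters_from(start, nr, &match);

	if ((ret & RET_ACTION_FULL) != RET_USER_NOTIF)
		return ret;

	assert(match);	/* a USER_NOTIF verdict always records its filter */
	return toy_seccomp(match->prev, redirect_nr, redirect_nr);
}
```

In this model, toy_seccomp(&toy_inner, TOY_NR_MAGIC, TOY_NR_UNSHARE) yields
the outer filter's ERRNO verdict rather than ALLOW, so the redirected
unshare() is still refused by the container policy. The inner filter is not
consulted again for the rewritten syscall, which is the part I would most
like your opinion on.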

Please let me know your preference.

Thanks!
Cong