[RFC PATCH v3 0/3] seccomp: non-cooperative pinned-memfd argument redirect

From: Cong Wang

Date: Fri Jun 12 2026 - 20:17:18 EST


The seccomp user-notification SECCOMP_USER_NOTIF_FLAG_CONTINUE response
carries an inherent TOCTOU: once the supervisor decides to let a syscall
continue, the target (or a CLONE_VM peer) can rewrite the memory behind a
pointer argument before the kernel reads it. This is documented in the
UAPI header and is why the notifier "cannot be used to implement a
security policy" today.

The cooperative way around this is for the target to map a shared memfd
and mseal() it during a trusted setup window, so the supervisor can hand
the kernel an immutable buffer. That window does not exist for the common
fork()+execve() sandbox model, where the supervisor wants to confine an
uncooperative (or legacy) binary it did not write.

This series lets the supervisor close the TOCTOU without any target-side
cooperation:

 - The kernel installs a sealed, read-only, MAP_SHARED mapping of a
   supervisor-owned memfd directly into the trapped task's mm
   (SECCOMP_IOCTL_NOTIF_PIN_INSTALL). The mapping is VM_SEALED at
   creation, so neither the target nor a CLONE_VM peer can unmap,
   remap, mprotect or MAP_FIXED-stomp it. The supervisor writes the
   intended argument data through its own mapping of the same memfd.

 - The supervisor then resumes the syscall with selected argument
   registers rewritten to point into that pin
   (SECCOMP_IOCTL_NOTIF_SEND_REDIRECT). Pointer substitutions are
   validated so the whole access [ptr, ptr+len) lies inside a pin that
   still lives in the target's current mm; original registers are
   restored at syscall exit for ABI compliance.
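
To illustrate the [ptr, ptr+len) containment rule above, here is a
minimal userspace sketch of an overflow-safe range check. It is not
code from the series: struct pin_range and pin_contains() are
hypothetical names chosen for illustration, and the check in patch 2
may be structured differently.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical description of one installed pin (illustration only). */
struct pin_range {
	uint64_t start;	/* address of the pin in the target's mm */
	uint64_t len;	/* length of the pin in bytes */
};

/*
 * Return true if [ptr, ptr + len) lies entirely inside the pin.
 * Neither ptr + len nor start + len is ever computed, so hostile
 * values cannot make the comparison wrap around.
 */
static bool pin_contains(const struct pin_range *pin, uint64_t ptr,
			 uint64_t len)
{
	uint64_t off;

	if (ptr < pin->start)
		return false;

	off = ptr - pin->start;		/* cannot underflow: ptr >= start */
	if (off > pin->len)
		return false;

	return len <= pin->len - off;	/* cannot underflow: off <= len */
}
```

The point of comparing against the remaining length (pin->len - off)
rather than against an end address is that a supervisor-supplied len
close to UINT64_MAX is rejected instead of wrapping.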

Because the data the kernel acts on lives in an immutable pin, the
target can no longer win the race. execve() is handled as a first-class
case: its pathname is copied from the pin before the old mm is torn
down, and the register-restore is skipped once the program image has
been replaced (detected via self_exec_id).
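
The restore-unless-exec'd rule can be modelled in a few lines of plain
C. This is only a sketch of the logic described above, not the
implementation in patch 2: struct task_model, struct redirect_state,
redirect_enter() and redirect_exit() are hypothetical names, and the
real code operates on pt_regs and task_struct.

```c
#include <stdbool.h>
#include <stdint.h>

#define NR_ARGS 6

/* Hypothetical stand-in for the trapped task (illustration only). */
struct task_model {
	uint64_t args[NR_ARGS];	/* syscall argument registers */
	uint64_t self_exec_id;	/* changes when execve() replaces the image */
};

/* Hypothetical per-redirect state recorded at syscall entry. */
struct redirect_state {
	uint64_t saved[NR_ARGS];	/* original argument registers */
	uint8_t mask;			/* bit i set: args[i] was rewritten */
	uint64_t exec_id;		/* self_exec_id sampled at entry */
};

/* Save the originals, then rewrite only the registers selected by mask. */
static void redirect_enter(struct task_model *t, struct redirect_state *st,
			   uint8_t mask, const uint64_t *new_args)
{
	int i;

	st->mask = mask;
	st->exec_id = t->self_exec_id;
	for (i = 0; i < NR_ARGS; i++) {
		st->saved[i] = t->args[i];
		if (mask & (1u << i))
			t->args[i] = new_args[i];
	}
}

/*
 * At syscall exit, put the original registers back unless the program
 * image was replaced in between. Returns true if a restore happened.
 */
static bool redirect_exit(struct task_model *t,
			  const struct redirect_state *st)
{
	int i;

	if (t->self_exec_id != st->exec_id)
		return false;	/* new image: the old registers are meaningless */

	for (i = 0; i < NR_ARGS; i++)
		if (st->mask & (1u << i))
			t->args[i] = st->saved[i];
	return true;
}
```

In this model a successful execve() is represented by changing
self_exec_id between redirect_enter() and redirect_exit(), in which
case the registers of the new image are left untouched.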

Patch 1 adds the mm plumbing: __do_mmap(), a variant of do_mmap() that
targets a caller-supplied mm (do_mmap() stays a current->mm wrapper, so
no existing caller changes), and vm_mmap_seal_remote(), a tailored
high-level helper for installing the sealed pin. Patch 2 is the seccomp
ABI and implementation. Patch 3 adds selftests.

Changes since v2:
v3 is a redesign rather than an incremental revision. v2 added a
SECCOMP_IOCTL_NOTIF_INJECT ioctl: the supervisor described a
substitute syscall plus an input buffer, and on CONTINUE the kernel
copied that buffer in and ran a kernel-side helper for a small
whitelist of syscalls (openat, bind, write) without re-reading the
target's memory. That closed the TOCTOU, but required an in-kernel
reimplementation of every supported syscall and a fixed whitelist,
and never actually ran the real syscall.

v3 drops the kernel-side helpers entirely, as suggested by Andy: the
real syscall now runs, with its pointer arguments redirected into a
sealed pin.

All four pinned-memfd selftests pass.

---
Cong Wang (3):
  mm: add __do_mmap() and vm_mmap_seal_remote()
  seccomp: add kernel-installed pinned-memfd redirect
  selftests/seccomp: cover non-cooperative pinned-memfd install

 include/linux/mm.h                            |   2 +
 include/linux/seccomp.h                       |   8 +
 include/uapi/linux/seccomp.h                  |  99 ++
 kernel/seccomp.c                              | 366 +++++++
 mm/internal.h                                 |   5 +
 mm/mmap.c                                     |  29 +-
 mm/nommu.c                                    |  12 +-
 mm/util.c                                     |  50 +
 mm/vma.c                                      |  18 +-
 mm/vma.h                                      |   6 +-
 tools/testing/selftests/seccomp/seccomp_bpf.c | 960 ++++++++++++++++++
 11 files changed, 1533 insertions(+), 22 deletions(-)


base-commit: 28608283615e5e7e92ea79c8ea13507f4b5e0cbe
--
2.43.0