[PATCH v4 0/6] seccomp: non-cooperative pinned-memfd argument redirect

From: Cong Wang

Date: Fri Jun 26 2026 - 21:22:57 EST


The seccomp user-notification SECCOMP_USER_NOTIF_FLAG_CONTINUE response
carries an inherent TOCTOU: once the supervisor decides to let a syscall
continue, the target (or a CLONE_VM peer) can rewrite the memory behind a
pointer argument before the kernel reads it. This is documented in the
UAPI header and is why the notifier "cannot be used to implement a
security policy" today.
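
For readers less familiar with the race, it can be modelled in a few
lines. This is a toy, single-threaded sketch and not kernel code: the
policy (allow only paths under /tmp/), supervisor_allows(),
kernel_copy_path() and continue_is_safe() are all hypothetical names
introduced for illustration. It shows only that a check made on
target-owned memory does not bind what is later read from it.

```c
/* Toy model of check-then-use on target-owned memory; not kernel code. */
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical policy: the supervisor only allows paths under /tmp/. */
static bool supervisor_allows(const char *path)
{
	return strncmp(path, "/tmp/", 5) == 0;
}

/* Stand-in for the kernel copying the pathname when the syscall runs. */
static void kernel_copy_path(char *dst, size_t dstlen, const char *user_buf)
{
	snprintf(dst, dstlen, "%s", user_buf);
}

/*
 * Returns true if the path the "kernel" finally used still satisfies the
 * supervisor's policy. A non-NULL rewrite models the target (or a
 * CLONE_VM peer) rewriting the buffer between the check and the use.
 */
static bool continue_is_safe(char *target_buf, size_t buflen,
			     const char *rewrite)
{
	char used[64];

	if (!supervisor_allows(target_buf))	/* time of check */
		return true;			/* denied: nothing to race */

	if (rewrite)				/* race window */
		snprintf(target_buf, buflen, "%s", rewrite);

	kernel_copy_path(used, sizeof(used), target_buf); /* time of use */
	return supervisor_allows(used);
}
```

Calling continue_is_safe() on a buffer holding "/tmp/ok" with a NULL
rewrite returns true; with a rewrite of "/etc/shadow" it returns false,
even though the supervisor's check passed.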

The cooperative way around this is for the target to map a shared memfd
and mseal() it during a trusted setup window, so the supervisor can hand
the kernel an immutable buffer. That window does not exist for the common
fork()+execve() sandbox model, where the supervisor wants to confine an
uncooperative (or legacy) binary it did not write.

This series lets the supervisor close the TOCTOU without any target-side
cooperation:

  - The kernel installs a sealed, read-only, MAP_SHARED mapping of a
    supervisor-owned memfd directly into the trapped task's mm
    (SECCOMP_IOCTL_NOTIF_PIN_INSTALL). The mapping is VM_SEALED at
    creation, so neither the target nor a CLONE_VM peer can unmap,
    remap, mprotect or MAP_FIXED-stomp it. The backing memfd must be
    write-sealed (F_SEAL_WRITE / F_SEAL_FUTURE_WRITE), so its bytes
    cannot be rewritten through any other reference either; the
    supervisor stages the argument data through its own pre-seal mapping.

  - The supervisor then resumes the syscall with selected argument
    registers rewritten (SECCOMP_IOCTL_NOTIF_SEND_REDIRECT), the pointer
    ones aimed into a pin. Each pointer substitution is validated so the
    whole access [ptr, ptr+len) lies inside a sealed, read-only pin of
    the supervisor's memfd that still lives in the target's current mm;
    original registers are restored at syscall exit for ABI compliance.
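
The supervisor-side staging step and the containment rule can be
sketched with existing APIs only. This is a minimal sketch under stated
assumptions, not code from the series: stage_and_seal() and
range_in_pin() are hypothetical helper names, the choice of
F_SEAL_WRITE (with the staging mapping unmapped first) rather than
F_SEAL_FUTURE_WRITE is mine, and the two new ioctls are not shown
because their argument layout is not given in this cover letter.

```c
/* Userspace sketch using existing memfd/sealing APIs only. */
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Stage len bytes of data into a fresh memfd of size map_len through a
 * writable mapping, then write-seal the memfd. Returns the fd or -1.
 */
static int stage_and_seal(const void *data, size_t len, size_t map_len)
{
	void *w;
	int fd = memfd_create("pin-args", MFD_CLOEXEC | MFD_ALLOW_SEALING);

	if (fd < 0)
		return -1;
	if (len > map_len || ftruncate(fd, (off_t)map_len) < 0)
		goto fail;

	w = mmap(NULL, map_len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (w == MAP_FAILED)
		goto fail;
	memcpy(w, data, len);
	/* F_SEAL_WRITE fails with EBUSY while a writable shared map exists. */
	munmap(w, map_len);

	if (fcntl(fd, F_ADD_SEALS, F_SEAL_WRITE | F_SEAL_SHRINK |
				   F_SEAL_GROW | F_SEAL_SEAL) < 0)
		goto fail;
	return fd;
fail:
	close(fd);
	return -1;
}

/*
 * Overflow-safe test that [ptr, ptr+len) lies inside the pin
 * [pin_start, pin_start+pin_len). Assumes the pin itself does not wrap.
 */
static bool range_in_pin(uint64_t ptr, uint64_t len,
			 uint64_t pin_start, uint64_t pin_len)
{
	if (ptr < pin_start || len > pin_len)
		return false;
	return ptr - pin_start <= pin_len - len;
}
```

After stage_and_seal() returns, write() on the fd and a new writable
MAP_SHARED mapping are both expected to fail with EPERM. The
subtraction form in range_in_pin() avoids computing ptr + len, which
could wrap for a hostile pointer.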

Because the data the kernel acts on lives in an immutable pin, the
target can no longer win the race. execve() is handled as a first-class
case: its pathname is copied from the pin before the old mm is torn
down, and the register-restore is skipped once the program image has
been replaced (detected via self_exec_id).

A redirected syscall is re-validated against the outer filters in the
target's filter chain, so an inner notifier cannot use a redirect to
smuggle a syscall past a policy that an outer filter enforces (e.g. by
redirecting to a blocked unshare()); see patch 4.
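
The intent of the re-validation can be illustrated with a toy model.
Everything here is hypothetical and far simpler than real seccomp BPF:
struct toy_filter, toy_run() and redirect_permitted() are names made up
for this sketch, chain[0] is taken to be the outermost (first
installed) filter, and treating any non-ALLOW verdict from an outer
filter as a refusal is an assumption, not something the cover letter
states.

```c
/* Toy model of outer-filter re-validation; not the kernel's logic. */
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* x86-64 syscall numbers, used only to make the example concrete. */
enum { NR_GETPID = 39, NR_OPENAT = 257, NR_UNSHARE = 272 };

enum verdict { ALLOW, DENY, NOTIFY };

struct toy_filter {
	int denied_nr;		/* syscall this filter blocks, or -1 */
	int notified_nr;	/* syscall sent to its notifier, or -1 */
};

static enum verdict toy_run(const struct toy_filter *f, int nr)
{
	if (nr == f->denied_nr)
		return DENY;
	if (nr == f->notified_nr)
		return NOTIFY;
	return ALLOW;
}

/*
 * inner is the index of the filter whose notifier issued the redirect.
 * The redirected syscall new_nr is permitted only if every filter
 * outer to it (indices below inner) allows new_nr.
 */
static bool redirect_permitted(const struct toy_filter *chain, size_t inner,
			       int new_nr)
{
	for (size_t i = 0; i < inner; i++)
		if (toy_run(&chain[i], new_nr) != ALLOW)
			return false;
	return true;
}
```

With an outer filter that denies unshare() and an inner filter that
notifies on openat(), redirect_permitted(chain, 1, NR_UNSHARE) is false
while redirect_permitted(chain, 1, NR_GETPID) is true.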

sandlock [1], a non-cooperative seccomp sandbox supervisor, will use
this to enforce argument-level policy on uncooperative targets.

[1] https://github.com/multikernel/sandlock

Patch 1 adds the mm plumbing: __do_mmap(), a variant of do_mmap() that
targets a caller-supplied mm (do_mmap() stays a current->mm wrapper, so
no existing caller changes), and vm_mmap_seal_remote(), a high-level
helper for installing the sealed pin. Patch 2 adds PIN_INSTALL, patch 3
adds SEND_REDIRECT, patch 4 adds the outer-filter re-validation, patch 5
adds selftests, and patch 6 documents the ABI.

Changes since v3:
  - Split the single seccomp patch into PIN_INSTALL (patch 2) and
    SEND_REDIRECT (patch 3) for reviewability.
  - New patch 4: re-validate a redirected syscall against the outer
    filters in the stack, closing the bypass Andy described (an inner
    notifier redirecting to a syscall an outer filter blocks, e.g.
    unshare()).
  - Signals: the argument restore now runs before signal/restart
    processing. It is queued as task_work with TWA_RESUME -- not the
    TWA_SIGNAL discussed on-list, which makes signal_pending() true for
    the whole redirected syscall and livelocks an interruptible one.
    TWA_RESUME still runs the restore at the top of get_signal(), before
    the signal frame is built and before any -ERESTART* rewind; on a
    restart the syscall re-traps seccomp and the supervisor is notified
    again. rt_sigreturn is refused (-EOPNOTSUPP).
  - At most one redirect-capable notifier may exist in a filter chain
    (-EBUSY); ordinary notifiers are unconstrained. Syscalls with
    complex signal/restart behaviour (nanosleep, futex(FUTEX_WAIT), ...)
    are out of scope and should not have their arguments redirected.
  - PIN_INSTALL: target_addr == 0 lets the kernel pick a free address in
    the target mm (avoids a racy userspace /proc/<pid>/maps scan), and a
    new offset field lets one memfd back several disjoint pins.
  - New patch 6: Documentation/userspace-api/seccomp_filter.rst.
  - Selftests expanded: install into a fresh post-execve mm, stateless
    churn, outer-filter re-validation, ABI/versioning, and a
    signal-ordering regression test.

Changes since v2:
v3 was a redesign that dropped the v2 SECCOMP_IOCTL_NOTIF_INJECT
approach (an in-kernel reimplementation of a syscall whitelist) in
favour of redirecting the real syscall into a sealed pin, as suggested
by Andy.

Changes since v1:
v2 was a redesign that dropped the v1 SECCOMP_IOCTL_NOTIF_PIN_ARGS

All pinned-memfd and redirect selftests pass.

---
Cong Wang (6):
  mm: add __do_mmap() and vm_mmap_seal_remote()
  seccomp: introduce SECCOMP_IOCTL_NOTIF_PIN_INSTALL
  seccomp: add kernel-installed pinned-memfd redirect
  seccomp: re-validate a redirected syscall against outer filters
  selftests/seccomp: cover non-cooperative pinned-memfd install
  docs/seccomp: document pinned-memfd redirect ioctls

 .../userspace-api/seccomp_filter.rst          |  108 ++
 include/linux/mm.h                            |    3 +
 include/linux/seccomp.h                       |   12 +-
 include/uapi/linux/seccomp.h                  |  126 ++
 kernel/seccomp.c                              |  446 ++++++-
 mm/internal.h                                 |    8 +
 mm/mmap.c                                     |   63 +-
 mm/nommu.c                                    |   12 +-
 mm/util.c                                     |   62 +
 mm/vma.c                                      |   35 +-
 mm/vma.h                                      |    6 +-
 tools/testing/selftests/seccomp/seccomp_bpf.c | 1027 +++++++++++++++++
 12 files changed, 1874 insertions(+), 34 deletions(-)


base-commit: ab9de95c9cf952332ab79453b4b5d1bfca8e514f
--
2.43.0