[RFC PATCH 0/3] seccomp: SECCOMP_IOCTL_NOTIF_PIN_ARGS for race-free unotify
From: Cong Wang
Date: Sun May 03 2026 - 21:12:14 EST
This RFC introduces SECCOMP_IOCTL_NOTIF_PIN_ARGS, a new ioctl on the
seccomp user-notification listener that lets an unprivileged supervisor
atomically snapshot pointer-arg payloads from a trapped task and bind
those snapshots to the task's resumed syscall body. It closes the
documented TOCTOU race that today makes content-aware policy on
SECCOMP_USER_NOTIF_FLAG_CONTINUE unsafe for unprivileged supervisors.
Posting as RFC because the UAPI shape, the consumption-hook placement,
and the v1 vs v2 cut are all design choices that benefit from review
before a non-RFC submission.
## Motivation
seccomp_unotify(2) lets a supervisor inspect a trapped task's syscall
arguments and then deny, allow, or CONTINUE the syscall. CONTINUE
runs the syscall body in the trapped task, which fetches every
pointer argument from user memory again. A sibling thread or CLONE_VM peer
in the trapped task's address space can mutate that memory between
the supervisor's process_vm_readv() and the kernel's re-read, turning
any policy that examined the argument into a check on already-stale
bytes. The seccomp_unotify(2) man page documents this race explicitly.
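
The window can be modeled in ordinary userspace C with no seccomp involved. The following is a minimal sketch, not code from this series: `check_then_use()`, `policy_allows()` and `swap_to_shadow()` are hypothetical names, the `mutate` callback stands in for the sibling thread, and the two copies stand in for the supervisor's read and the kernel's later fetch.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define DEMO_PATH_MAX 64

/* Hypothetical stand-in for a sibling thread that rewrites the argument. */
typedef void (*mutator_fn)(char *user_buf);

/* Illustrative policy: only paths starting with "/allowed/" pass. */
static bool policy_allows(const char *path)
{
    return strncmp(path, "/allowed/", 9) == 0;
}

/*
 * Models check-then-use on shared memory. Returns true when the bytes that
 * were finally acted on still satisfy the policy that approved them.
 * use_snapshot == false: the "kernel" re-reads user_buf (today's CONTINUE).
 * use_snapshot == true:  the "kernel" consumes the copy that was checked.
 */
static bool check_then_use(char *user_buf, mutator_fn mutate, bool use_snapshot)
{
    char inspected[DEMO_PATH_MAX];
    char acted_on[DEMO_PATH_MAX];

    /* Supervisor reads the argument and makes its decision. */
    strncpy(inspected, user_buf, sizeof(inspected) - 1);
    inspected[sizeof(inspected) - 1] = '\0';
    if (!policy_allows(inspected))
        return true; /* denied: nothing is acted on */

    /* Window between the decision and the syscall body. */
    mutate(user_buf);

    /* Syscall body fetches its argument. */
    strncpy(acted_on, use_snapshot ? inspected : user_buf,
            sizeof(acted_on) - 1);
    acted_on[sizeof(acted_on) - 1] = '\0';

    return policy_allows(acted_on);
}

/* Example mutator: swap the approved path for a forbidden one. */
static void swap_to_shadow(char *user_buf)
{
    strcpy(user_buf, "/etc/shadow");
}
```

With `use_snapshot` false the approved "/allowed/..." path is replaced before use and the policy no longer holds; with it true the mutation has no effect on what is acted on.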
There is no race-free workaround for unprivileged supervisors today:
- ptrace and /proc/pid/mem are not available to them;
- process_vm_readv into a userspace buffer doesn't help, because
  the kernel will re-read user memory regardless on CONTINUE;
- SECCOMP_IOCTL_NOTIF_ADDFD only solves the fd-substitution case,
  not the content-of-pointer-arg case.
The result is that unprivileged seccomp supervisors -- which are the
target audience of seccomp_unotify(2) in the first place -- cannot
implement content-aware allow policies. They can only deny or
unconditionally allow. Anything that depends on the actual contents
of a path, sockaddr, argv, or write buffer is unsafe.
## Concrete user: Sandlock
Sandlock <https://github.com/multikernel/sandlock> is a process-based
unprivileged sandbox for AI agents, built on seccomp_unotify(2). At
sandbox setup the agent process installs a seccomp filter with
SECCOMP_FILTER_FLAG_NEW_LISTENER and hands the listener fd to a
Sandlock supervisor; the supervisor then drives each filtered
syscall via SECCOMP_IOCTL_NOTIF_RECV and replies with either an
injected errno or SECCOMP_USER_NOTIF_FLAG_CONTINUE. This is what
lets Sandlock confine an AI-agent process (coding agents, tool-using
agents) to the filesystem and network surface its operator
authorized, without root, a container runtime, or virtualization.
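
For readers less familiar with the existing reply path, here is a small sketch of how a supervisor might fill in its answer to one notification. It uses only the existing UAPI types from <linux/seccomp.h>; `build_reply()` is a hypothetical helper, and the choice of EACCES for the deny case is an assumption for illustration, not something Sandlock is known to do.

```c
#include <errno.h>
#include <linux/seccomp.h>
#include <stdbool.h>
#include <string.h>

/*
 * Fill in the reply for one notification (hypothetical helper).
 * With SECCOMP_USER_NOTIF_FLAG_CONTINUE set, error and val must be zero.
 * For a deny, error carries a negative errno and flags stays zero.
 */
static void build_reply(const struct seccomp_notif *req, bool allow,
                        struct seccomp_notif_resp *resp)
{
    memset(resp, 0, sizeof(*resp));
    resp->id = req->id; /* echo the notification cookie */
    if (allow)
        resp->flags = SECCOMP_USER_NOTIF_FLAG_CONTINUE;
    else
        resp->error = -EACCES; /* illustrative errno choice */
}
```

The filled-in structure would then be handed back on the listener fd with ioctl(listener, SECCOMP_IOCTL_NOTIF_SEND, &resp).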
AI-agent workloads routinely fork helpers, exec compilers and
language runtimes, and follow user- or model-controlled paths, so
every content-aware policy decision the supervisor makes depends on
the contents of syscall pointer arguments -- "allow open(path) iff
path is below /allowed/", "allow bind(addr) iff addr->sun_path is in
the permitlist", "allow execve(filename, argv) iff argv[0] is one of
N approved binaries". Each of these is implemented as a CONTINUE
reply, and CONTINUE re-fetches the pointer args from the trapped
task's user memory -- which is where the documented TOCTOU race
bites.
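
As an illustration of the first policy quoted above, a content-aware check can be as small as the sketch below. `path_is_below()` is a hypothetical helper, not Sandlock code. It is a purely lexical check that refuses any ".." component and does not resolve symlinks, so it only judges the bytes the supervisor was shown, which is exactly why those bytes must be the ones the kernel later acts on.

```c
#include <stdbool.h>
#include <string.h>

/*
 * Lexical check (hypothetical helper): true iff path names something
 * strictly below dir. dir is trusted and given without a trailing slash,
 * e.g. "/allowed". Any ".." component is rejected rather than resolved.
 */
static bool path_is_below(const char *path, const char *dir)
{
    size_t n = strlen(dir);
    bool seen = false;
    const char *p;

    /* Require the prefix to end on a component boundary. */
    if (n == 0 || strncmp(path, dir, n) != 0 || path[n] != '/')
        return false;

    p = path + n;
    while (*p) {
        const char *start;
        size_t len;

        while (*p == '/')
            p++;
        start = p;
        while (*p && *p != '/')
            p++;
        len = (size_t)(p - start);

        if (len == 2 && start[0] == '.' && start[1] == '.')
            return false; /* refuse to reason about ".." */
        if (len > 0 && !(len == 1 && start[0] == '.'))
            seen = true;  /* a real component below dir */
    }
    return seen;
}
```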
Today Sandlock has to either run privileged (defeating the
unprivileged-sandbox model) or accept the TOCTOU race (defeating the
guarantee against an adversarial agent that spawns a sibling thread
to mutate the buffer between the supervisor's inspection and the
kernel's re-read on CONTINUE). PIN_ARGS gives it the missing
primitive without either compromise.
This is a general primitive, not a Sandlock-specific feature, but
having a concrete user with a real and growing threat model (LLM
agents executing untrusted or partially-trusted code in-process)
should help anchor the design discussion.
## Approach
PIN_ARGS is a per-notification ioctl. The supervisor describes which
register slots to snapshot and what shape each one is. The kernel
walks the trapped task's mm under the existing remote-mm primitives
(access_remote_vm, copy_remote_vm_str), copies the bytes into kernel-
owned buffers, and stamps the snapshot onto the trapped task. On
SECCOMP_USER_NOTIF_FLAG_CONTINUE_PINNED, the kernel's syscall fetch
points consume from the kernel buffer instead of re-reading user
memory.
Three v1 shapes:
- SECCOMP_PIN_FIXED (sockaddr, single-buffer read/write)
- SECCOMP_PIN_CSTRING (paths)
- SECCOMP_PIN_CSTRING_ARRAY (argv, envp)
Each per-arg copy is bounded by max_bytes; total cumulative bytes per
request are bounded at a hardcoded 1 MiB. Allocations use
GFP_KERNEL_ACCOUNT so the trapped task's memcg pays the cost.
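
To make the bounds concrete, the sketch below shows one way a request could be validated against them. It is not the UAPI from patch 1: every `sketch_` name, the descriptor layout, and the choice to budget against the sum of max_bytes (rather than bytes actually copied) are assumptions made for illustration. Only the six syscall argument slots and the 1 MiB figure come from existing kernel convention and the text above.

```c
#include <stdbool.h>
#include <stddef.h>

#define SKETCH_PIN_TOTAL_CAP (1024u * 1024u) /* 1 MiB cumulative cap */
#define SKETCH_SYSCALL_NARGS 6               /* syscall argument slots */

/* Hypothetical mirror of the three v1 shapes. */
enum sketch_pin_shape {
    SKETCH_FIXED,
    SKETCH_CSTRING,
    SKETCH_CSTRING_ARRAY,
};

/* Hypothetical per-argument descriptor; the real layout is in patch 1. */
struct sketch_pin_desc {
    unsigned int arg_index;
    enum sketch_pin_shape shape;
    size_t max_bytes;
};

/* Returns true when the request respects the per-arg and total bounds. */
static bool sketch_pin_request_ok(const struct sketch_pin_desc *d, size_t n)
{
    unsigned int seen = 0;
    size_t total = 0;

    for (size_t i = 0; i < n; i++) {
        if (d[i].arg_index >= SKETCH_SYSCALL_NARGS)
            return false;                 /* no such register slot */
        if (seen & (1u << d[i].arg_index))
            return false;                 /* slot described twice */
        seen |= 1u << d[i].arg_index;

        if (d[i].max_bytes == 0 || d[i].max_bytes > SKETCH_PIN_TOTAL_CAP)
            return false;                 /* per-arg bound */

        /* Cannot overflow: at most six entries of at most 1 MiB each. */
        total += d[i].max_bytes;
        if (total > SKETCH_PIN_TOTAL_CAP)
            return false;                 /* cumulative bound */
    }
    return true;
}
```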
The pin is one-shot: cleared on the trapped task's next return-to-
userspace via task_work, with fallback paths for task exit, listener
release, and explicit discard (CONTINUE without CONTINUE_PINNED).
The syscall number is captured at pin time and verified at
consumption, so a signal-handler-issued syscall during -ERESTART*
resolution will not consume the pin.
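
The lifetime rule can be pictured with a small userspace model. This is a sketch under stated assumptions, not the kernel code: `struct sketch_pin` and its helpers are hypothetical, lookup is modeled as non-destructive so that one syscall can fetch several pinned arguments, and `sketch_pin_clear()` stands in for the task_work that runs on return to userspace.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-task pin state. */
struct sketch_pin {
    bool armed;
    int syscall_nr;       /* captured when the pin was created */
    const char *snapshot; /* stands in for the kernel-owned buffers */
};

/* Attach a snapshot for the syscall that trapped. */
static void sketch_pin_arm(struct sketch_pin *pin, int nr, const char *snap)
{
    pin->armed = true;
    pin->syscall_nr = nr;
    pin->snapshot = snap;
}

/*
 * Fetch-site lookup: hand out the snapshot only to the syscall it was
 * taken for. A different syscall number gets NULL (the caller would read
 * user memory as usual) and leaves the pin untouched.
 */
static const char *sketch_pin_lookup(const struct sketch_pin *pin, int nr)
{
    if (!pin->armed || pin->syscall_nr != nr)
        return NULL;
    return pin->snapshot;
}

/* Models the one-shot clear on the next return to userspace. */
static void sketch_pin_clear(struct sketch_pin *pin)
{
    pin->armed = false;
    pin->snapshot = NULL;
}
```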
Pin orchestration uses a three-phase lock dance: validate the notif
and snapshot register args under filter->notify_lock, walk the
trapped task's mm without locks, then re-validate and attach the
snapshot. The walker uses primitives the kernel already uses for
arg fetch (copy_remote_vm_str, getname_kernel, copy_string_kernel,
iov_iter_kvec), so consumption sites are minimally invasive.
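
The three phases can be sketched in userspace as follows. Everything here is an illustrative assumption rather than the code in patch 1: the `sketch_` names, the error values, and the `during_walk` hook, which exists only so the unlocked window can be exercised. `sketch_lock()` and `sketch_unlock()` are placeholders for taking and dropping filter->notify_lock, and the `remote` string stands in for the trapped task's memory.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

enum sketch_state { SKETCH_NOTIF_SENT, SKETCH_NOTIF_REPLIED };

struct sketch_notif {
    unsigned long long id;
    enum sketch_state state;
    bool has_pin;
    char pin[64];
};

typedef void (*sketch_hook)(struct sketch_notif *n);

static bool sketch_locked; /* placeholder for filter->notify_lock */
static void sketch_lock(void)   { sketch_locked = true; }
static void sketch_unlock(void) { sketch_locked = false; }

/* A notification is pinnable while it is awaiting a reply and unpinned. */
static bool sketch_pinnable(const struct sketch_notif *n,
                            unsigned long long id)
{
    return n->id == id && n->state == SKETCH_NOTIF_SENT && !n->has_pin;
}

static int sketch_pin_attach(struct sketch_notif *n, unsigned long long id,
                             const char *remote, sketch_hook during_walk)
{
    char tmp[sizeof(n->pin)];
    bool ok;

    /* Phase 1 (locked): validate the notification. */
    sketch_lock();
    ok = sketch_pinnable(n, id);
    sketch_unlock();
    if (!ok)
        return -EINVAL;

    /* Phase 2 (unlocked): copy from the remote address space. */
    strncpy(tmp, remote, sizeof(tmp) - 1);
    tmp[sizeof(tmp) - 1] = '\0';
    if (during_walk)
        during_walk(n); /* anything may change while unlocked */

    /* Phase 3 (locked): re-validate, then attach the snapshot. */
    sketch_lock();
    ok = sketch_pinnable(n, id);
    if (ok) {
        memcpy(n->pin, tmp, sizeof(tmp));
        n->has_pin = true;
    }
    sketch_unlock();
    return ok ? 0 : -ENOENT;
}

/* Example hook: the notification is answered during the walk. */
static void sketch_mark_replied(struct sketch_notif *n)
{
    n->state = SKETCH_NOTIF_REPLIED;
}
```

The point of the third phase is that a snapshot taken during the unlocked walk is dropped, not attached, if the notification was answered or went away in the meantime.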
## Why copy and not page-pinning
Page-level FOLL_PIN doesn't solve content TOCTOU: the trapped task
(or its CLONE_VM peer) is the owner of the mm and can write through
the same mapping. There is no kernel primitive for "freeze the
contents of these user pages." Copying at decision time is the only
way to guarantee the bytes the supervisor inspected equal the bytes
the kernel acts on.
The kernel already does this copy in syscall bodies today --
getname(), copy_strings(), move_addr_to_kernel(), copy_from_iter()
for ITER_UBUF. PIN_ARGS shifts when that copy happens (at supervisor
decision time) and re-points the syscall fetch points at the
snapshot. Net new copies per syscall: zero.
## Why unprivileged
PIN_ARGS is gated by listener-fd possession, which is itself a
capability scoped by file-descriptor ownership and SCM_RIGHTS
passing. The supervisor already has equivalent remote-mm read access
via process_vm_readv() (subject to the same ptrace_may_access
checks). NO_NEW_PRIVS, required for unprivileged seccomp filter
installation, blocks the obvious escalation through execve of a
set-user-ID or file-capability binary. The DoS surface
is bounded by the 1 MiB per-request cap, the one-shot lifetime, and
at-most-one-pin-per-trapped-task, with memcg accounting on top.
Requiring CAP_SYS_PTRACE would render PIN_ARGS useless for its only
real audience; privileged supervisors already have ptrace and
/proc/pid/mem.
## What's NOT covered in v1
- Vector I/O (readv/writev) -- needs per-iovec pin descriptors;
  deliberately deferred to v2.
- Nested-pointer payloads (sendmsg msghdr, futex_waitv) -- also
  deferred to v2.
- Consumption hooks beyond getname_flags, move_addr_to_kernel,
  copy_strings, and import_ubuf. Other syscall fetch sites that
  re-read user memory still race; v1 covers the four most common
  cases (path, sockaddr, argv/envp, single-buffer read/write),
  which together cover the bulk of practical unprivileged-sandbox
  policies.
## Patches
[PATCH 1/3] seccomp: kernel implementation
    (UAPI, walker, orchestrator, four consumption hooks,
    one-shot lifecycle)
[PATCH 2/3] selftests/seccomp: end-to-end coverage
    (10 cases across all three shapes + lifecycle)
[PATCH 3/3] Documentation: seccomp_filter.rst
    ("Pinned arguments" section)
## Testing
The selftest binary covers all three v1 shapes against real syscalls
(bind, openat, execve, write), plus negative paths (CONTINUE without
PINNED, double pin, mismatched flags) and the lifecycle (post-
syscall clear, SIGKILL teardown). All ten cases pass on x86_64.
Cong Wang (3):
  seccomp: add SECCOMP_IOCTL_NOTIF_PIN_ARGS to close the unotify TOCTOU
    race
  selftests/seccomp: add seccomp_pin_args end-to-end coverage
  Documentation: seccomp: document SECCOMP_IOCTL_NOTIF_PIN_ARGS
.../userspace-api/seccomp_filter.rst | 76 ++
MAINTAINERS | 2 +
fs/exec.c | 63 ++
fs/namei.c | 19 +
fs/read_write.c | 8 +-
include/linux/mm.h | 2 +-
include/linux/seccomp.h | 35 +
include/linux/seccomp_types.h | 33 +
include/uapi/linux/seccomp.h | 73 ++
kernel/Makefile | 1 +
kernel/exit.c | 1 +
kernel/fork.c | 5 +
kernel/seccomp.c | 189 +++-
kernel/seccomp_pin.c | 453 +++++++++
kernel/seccomp_pin.h | 109 +++
lib/iov_iter.c | 22 +
mm/memory.c | 4 +-
mm/nommu.c | 4 +-
net/socket.c | 16 +
tools/testing/selftests/seccomp/.gitignore | 1 +
tools/testing/selftests/seccomp/Makefile | 2 +-
.../selftests/seccomp/seccomp_pin_args.c | 857 ++++++++++++++++++
22 files changed, 1961 insertions(+), 14 deletions(-)
create mode 100644 kernel/seccomp_pin.c
create mode 100644 kernel/seccomp_pin.h
create mode 100644 tools/testing/selftests/seccomp/seccomp_pin_args.c
--
2.43.0