[PATCH v4 6/6] docs/seccomp: document pinned-memfd redirect ioctls
From: Cong Wang
Date: Fri Jun 26 2026 - 21:23:28 EST
From: Cong Wang <cwang@xxxxxxxxxxxxxx>
Document SECCOMP_IOCTL_NOTIF_PIN_INSTALL and
SECCOMP_IOCTL_NOTIF_SEND_REDIRECT in the userspace API guide: the
SECCOMP_FILTER_FLAG_REDIRECT opt-in and the single-redirector
restriction, the two response structures, and how the pair closes the
user-notification TOCTOU for non-cooperative fork+execve sandboxes.
Also spell out the scope the implementation deliberately enforces or
relies on: read-only input pointers only, same-syscall-number only
(rt_sigreturn is refused), the per-interruption re-notification of
restartable syscalls and the restart-block behaviour, the clone()/fork()
child caveat, and the ptrace syscall-stop semantics.
Assisted-by: Claude:claude-opus-4.8
Signed-off-by: Cong Wang <cwang@xxxxxxxxxxxxxx>
---
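Note, not part of the commit message: a minimal userspace sketch of how a
supervisor might fill and sanity-check struct seccomp_notif_resp_redirect
before issuing SECCOMP_IOCTL_NOTIF_SEND_REDIRECT, following the rules in the
documentation added below. The structure is mirrored locally because the
uapi definition comes from the patched headers. The helper names
(redirect_set_ptr(), redirect_set_scalar(), redirect_resp_check()) are
hypothetical, the -EINVAL return value is the sketch's own convention rather
than the kernel's, and rejecting mask bits above bit 5 is an assumption based
on args[] having six entries.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Local mirror of the documented layout; real code would use the
 * definition from the patched <linux/seccomp.h> instead. */
struct seccomp_notif_resp_redirect {
	uint64_t id;
	uint32_t flags;      /* SECCOMP_REDIRECT_FLAG_CONTINUE must be set */
	uint32_t args_mask;  /* which arg registers to replace */
	uint32_t ptr_mask;   /* which of those are pointers into a pin */
	uint32_t memfd;      /* the pin's backing memfd */
	uint64_t args[6];    /* replacement values */
	uint64_t ptr_len[6]; /* validated access length for each pointer arg */
};

/* 8 + 4*4 + 6*8 + 6*8 bytes: the layout has no implicit padding. */
static_assert(sizeof(struct seccomp_notif_resp_redirect) == 120,
	      "unexpected padding in local mirror");

/* Hypothetical helper: redirect argument idx to a pointer inside a pin. */
static void redirect_set_ptr(struct seccomp_notif_resp_redirect *r,
			     unsigned int idx, uint64_t addr, uint64_t len)
{
	r->args_mask |= 1u << idx;
	r->ptr_mask |= 1u << idx;
	r->args[idx] = addr;
	r->ptr_len[idx] = len;
}

/* Hypothetical helper: replace argument idx with a scalar, written verbatim. */
static void redirect_set_scalar(struct seccomp_notif_resp_redirect *r,
				unsigned int idx, uint64_t value)
{
	r->args_mask |= 1u << idx;
	r->args[idx] = value;
	r->ptr_len[idx] = 0;
}

/* Hypothetical pre-flight check of the documented mask and length rules.
 * Returns 0 if consistent, -EINVAL (sketch convention) otherwise. It does
 * not check flags, and it cannot check that a pointer lies within a pin. */
static int redirect_resp_check(const struct seccomp_notif_resp_redirect *r)
{
	if (r->args_mask & ~0x3fu)        /* assumption: only six arguments */
		return -EINVAL;
	if (r->ptr_mask & ~r->args_mask)  /* ptr_mask is a subset of args_mask */
		return -EINVAL;
	for (int i = 0; i < 6; i++) {
		int is_ptr = (r->ptr_mask >> i) & 1;

		/* ptr_len[i] is non-zero for pointer bits, 0 otherwise. */
		if (is_ptr != (r->ptr_len[i] != 0))
			return -EINVAL;
		/* Defensive: the end of the range must not wrap around. */
		if (is_ptr && r->args[i] + r->ptr_len[i] < r->args[i])
			return -EINVAL;
	}
	return 0;
}
```

A supervisor redirecting a pathname pointer in argument 0 and a length in
argument 2 would call redirect_set_ptr(&r, 0, pin_addr, len) and
redirect_set_scalar(&r, 2, len), then run redirect_resp_check(&r) before the
ioctl. Setting r.flags is left out because the value of
SECCOMP_REDIRECT_FLAG_CONTINUE is defined only in the patched headers.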
.../userspace-api/seccomp_filter.rst | 108 ++++++++++++++++++
1 file changed, 108 insertions(+)
diff --git a/Documentation/userspace-api/seccomp_filter.rst b/Documentation/userspace-api/seccomp_filter.rst
index cff0fa7f3175..3a8ac7775da6 100644
--- a/Documentation/userspace-api/seccomp_filter.rst
+++ b/Documentation/userspace-api/seccomp_filter.rst
@@ -289,6 +289,114 @@ above in this document: all arguments being read from the tracee's memory
should be read into the tracer's memory before any policy decisions are made.
This allows for an atomic decision on syscall arguments.
+Non-cooperative pinned-memfd redirect
+=====================================
+
+The TOCTOU described above means ``SECCOMP_USER_NOTIF_FLAG_CONTINUE`` cannot
+enforce a policy on pointer arguments: after the supervisor inspects the
+target's memory and lets the syscall continue, the target (or a thread sharing
+its address space) can rewrite that memory before the kernel reads it. The
+cooperative workaround, in which the target maps a shared buffer with
+``mmap()`` and seals it with ``mseal()``, is unavailable in the fork+execve
+sandbox model, where the supervisor confines a binary it did not write.
+
+Two ioctls let the supervisor close this race without target cooperation. The
+redirect step (below) requires a listener created with
+``SECCOMP_FILTER_FLAG_REDIRECT`` (in addition to
+``SECCOMP_FILTER_FLAG_NEW_LISTENER``). Because a redirect rewrites another
+task's registers, at most one such listener may exist in a task's filter
+chain; an attempt to install a second fails with ``-EBUSY``:
+
+.. code-block:: c
+
+    fd = seccomp(SECCOMP_SET_MODE_FILTER,
+                 SECCOMP_FILTER_FLAG_NEW_LISTENER | SECCOMP_FILTER_FLAG_REDIRECT,
+                 &prog);
+
+``ioctl(SECCOMP_IOCTL_NOTIF_PIN_INSTALL)`` installs a sealed mapping of a
+supervisor-owned ``memfd`` directly into the trapped task's address space:
+
+.. code-block:: c
+
+    struct seccomp_notif_pin_install {
+        __u64 id;
+        __u32 flags;        /* reserved, must be 0 */
+        __u32 memfd;
+        __u64 target_addr;
+        __u64 size;
+        __u64 offset;       /* page-aligned offset into memfd */
+    };
+
+``id`` names an active notification (the trapped task to install into).
+``target_addr``, ``size`` and ``offset`` must be page-aligned; ``offset``
+selects where in ``memfd`` the mapping starts, so one memfd can back several
+pins. If ``target_addr`` is ``0`` the kernel picks a free address and writes
+it back; otherwise an existing mapping there yields ``-EEXIST``. The pin is
+read-only and sealed: the target and its threads cannot unmap, move, reprotect
+or overwrite it. It lasts until the target calls ``execve()`` or exits.
+
+``memfd`` must be write-sealed (``F_SEAL_WRITE`` or ``F_SEAL_FUTURE_WRITE``)
+or the ioctl returns ``-EINVAL``; otherwise the target could rewrite the pin's
+bytes through a separate writable handle to the same memfd.
+``F_SEAL_FUTURE_WRITE`` still lets the supervisor update the contents through
+its own mapping made before the seal.
+
+``ioctl(SECCOMP_IOCTL_NOTIF_SEND_REDIRECT)`` then resumes the trapped syscall
+like ``SECCOMP_USER_NOTIF_FLAG_CONTINUE``, but with selected argument
+registers replaced:
+
+.. code-block:: c
+
+    struct seccomp_notif_resp_redirect {
+        __u64 id;
+        __u32 flags;        /* SECCOMP_REDIRECT_FLAG_CONTINUE must be set */
+        __u32 args_mask;    /* which arg registers to replace */
+        __u32 ptr_mask;     /* which of those are pointers into a pin */
+        __u32 memfd;        /* the pin's backing memfd */
+        __u64 args[6];      /* replacement values */
+        __u64 ptr_len[6];   /* validated access length for each pointer arg */
+    };
+
+Each set bit ``i`` in ``ptr_mask`` (a subset of ``args_mask``) marks
+``args[i]`` as a pointer; the range ``[args[i], args[i] + ptr_len[i])`` must
+lie within a single read-only pin of ``memfd`` in the target, or the ioctl
+returns ``-EFAULT``. ``ptr_len[i]`` must be non-zero for those bits and ``0``
+otherwise. Bits in ``args_mask`` but not ``ptr_mask`` are scalar replacements
+written verbatim, e.g. to set the length register that goes with a redirected
+pointer. The original registers are restored at syscall exit, so the
+substitution is invisible to the target and the TOCTOU is closed.
+
+Scope and limitations
+---------------------
+
+The redirect mechanism is deliberately narrow and is *not* a general syscall
+rewriting facility:
+
+- **Read-only input pointers only.** A pin is read-only, so only an argument
+  the syscall *reads* (a pathname, a ``sockaddr``) may be redirected into it.
+  Aiming an output or in/out argument at a pin makes the syscall fail with
+  ``-EFAULT`` when it writes back.
+
+- **Same syscall only.** A redirect replaces arguments, never the syscall
+  number. ``rt_sigreturn()`` (and its compat variant) cannot be redirected;
+  an attempt to do so fails with ``-EOPNOTSUPP``.
+
+- **Signals and restarts.** The redirected syscall really runs, so it can be
+  interrupted and restarted. On a restart the original arguments are restored
+  and the syscall re-traps, so the supervisor is notified again and must answer
+  consistently. Syscalls the kernel restarts without re-trapping (e.g.
+  ``nanosleep()``, ``futex(FUTEX_WAIT)``) keep the substituted arguments. This
+  is safe for read-only inputs, but is a reason not to redirect arguments of
+  syscalls that block or wait.
+
+- **clone()/fork().** A child keeps the substituted argument registers (the
+  restore is not inherited). Redirect ``clone()``/``fork()`` arguments only if
+  that is acceptable.
+
+- **ptrace.** A tracer sees the substituted arguments at the syscall-exit stop;
+  they are restored before the task resumes, so a ``PTRACE_SETREGS`` of a
+  substituted register at that stop is overwritten.
+
Sysctls
=======
--
2.43.0