[PATCH v3 0/4] Add seccomp notifier ioctl that enables adding fds

From: Sargun Dhillon
Date: Tue Jun 02 2020 - 21:13:38 EST

This adds the capability for seccomp notifier listeners to add file
descriptors in response to a seccomp notification. This is useful for
syscalls in which the previous capabilities were not sufficient. The
current mechanism works well for syscalls that either have side effects
that are system / namespace wide (mount), or that operate on a specific
set of registers (reboot, mknod), and don't require dereferencing pointers.
The problem with derefencing pointers in a supervisor is that it leaves
us vulnerable to TOC-TOU [1] style attacks. For syscalls that had a direct
effect on file descriptors pidfd_getfd was added, allowing for those file
descriptors to be directly operated upon by the supervisor [2].

Unfortunately, this leaves system calls which return file descriptors
out of the picture. These are fairly common syscalls, such as openat,
socket, and perf_event_open that return file descriptors, and have
arguments that are pointers. These require that the supervisor is able to
verify the arguments, make the call on behalf of the process on hand,
and pass back the resulting file descriptor. This is where addfd comes
into play.

There is an additional flag that allows you to "set" an FD, rather than
add it with an arbitrary number. This has dup2 style semantics, and
installs the new file at that file descriptor, and atomically closes
the old one if it existed. This is useful for a particular use case
that we have, in which we want to swap out AF_INET sockets for AF_UNIX,
AF_INET6, and sockets in another namespace when doing "upconversion".

My specific usecase at Netflix is to enable our IPv4-IPv6 transition
mechanism, in which we our namespaces have no real IPv4 reachability,
and when it comes time to do a connect(2), we get a socket from a
namespace with global IPv4 reachability.

In addition, we intend to use it for our servicemesh, and where our
service mesh needs to intercept traffic ingress traffic, the addfd
capability will act as a mechanism to do socket activation.

Addfd is not implemented as a separate syscall, a la pidfd_getfd, as
VFS makes some optimizations in regards to the fdtable, and assumes
that they are not modified by external processes. Although a mechanism
that scheduled something in the context of the task could work, it is
somewhat simpler to do it in the context of the ioctl as we control
the task while in kernel. In addition there are not obvious needs
for this beyond seccomp notifier.

This mechanism leaves a potential issue that if the manager is
interrupted while injecting FDs, the child process will be left with
leaked / dangling FDs. This may lead to undefined behaviour. A
mechanism to work around this is to extend the structure and add a
"rollback" mechanism for FDs to be closed if things fail.

This introduces a new helper -- file_receive, which is responsible
for moving fds across processes. The helper replaces code in
SCM_RIGHTS. In SCM_RIGHTS compat codepath there was a bug that
resulted in this not being set all. This fixes that bug, and should
be cherry-picked into long-term. The file_receive change should
probably go into stable. The file_receive code also replaced the
receive fd logic in pidfd_getfd. This is somewhat contrary to my
original view[5], but I think it is best for the principal of
least surprise to adopt it. This should be cherry-picked into stable.

I tested this on amd64 with the x86-64 and x32 ABIs.

Given there is no testing infrastructure for cgroup v1, I opted to
forgo adding new tests there as it is considered deprecated.

Changes since v2:
* Introducion of the file_receive helper which hoists out logic to
manipulate file descriptors outside of seccomp.c to file.c
* Small fix that manipulated the socket's cgroup even when the
receive failed
* seccomp struct layout
Changes since v1:
* find_notification has been cleaned up slightly, and it replaces a use
case in send as well.
* Fixes ref counting rules to get / release references in the ioctl side,
rather than the seccomp notifier side [3].
* Removes the optional move flag, and opts into SCM_RIGHTS
* Rearranges the seccomp_notif_addfd datastructure for greater user
clarity [4]. In order to avoid unnamed padding it makes size u64,
which is a little bit of a waste of space.
* Changes error codes to return ESRCH upon the process going away on
notification, and EINPROGRESS is the notification is in an unexpected
state (and added tests for this behaviour)

[1]: https://lore.kernel.org/lkml/20190918084833.9369-2-christian.brauner@xxxxxxxxxx/
[2]: https://lore.kernel.org/lkml/20200107175927.4558-1-sargun@xxxxxxxxx/
[3]: https://lore.kernel.org/lkml/20200525000537.GB23230@xxxxxxxxxxxxxxxxxx/
[4]: https://lore.kernel.org/lkml/20200525135036.vp2nmmx42y7dfznf@wittgenstein/
[5]: https://lore.kernel.org/lkml/20200107175927.4558-1-sargun@xxxxxxxxx/

Sargun Dhillon (4):
fs, net: Standardize on file_receive helper to move fds across
pid: Use file_receive helper to copy FDs
seccomp: Introduce addfd ioctl to seccomp user notifier
selftests/seccomp: Test SECCOMP_IOCTL_NOTIF_ADDFD

fs/file.c | 56 ++++++
include/linux/file.h | 2 +
include/uapi/linux/seccomp.h | 25 +++
kernel/pid.c | 20 +-
kernel/seccomp.c | 184 +++++++++++++++++-
net/compat.c | 10 +-
net/core/scm.c | 14 +-
tools/testing/selftests/seccomp/seccomp_bpf.c | 183 +++++++++++++++++
8 files changed, 467 insertions(+), 27 deletions(-)