[GIT PULL] close_range for v5.9

From: Christian Brauner
Date: Tue Aug 04 2020 - 07:34:17 EST

Hi Linus,

Here's the pull request for close_range() for v5.9.

/* Summary */
This adds the close_range() syscall. It allows to efficiently close a range of
file descriptors up to all file descriptors of a calling task.

This is coordinated with the FreeBSD folks which have copied our version of
this syscall and in the meantime have already merged it in April 2019:

The syscall originally came up in a discussion around the new mount API and
making new file descriptor types cloexec by default. During this discussion, Al
suggested the close_range() syscall.

First, it helps to close all file descriptors of an exec()ing task. This
can be done safely via (quoting Al's example from [1] verbatim):

/* that exec is sensitive */
/* we don't want anything past stderr here */
close_range(3, ~0U);

The code snippet above is one way of working around the problem that file
descriptors are not cloexec by default. This is aggravated by the fact that
we can't just switch them over without massively regressing userspace. For
a whole class of programs having an in-kernel method of closing all file
descriptors is very helpful (e.g. demons, service managers, programming
language standard libraries, container managers etc.).

Second, it allows userspace to avoid implementing closing all file descriptors
by parsing through /proc/<pid>/fd/* and calling close() on each file
descriptor and other hacks. From looking at various large(ish) userspace code
bases this or similar patterns are very common in service managers, container
runtimes, and programming language runtimes/standard libraries such as Python
or Rust. In addition, the syscall will also work for tasks that do not have
procfs mounted and on kernels that do not have procfs support compiled in. In
such situations the only way to make sure that all file descriptors are closed
is to call close() on each file descriptor up to UINT_MAX or RLIMIT_NOFILE,
OPEN_MAX trickery.

Based on Linus suggestion close_range() also comes with a new flag
CLOSE_RANGE_UNSHARE to more elegantly handle file descriptor dropping right
before exec. This would usually be expressed in the sequence:
close_range(3, ~0U);
as pointed out by Linus it might be desirable to have this be a part of
close_range() itself under a new flag CLOSE_RANGE_UNSHARE which gets especially
handy when we're closing all file descriptors above a certain threshold.

Test-suite as always included.

/* Testing */
All patches are based on v5.8-rc1 and have been sitting in linux-next. No build
failures or warnings were observed. All old and new tests are passing.

/* Conflicts */
There was no merge conflict doing a test-merge with current mainline at
bcf876870b95 ("Linux 5.8"). However, I expect that there's a chance of
conflicts as soon as any other syscall lands. close_range() might be lucky
though since it had its syscall number 436 effectively reserved since 2019.

There's a trivial merge conflict with the seccomp tree reported by Stephen at
https://lore.kernel.org/lkml/20200710161806.28d8c856@xxxxxxxxxxxxxxxx . The
relevant seccomp pr is at
Stephen's suggested fix seems correct:

diff --cc fs/file.c
index 4fb111735d1d,340bc9569f9d..000000000000
--- a/fs/file.c
+++ b/fs/file.c
@@@ -18,7 -19,7 +19,8 @@@
#include <linux/bitops.h>
#include <linux/spinlock.h>
#include <linux/rcupdate.h>
+#include <net/sock.h>
+ #include <linux/close_range.h>

unsigned int sysctl_nr_open __read_mostly = 1024*1024;
unsigned int sysctl_nr_open_min = BITS_PER_LONG;

The following changes since commit b3a9e3b9622ae10064826dccb4f7a52bd88c7407:

Linux 5.8-rc1 (2020-06-14 12:45:04 -0700)

are available in the Git repository at:

git@xxxxxxxxxxxxxxxxxxx:pub/scm/linux/kernel/git/brauner/linux tags/close-range-v5.9

for you to fetch changes up to a5161eeef97cb0cdc4de966005926db2f5894af4:

tests: add CLOSE_RANGE_UNSHARE tests (2020-06-17 00:07:38 +0200)

Please consider pulling these changes from the signed close-range-v5.9 tag.



Christian Brauner (5):
open: add close_range()
arch: wire-up close_range()
tests: add close_range() tests
close_range: add CLOSE_RANGE_UNSHARE
tests: add CLOSE_RANGE_UNSHARE tests

arch/alpha/kernel/syscalls/syscall.tbl | 1 +
arch/arm/tools/syscall.tbl | 1 +
arch/arm64/include/asm/unistd32.h | 2 +
arch/ia64/kernel/syscalls/syscall.tbl | 1 +
arch/m68k/kernel/syscalls/syscall.tbl | 1 +
arch/microblaze/kernel/syscalls/syscall.tbl | 1 +
arch/mips/kernel/syscalls/syscall_n32.tbl | 1 +
arch/mips/kernel/syscalls/syscall_n64.tbl | 1 +
arch/mips/kernel/syscalls/syscall_o32.tbl | 1 +
arch/parisc/kernel/syscalls/syscall.tbl | 1 +
arch/powerpc/kernel/syscalls/syscall.tbl | 1 +
arch/s390/kernel/syscalls/syscall.tbl | 1 +
arch/sh/kernel/syscalls/syscall.tbl | 1 +
arch/sparc/kernel/syscalls/syscall.tbl | 1 +
arch/x86/entry/syscalls/syscall_32.tbl | 1 +
arch/x86/entry/syscalls/syscall_64.tbl | 1 +
arch/xtensa/kernel/syscalls/syscall.tbl | 1 +
fs/file.c | 121 +++++++++++--
fs/open.c | 17 ++
include/linux/fdtable.h | 6 +-
include/linux/syscalls.h | 2 +
include/uapi/asm-generic/unistd.h | 2 +
include/uapi/linux/close_range.h | 9 +
kernel/fork.c | 11 +-
tools/testing/selftests/Makefile | 1 +
tools/testing/selftests/core/.gitignore | 1 +
tools/testing/selftests/core/Makefile | 7 +
tools/testing/selftests/core/close_range_test.c | 227 ++++++++++++++++++++++++
28 files changed, 405 insertions(+), 17 deletions(-)
create mode 100644 include/uapi/linux/close_range.h
create mode 100644 tools/testing/selftests/core/.gitignore
create mode 100644 tools/testing/selftests/core/Makefile
create mode 100644 tools/testing/selftests/core/close_range_test.c