[RFC PATCH 00/21] pipe: Keyrings, Block and USB notifications

From: David Howells
Date: Tue Oct 15 2019 - 17:47:56 EST



Here's a set of patches to add a general notification queue concept and to
add event sources such as:

(1) Keys/keyrings, such as linking and unlinking keys and changing their
attributes.

(2) General device events (single common queue) including:

- Block layer events, such as device errors

- USB subsystem events, such as device attach/remove, device reset,
device errors.

I have patches for adding superblock and mount topology watches also,
though those are not in this set as there are other dependencies.

LSM hooks are included:

(1) A set of hooks are provided that allow an LSM to rule on whether or
not a watch may be set. Each of these hooks takes a different
"watched object" parameter, so they're not really shareable. The LSM
should use current's credentials. [Wanted by SELinux & Smack]

(2) A hook is provided to allow an LSM to rule on whether or not a
particular message may be posted to a particular queue. This is given
the credentials from the event generator (which may be the system) and
the watch setter. [Wanted by Smack]

I've provided SELinux and Smack with implementations of some of these hooks.


Design decisions:

(1) The notification queue is built on top of a standard pipe. Messages are
effectively spliced in.

pipe2(fds, O_TMPFILE); // Note that O_TMPFILE is just hacked in atm

which is then configured::

ioctl(fds[1], IOC_WATCH_QUEUE_SET_SIZE, queue_depth);
ioctl(fds[1], IOC_WATCH_QUEUE_SET_FILTER, &filter);

Messages are then read out of the pipe using read().

(2) It should be possible to allow write() to insert data into the
notification pipes too, but this is currently disabled as the kernel has
to be able to insert messages into the pipe *without* holding pipe->mutex
and the code to make this work needs careful auditing.

(3) sendfile(), splice() and vmsplice() are disabled on notification pipes
because of the pipe->mutex issue and also because they sometimes want to
revert what they just did - but one or more notification messages
might've been interleaved in the ring.

(4) The kernel inserts messages with the wait queue spinlock held. This
means that pipe_read() and pipe_write() have to take the spinlock to
update the queue pointers.

(5) Records in the buffer are binary, typed and have a length so that they
can be of varying size.

This allows multiple heterogeneous sources to share a common buffer;
there are 16 million types available, of which I've used just a few,
so there is scope for others to be used. Tags may be specified when a
watchpoint is created to help distinguish the sources.

(6) Records are filterable as types have up to 256 subtypes that can be
individually filtered. Other filtration is also available.

(7) Notification pipes don't interfere with each other; each may be bound to
a different set of watches. Any particular notification will be copied
to all the queues that are currently watching for it - and only those
that are watching for it.

(8) When recording a notification, the kernel will not sleep, but will rather
mark a queue as having lost a message if there's insufficient space.
read() will fabricate a loss notification message at an appropriate point
later.

(9) The notification pipe is created and then watchpoints are attached to it,
using one of:

keyctl_watch_key(KEY_SPEC_SESSION_KEYRING, fds[1], 0x01);
watch_devices(fds[1], 0x02, 0);

where in both cases, fd indicates the queue and the number after is a
tag between 0 and 255.

(10) Watches are removed if either the notification pipe is destroyed or the
watched object is destroyed. In the latter case, a message will be
generated indicating the enforced watch removal.


Things I want to avoid:

(1) Introducing features that make the core VFS dependent on the network
stack or networking namespaces (ie. usage of netlink).

(2) Dumping all this stuff into dmesg and having a daemon that sits there
parsing the output and distributing it as this then puts the
responsibility for security into userspace and makes handling
namespaces tricky. Further, dmesg might not exist or might be
inaccessible inside a container.

(3) Letting users see events they shouldn't be able to see.


Benchmarks:

PATCHES BENCHMARK BEST TOTAL BYTES AVG BYTES STDDEV
======= =============== =============== =============== =============== ===============
0 pipe 305798752 36324188187 302701568 6128384
0 vmsplice 432261032 49377769035 411481408 60638553
0 splice 284943309 27769149338 231409577 171374033

prelock pipe 307500893 36470120589 303917671 9970065
prelock vmsplice 437991897 51502859114 429190492 22841773
prelock splice 285469128 28158931451 234657762 138350254

ht pipe 304526362 35934813852 299456782 14371537
ht vmsplice 437819690 51481865190 429015543 25713854
ht splice 242954439 23296773094 194139775 148906490

r pipe 306166161 36117069156 300975576 11493141
r vmsplice 335931911 39837793527 331981612 9098195
r splice 282334654 27472565449 228938045 219399557

rx pipe 304664843 36059231110 300493592 9800700
rx vmsplice 446602247 52216984680 435141539 31456017
rx splice 277978890 26516522263 220971018 199170870

rxw pipe 304180415 36317294286 302644119 5967557
rxw vmsplice 447361082 51238998643 426991655 62191270
rxw splice 277213923 27627387718 230228230 173675370

rxwx pipe 305390974 36110185475 300918212 11895583
rxwx vmsplice 445372781 50668744616 422239538 63133543
rxwx splice 280218603 27677264384 230643869 196958675

rxwxf pipe 305001592 36325565843 302713048 7551300
rxwxf vmsplice 444316544 50213093070 418442442 77117994
rxwxf splice 278371338 28859765396 240498044 160245812

all pipe 298858861 35543407781 296195064 5173617
all vmsplice 453414388 53792295991 448269133 10914547
all splice 279264055 26812726990 223439391 212421416

The patches column indicates the point in the patchset at which the benchmarks
were taken:

0 No patches
prelock "Add a prelocked wake-up"
hr "pipe: Use head and tail pointers for the ring, not cursor and length"
r "pipe: Advance tail pointer inside of wait spinlock in pipe_read()"
rx "pipe: Conditionalise wakeup in pipe_read()"
rxw "pipe: Rearrange sequence in pipe_write() to preallocate slot"
rxwx "pipe: Remove redundant wakeup from pipe_write()"
rxwxf "pipe: Check for ring full inside of the spinlock in pipe_write()"
all All of the patches

The benchmark programs can be found here:

http://people.redhat.com/dhowells/pipe-bench.c
http://people.redhat.com/dhowells/vmsplice-bench.c
http://people.redhat.com/dhowells/splice-bench.c

The patches can be found here also:

http://git.kernel.org/cgit/linux/kernel/git/dhowells/linux-fs.git/log/?h=pipe-experimental

Changes:

ver #1:

(*) Build on top of standard pipes instead of having a driver.

David
---
David Howells (21):
pipe: Reduce #inclusion of pipe_fs_i.h
Add a prelocked wake-up
pipe: Use head and tail pointers for the ring, not cursor and length
pipe: Advance tail pointer inside of wait spinlock in pipe_read()
pipe: Conditionalise wakeup in pipe_read()
pipe: Rearrange sequence in pipe_write() to preallocate slot
pipe: Remove redundant wakeup from pipe_write()
pipe: Check for ring full inside of the spinlock in pipe_write()
uapi: General notification queue definitions
security: Add hooks to rule on setting a watch
security: Add a hook for the point of notification insertion
pipe: Add general notification queue support
keys: Add a notification facility
Add sample notification program
pipe: Allow buffers to be marked read-whole-or-error for notifications
pipe: Add notification lossage handling
Add a general, global device notification watch list
block: Add block layer notifications
usb: Add USB subsystem notifications
selinux: Implement the watch_key security hook
smack: Implement the watch_key and post_notification hooks


Documentation/ioctl/ioctl-number.rst | 1
Documentation/security/keys/core.rst | 58 ++
Documentation/watch_queue.rst | 385 ++++++++++++++++
arch/alpha/kernel/syscalls/syscall.tbl | 1
arch/arm/tools/syscall.tbl | 1
arch/arm64/include/asm/unistd.h | 2
arch/arm64/include/asm/unistd32.h | 2
arch/ia64/kernel/syscalls/syscall.tbl | 1
arch/m68k/kernel/syscalls/syscall.tbl | 1
arch/microblaze/kernel/syscalls/syscall.tbl | 1
arch/mips/kernel/syscalls/syscall_n32.tbl | 1
arch/mips/kernel/syscalls/syscall_n64.tbl | 1
arch/mips/kernel/syscalls/syscall_o32.tbl | 1
arch/parisc/kernel/syscalls/syscall.tbl | 1
arch/powerpc/kernel/syscalls/syscall.tbl | 1
arch/s390/kernel/syscalls/syscall.tbl | 1
arch/sh/kernel/syscalls/syscall.tbl | 1
arch/sparc/kernel/syscalls/syscall.tbl | 1
arch/x86/entry/syscalls/syscall_32.tbl | 1
arch/x86/entry/syscalls/syscall_64.tbl | 1
arch/xtensa/kernel/syscalls/syscall.tbl | 1
block/Kconfig | 9
block/blk-core.c | 29 +
drivers/base/Kconfig | 9
drivers/base/Makefile | 1
drivers/base/watch.c | 90 ++++
drivers/usb/core/Kconfig | 9
drivers/usb/core/devio.c | 47 ++
drivers/usb/core/hub.c | 4
fs/exec.c | 1
fs/fuse/dev.c | 31 +
fs/ocfs2/aops.c | 1
fs/pipe.c | 392 +++++++++++-----
fs/splice.c | 195 +++++---
include/linux/blkdev.h | 15 +
include/linux/device.h | 7
include/linux/key.h | 3
include/linux/lsm_audit.h | 1
include/linux/lsm_hooks.h | 38 ++
include/linux/pipe_fs_i.h | 63 ++-
include/linux/security.h | 32 +
include/linux/syscalls.h | 1
include/linux/uio.h | 4
include/linux/usb.h | 18 +
include/linux/wait.h | 2
include/linux/watch_queue.h | 127 +++++
include/uapi/asm-generic/unistd.h | 4
include/uapi/linux/keyctl.h | 2
include/uapi/linux/watch_queue.h | 155 ++++++
init/Kconfig | 12
kernel/Makefile | 1
kernel/sched/wait.c | 7
kernel/sys_ni.c | 1
kernel/watch_queue.c | 660 +++++++++++++++++++++++++++
lib/iov_iter.c | 274 +++++++----
samples/Kconfig | 7
samples/Makefile | 1
samples/watch_queue/Makefile | 7
samples/watch_queue/watch_test.c | 252 ++++++++++
security/keys/Kconfig | 9
security/keys/compat.c | 3
security/keys/gc.c | 5
security/keys/internal.h | 30 +
security/keys/key.c | 38 +-
security/keys/keyctl.c | 99 ++++
security/keys/keyring.c | 20 +
security/keys/request_key.c | 4
security/security.c | 23 +
security/selinux/hooks.c | 14 +
security/smack/smack_lsm.c | 83 +++
70 files changed, 2927 insertions(+), 377 deletions(-)
create mode 100644 Documentation/watch_queue.rst
create mode 100644 drivers/base/watch.c
create mode 100644 include/linux/watch_queue.h
create mode 100644 include/uapi/linux/watch_queue.h
create mode 100644 kernel/watch_queue.c
create mode 100644 samples/watch_queue/Makefile
create mode 100644 samples/watch_queue/watch_test.c