Re: [RFC][PATCH 00/13] Mount, FS, Block and Keyrings notifications [ver #4]
From: Casey Schaufler
Date: Mon Jun 10 2019 - 12:38:31 EST
On 6/10/2019 8:21 AM, Stephen Smalley wrote:
> On 6/7/19 10:17 AM, David Howells wrote:
>>
>> Hi Al,
>>
>> Here's a set of patches to add a general variable-length notification queue
>> concept and to add sources of events for:
>>
>> Â (1) Mount topology events, such as mounting, unmounting, mount expiry,
>> ÂÂÂÂÂ mount reconfiguration.
>>
>> Â (2) Superblock events, such as R/W<->R/O changes, quota overrun and I/O
>> ÂÂÂÂÂ errors (not complete yet).
>>
>> Â (3) Key/keyring events, such as creating, linking and removal of keys.
>>
>> Â (4) General device events (single common queue) including:
>>
>> ÂÂÂÂÂ - Block layer events, such as device errors
>>
>> ÂÂÂÂÂ - USB subsystem events, such as device/bus attach/remove, device
>> ÂÂÂÂÂÂÂ reset, device errors.
>>
>> One of the reasons for this is so that we can remove the issue of processes
>> having to repeatedly and regularly scan /proc/mounts, which has proven to
>> be a system performance problem. To further aid this, the fsinfo() syscall
>> on which this patch series depends, provides a way to access superblock and
>> mount information in binary form without the need to parse /proc/mounts.
>>
>>
>> LSM support is included, but controversial:
>>
>> Â (1) The creds of the process that did the fput() that reduced the refcount
>> ÂÂÂÂÂ to zero are cached in the file struct.
>>
>> Â (2) __fput() overrides the current creds with the creds from (1) whilst
>> ÂÂÂÂÂ doing the cleanup, thereby making sure that the creds seen by the
>> ÂÂÂÂÂ destruction notification generated by mntput() appears to come from
>> ÂÂÂÂÂ the last fputter.
>>
>> Â (3) security_post_notification() is called for each queue that we might
>> ÂÂÂÂÂ want to post a notification into, thereby allowing the LSM to prevent
>> ÂÂÂÂÂ covert communications.
>>
>> Â (?) Do I need to add security_set_watch(), say, to rule on whether a watch
>>  may be set in the first place? I might need to add a variant per
>> ÂÂÂÂÂ watch-type.
>>
>> Â (?) Do I really need to keep track of the process creds in which an
>>  implicit object destruction happened? For example, imagine you create
>>  an fd with fsopen()/fsmount(). It is marked to dissolve the mount it
>>  refers to on close unless move_mount() clears that flag. Now, imagine
>> ÂÂÂÂÂ someone looking at that fd through procfs at the same time as you exit
>>  due to an error. The LSM sees the destruction notification come from
>> ÂÂÂÂÂ the looker if they happen to do their fput() after yours.
>
> I remain unconvinced that (1), (2), (3), and the final (?) above are a good idea.
>
> For SELinux, I would expect that one would implement a collection of per watch-type WATCH permission checks on the target object (or to some well-defined object label like the kernel SID if there is no object) that allow receipt of all notifications of that watch-type for objects related to the target object, where "related to" is defined per watch-type.
>
> I wouldn't expect SELinux to implement security_post_notification() at all. I can't see how one can construct a meaningful, stable policy for it. I'd argue that the triggering process is not posting the notification; the kernel is posting the notification and the watcher has been authorized to receive it.
I cannot agree. There is an explicit action by a subject that results
in information being delivered to an object. Just like a signal or a
UDP packet delivery. Smack handles this kind of thing just fine. The
internal mechanism that results in the access is irrelevant from
this viewpoint. I can understand how a mechanism like SELinux that
works on finer granularity might view it differently.
>
>>
>>
>> Design decisions:
>>
>> Â (1) A misc chardev is used to create and open a ring buffer:
>>
>> ÂÂÂÂfd = open("/dev/watch_queue", O_RDWR);
>>
>> ÂÂÂÂÂ which is then configured and mmap'd into userspace:
>>
>> ÂÂÂÂioctl(fd, IOC_WATCH_QUEUE_SET_SIZE, BUF_SIZE);
>> ÂÂÂÂioctl(fd, IOC_WATCH_QUEUE_SET_FILTER, &filter);
>> ÂÂÂÂbuf = mmap(NULL, BUF_SIZE * page_size, PROT_READ | PROT_WRITE,
>> ÂÂÂÂÂÂÂÂÂÂ MAP_SHARED, fd, 0);
>>
>> ÂÂÂÂÂ The fd cannot be read or written (though there is a facility to use
>> ÂÂÂÂÂ write to inject records for debugging) and userspace just pulls data
>> ÂÂÂÂÂ directly out of the buffer.
>>
>> Â (2) The ring index pointers are stored inside the ring and are thus
>>  accessible to userspace. Userspace should only update the tail
>>  pointer and never the head pointer or risk breaking the buffer. The
>> ÂÂÂÂÂ kernel checks that the pointers appear valid before trying to use
>>  them. A 'skip' record is maintained around the pointers.
>>
>> Â (3) poll() can be used to wait for data to appear in the buffer.
>>
>> Â (4) Records in the buffer are binary, typed and have a length so that they
>> ÂÂÂÂÂ can be of varying size.
>>
>> ÂÂÂÂÂ This means that multiple heterogeneous sources can share a common
>>  buffer. Tags may be specified when a watchpoint is created to help
>> ÂÂÂÂÂ distinguish the sources.
>>
>> Â (5) The queue is reusable as there are 16 million types available, of
>> ÂÂÂÂÂ which I've used 4, so there is scope for others to be used.
>>
>> Â (6) Records are filterable as types have up to 256 subtypes that can be
>>  individually filtered. Other filtration is also available.
>>
>> Â (7) Each time the buffer is opened, a new buffer is created - this means
>> ÂÂÂÂÂ that there's no interference between watchers.
>>
>> Â (8) When recording a notification, the kernel will not sleep, but will
>> ÂÂÂÂÂ rather mark a queue as overrun if there's insufficient space, thereby
>> ÂÂÂÂÂ avoiding userspace causing the kernel to hang.
>>
>> Â (9) The 'watchpoint' should be specific where possible, meaning that you
>> ÂÂÂÂÂ specify the object that you want to watch.
>>
>> (10) The buffer is created and then watchpoints are attached to it, using
>> ÂÂÂÂÂ one of:
>>
>> ÂÂÂÂkeyctl_watch_key(KEY_SPEC_SESSION_KEYRING, fd, 0x01);
>> ÂÂÂÂmount_notify(AT_FDCWD, "/", 0, fd, 0x02);
>> ÂÂÂÂsb_notify(AT_FDCWD, "/mnt", 0, fd, 0x03);
>>
>> ÂÂÂÂÂ where in all three cases, fd indicates the queue and the number after
>> ÂÂÂÂÂ is a tag between 0 and 255.
>>
>> (11) The watch must be removed if either the watch buffer is destroyed or
>> ÂÂÂÂÂ the watched object is destroyed.
>>
>>
>> Things I want to avoid:
>>
>> Â (1) Introducing features that make the core VFS dependent on the network
>> ÂÂÂÂÂ stack or networking namespaces (ie. usage of netlink).
>>
>> Â (2) Dumping all this stuff into dmesg and having a daemon that sits there
>> ÂÂÂÂÂ parsing the output and distributing it as this then puts the
>> ÂÂÂÂÂ responsibility for security into userspace and makes handling
>>  namespaces tricky. Further, dmesg might not exist or might be
>> ÂÂÂÂÂ inaccessible inside a container.
>>
>> Â (3) Letting users see events they shouldn't be able to see.
>>
>>
>> Further things that could be considered:
>>
>> Â (1) Adding a keyctl call to allow a watch on a keyring to be extended to
>> ÂÂÂÂÂ "children" of that keyring, such that the watch is removed from the
>> ÂÂÂÂÂ child if it is unlinked from the keyring.
>>
>> Â (2) Adding global superblock event queue.
>>
>> Â (3) Propagating watches to child superblock over automounts.
>>
>>
>> The patches can be found here also:
>>
>> ÂÂÂÂhttp://git.kernel.org/cgit/linux/kernel/git/dhowells/linux-fs.git/log/?h=notifications
>>
>> Changes:
>>
>> Â v4: Split the basic UAPI bits out into their own patch and then split the
>>  LSM hooks out into an intermediate patch. Add LSM hooks for setting
>> ÂÂÂÂÂ watches.
>>
>> ÂÂÂÂÂ Rename the *_notify() system calls to watch_*() for consistency.
>>
>> Â v3: I've added a USB notification source and reformulated the block
>> ÂÂÂÂÂ notification source so that there's now a common watch list, for which
>> ÂÂÂÂÂ the system call is now device_notify().
>>
>> ÂÂÂÂÂ I've assigned a pair of unused ioctl numbers in the 'W' series to the
>> ÂÂÂÂÂ ioctls added by this series.
>>
>> ÂÂÂÂÂ I've also added a description of the kernel API to the documentation.
>>
>> Â v2: I've fixed various issues raised by Jann Horn and GregKH and moved to
>>  krefs for refcounting. I've added some security features to try and
>> ÂÂÂÂÂ give Casey Schaufler the LSM control he wants.
>>
>> David
>> ---
>> David Howells (13):
>> ÂÂÂÂÂÂ security: Override creds in __fput() with last fputter's creds
>> ÂÂÂÂÂÂ uapi: General notification ring definitions
>> ÂÂÂÂÂÂ security: Add hooks to rule on setting a watch
>> ÂÂÂÂÂÂ security: Add a hook for the point of notification insertion
>> ÂÂÂÂÂÂ General notification queue with user mmap()'able ring buffer
>> ÂÂÂÂÂÂ keys: Add a notification facility
>> ÂÂÂÂÂÂ vfs: Add a mount-notification facility
>> ÂÂÂÂÂÂ vfs: Add superblock notifications
>> ÂÂÂÂÂÂ fsinfo: Export superblock notification counter
>> ÂÂÂÂÂÂ Add a general, global device notification watch list
>> ÂÂÂÂÂÂ block: Add block layer notifications
>> ÂÂÂÂÂÂ usb: Add USB subsystem notifications
>> ÂÂÂÂÂÂ Add sample notification program
>>
>>
>> Â Documentation/ioctl/ioctl-number.txtÂÂ |ÂÂÂ 1
>> Â Documentation/security/keys/core.rstÂÂ |ÂÂ 58 ++
>> Â Documentation/watch_queue.rstÂÂÂÂÂÂÂÂÂ |Â 492 ++++++++++++++++++
>> Â arch/x86/entry/syscalls/syscall_32.tbl |ÂÂÂ 3
>> Â arch/x86/entry/syscalls/syscall_64.tbl |ÂÂÂ 3
>> Â block/KconfigÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂ |ÂÂÂ 9
>> Â block/blk-core.cÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂ |ÂÂ 29 +
>> Â drivers/base/KconfigÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂ |ÂÂÂ 9
>> Â drivers/base/MakefileÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂ |ÂÂÂ 1
>> Â drivers/base/watch.cÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂ |ÂÂ 89 +++
>> Â drivers/misc/KconfigÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂ |ÂÂ 13
>> Â drivers/misc/MakefileÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂ |ÂÂÂ 1
>> Â drivers/misc/watch_queue.cÂÂÂÂÂÂÂÂÂÂÂÂ |Â 889 ++++++++++++++++++++++++++++++++
>> Â drivers/usb/core/KconfigÂÂÂÂÂÂÂÂÂÂÂÂÂÂ |ÂÂ 10
>> Â drivers/usb/core/devio.cÂÂÂÂÂÂÂÂÂÂÂÂÂÂ |ÂÂ 55 ++
>> Â drivers/usb/core/hub.cÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂ |ÂÂÂ 3
>> Â fs/KconfigÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂ |ÂÂ 21 +
>> Â fs/MakefileÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂ |ÂÂÂ 1
>> Â fs/file_table.cÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂ |ÂÂ 12
>> Â fs/fsinfo.cÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂ |ÂÂ 12
>> Â fs/mount.hÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂ |ÂÂ 33 +
>> Â fs/mount_notify.cÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂ |Â 187 +++++++
>> Â fs/namespace.cÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂ |ÂÂÂ 9
>> Â fs/super.cÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂ |Â 122 ++++
>> Â include/linux/blkdev.hÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂ |ÂÂ 15 +
>> Â include/linux/dcache.hÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂ |ÂÂÂ 1
>> Â include/linux/device.hÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂ |ÂÂÂ 7
>> Â include/linux/fs.hÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂ |ÂÂ 79 +++
>> Â include/linux/key.hÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂ |ÂÂÂ 4
>> Â include/linux/lsm_hooks.hÂÂÂÂÂÂÂÂÂÂÂÂÂ |ÂÂ 48 ++
>> Â include/linux/security.hÂÂÂÂÂÂÂÂÂÂÂÂÂÂ |ÂÂ 35 +
>> Â include/linux/syscalls.hÂÂÂÂÂÂÂÂÂÂÂÂÂÂ |ÂÂÂ 5
>> Â include/linux/usb.hÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂ |ÂÂ 19 +
>> Â include/linux/watch_queue.hÂÂÂÂÂÂÂÂÂÂÂ |ÂÂ 87 +++
>> Â include/uapi/linux/fsinfo.hÂÂÂÂÂÂÂÂÂÂÂ |ÂÂ 10
>> Â include/uapi/linux/keyctl.hÂÂÂÂÂÂÂÂÂÂÂ |ÂÂÂ 1
>> Â include/uapi/linux/watch_queue.hÂÂÂÂÂÂ |Â 213 ++++++++
>> Â kernel/sys_ni.cÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂ |ÂÂÂ 7
>> Â samples/KconfigÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂ |ÂÂÂ 6
>> Â samples/MakefileÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂ |ÂÂÂ 1
>> Â samples/vfs/test-fsinfo.cÂÂÂÂÂÂÂÂÂÂÂÂÂ |ÂÂ 13
>> Â samples/watch_queue/MakefileÂÂÂÂÂÂÂÂÂÂ |ÂÂÂ 9
>> Â samples/watch_queue/watch_test.cÂÂÂÂÂÂ |Â 308 +++++++++++
>> Â security/keys/KconfigÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂ |ÂÂ 10
>> Â security/keys/compat.cÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂ |ÂÂÂ 2
>> Â security/keys/gc.cÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂ |ÂÂÂ 5
>> Â security/keys/internal.hÂÂÂÂÂÂÂÂÂÂÂÂÂÂ |ÂÂ 30 +
>> Â security/keys/key.cÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂ |ÂÂ 37 +
>> Â security/keys/keyctl.cÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂ |ÂÂ 95 +++
>> Â security/keys/keyring.cÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂ |ÂÂ 17 -
>> Â security/keys/request_key.cÂÂÂÂÂÂÂÂÂÂÂ |ÂÂÂ 4
>> Â security/security.cÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂÂ |ÂÂ 29 +
>> Â 52 files changed, 3121 insertions(+), 38 deletions(-)
>> Â create mode 100644 Documentation/watch_queue.rst
>> Â create mode 100644 drivers/base/watch.c
>> Â create mode 100644 drivers/misc/watch_queue.c
>> Â create mode 100644 fs/mount_notify.c
>> Â create mode 100644 include/linux/watch_queue.h
>> Â create mode 100644 include/uapi/linux/watch_queue.h
>> Â create mode 100644 samples/watch_queue/Makefile
>> Â create mode 100644 samples/watch_queue/watch_test.c
>>
>