[PATCH RFC 0/4] VFS revoke()

From: David Herrmann
Date: Tue Aug 12 2014 - 14:57:37 EST


Hi

This patchset implements synchronous shutdown of file->f_op->xyz() access. It
provides generic helpers that can be used by any provider of file->f_op to
prevent new tasks from entering a file-operation and synchronously waiting for
all pending file-operations to finish.
Furthermore, it includes two patches to show how it can be used by device
drivers (here: Input-evdev as an example).

There are several use-cases for this patchset:

1) Hotplug Support
Currently, a device driver for hotpluggable hardware has to jump through hoops
to make unplugging work. Imagine a driver that registers a cdev via cdev_add()
in the ->probe() callback on its bus. Once cdev_add() returns, user-space might
start entering file->f_op->xyz() callbacks on the cdev.
Now, if the hardware device is unplugged, the ->remove() callback of the device
driver is called on its bus. The driver now removes the cdev via cdev_del(),
but this does not guarantee that there's no open file-description left.
Furthermore, it does not prevent any further entry to file-operations.
Therefore, the driver has to protect all its f_op callbacks and deny new tasks
whenever the device was removed. This needlessly leaves the device around
(albeit disabled) until all file-descriptions are closed by user-space.
With generic revoke() support, a driver can synchronously shutdown all
file-operations. This way, it can guarantee that all open files are closed
before returning from its ->remove() callback on the bus.

2) Access Management
Once a user-space process was granted access to a device node, there is no way
to revoke that access again. This is problematic for all devices that are shared
between sessions. Imagine switching between multiple graphics servers, there is
currently no way to make sure only the foreground server gets access to graphics
and input devices.
The easiest solution, obviously, is to provide a user-space daemon that
dispatches between those processes. However, this requires to wrap *all* API
calls and reduces performance of device-operations considerably.
Therefore, input and graphics drivers have started to implement access
management as part of their API. To avoid duplicating that code everywhere, we
need a generic revoke() implementation that revokes *just one file-description*.
This way, generic access-management (like the 'logind' daemon) can hand out
file-descriptors to processes and retain a copy themselves. By revoking access
to the file, the process will loose access, too. We'd love to see such revoke
support for webcams, audio devices, disks and more, so we can prevent background
sessions from accessing those.

3) Generic revoke() / frevoke()
Obviously, the generic revoke() syscall can be implemented with this, too.
Unlike 2), this syscall revokes access to *all* open file-descriptions attached
to a device/object. This can be exposed to user-space, but is also handy to
implement existing syscalls like vhangup().


This patchset introduces two new objects:
struct revokable_device:
This is the entry point to revoke-support. Every object that needs revoke
support has to have a revokable_device embedded somewhere. This is usually
part of the "cdev", "inode" or other kinds of parent devices. However, it
can be used freely and does not enforce any restrictions.
The revokable_device object is used as parent for all open files on the
underlying device. That is, each file on that device that is opened via
->open() will be attached to this revokable_device object. Each attached
file can be managed independently of the parent device and can be revoked
by itself.
The revokable_device object now allows to manage all attached files at
once. That is, if you call revoke_device(), this will cause all attached
files to be revoked at the same time and prevent new files from being
opened.
struct revokable_file:
Inside of the ->open() callback, a driver needs to call make_revokable()
if it needs revoke-support on a file. This will create a new
"revokable_file" object and link it from file->f_revoke. The object will
also be attached to the parent revokable_device object that is passed to
make_revokable().
Once make_revokable() returns, the file is active and may be revoked at
any time (maybe even before you return from ->open()).
Each revokable_file can be revoked independently of each other. Once a
device is revoked, any new entry to a file-operation is denied and pending
operations are completed synchronously. Once all those operations are
done, the ->release() callback of the file is called early so the device
driver can detach from it. Once the last file-reference is dropped,
__fput() will no longer call ->release(), as it has already been called.

Unlike previous attempts to implement revoke(), this series implements revoke()
on the file-description level. That is, each open file description can be
revoked independently of each other. This is required to implement
access-management on a per-file level, instead of revoking the whole device.
Furthermore, this series splits revoke_file() and drain_file(). The former only
marks the file as revoked and prevents new file-operations, the latter waits for
pending operations to finish. This split allows drivers to perform any required
actions in between both calls (or avoid draining at all). Note that if
drain_file() is called, the driver needs to wake up sleeping file-operations
after revoke_file() was called. Otherwise, drain_file() might stall.

Open questions:
* Obviously, we need to protect all calls into file->f_op->xyz(). This series
provides enter_file() and leave_file() that can be used like:
if (enter_file(file))
retval = file->f_op->xyz(file, ...);
else
retval = -ENODEV;
Question is, should we do this at the time we actually invoke those
callbacks or should we do it at the syscall-entry time? If we do it right
when calling the f_op, we might and up with stacked calls. This works fine,
but makes drain_file_self() awful to implement.
Note that I have implemented both, but didn't include the patches in this
series as I wasn't sure which to go with and they need much more love...
Making sure we call enter_file/leave_file everywhere will require proper
auditing and I guess people will tell me to drop drain_self(), but lets see..
* Do we need drain_file_self()? That is, do we really want to allow drivers
to drain a file while being inside a f_op callback? It is handy to implement
existing calls like EVIOCREVOKE properly, but makes the whole system a lot
more complex.
It was a nice challenge to implement it, but I'm fine with dropping it again.
* What should revoke() do? Currently, revoke_device() calls revoke_file() on
*all* attached files and prevents any new calls to ->open(). This is handy to
make hotplugging work, but doesn't make much sense if called on real files.
We definitely want to allow new calls to open(), so I guess setting
"device->revoked" to "true" should be optional and only used by drivers
explicitly on unplug. The currently implementation can be made to allow both.
But parallel calls to open() while revoke_device() is called will still be
revoked until revoke_device() really returns.
* What error code do we want to return when a file is revoked? poll() should
probably signal POLLHUP, but what for all the other stuff? ENODEV? EACCES?
For user-space code, a separate EUNPLUG code would be *very* handy to make
dealing with asynchronous unplug more easy. EREVOKED would also work, but
doesn't allow to distinguish between device unplug and plain access
revocation.
* What to do with remaining data in receive buffers? On revoke() we want all
access to the device to be denied, but on unplug we *might* want to allow
user-space to retrieve already queued data in the receive buffers. That is,
poll returns POLLHUP | POLLIN and read() can return already queued data. I
haven't seen any reason to do this, though. Iff hardware shows up that has
really short lifetime and we want all data to be forwarded to user-space, we
might want to add some special flag to allow poll() and read(). Until then, I
think we can safely ignore it.

Notes:
* This series is loosely based on:
- previous mails by Al Viro
- "active"-refs of kernfs by Tejun Heo
- EVIOCREVOKE on evdev
The patches are written from scratch (no previous attempt allowed revoke() on
single files so far), but credits go to these guys, too!
* We probably want a generic kick() callback. With this in place, we can
implement frevoke() easily:
int frevoke(struct file *file)
{
if (!file->f_revoke)
return -ENOTSUPP;
if (!enter_file(file))
return -ENODEV;
revoke_device(file->f_revoke->device);
if (file->f_revoke->device->kick)
file->f_revoke->device->kick(file->f_revoke->device);
leave_file(file);
/* TODO: need ref on device; f_revoke->device may ne NULL here */
drain_device(file->f_revoke->device);
return 0;
}
Implementing revoke() is harder as we need a connection between a dentry and
a revokable_file. We could add inode->i_revoke for that.
In both cases, we need some guarantee that revokable_device does not vanish
while we use it. So we either need to know the parent object and hold a ref
to it, or we add ref-counts to revokable_device.
* mmap() is obviously left to device-drivers to handle in ->release() or
->kick(). I recently posted patches that do page duplication and thus can
give each open-file effectively a MAP_PRIVATE copy at any time. But if
drivers map I/O memory directly into user-space, you're screwed, anyway.
Imho, this is something drivers need to fix on a per-case basis. Same way
drivers handle ->release() regarding mmaps (either leaving them alive or
revoking them, or .. whatever).
* This adds 8 bytes to all "struct file"s that don't use revoke(). I think
that's a fair tradeoff. If you want revoke(), it adds ~40 bytes for
revokable_file and ~80 bytes for each revokable_device.
The fast-paths check file->f_revoke for NULL in enter_file(). Should be fine.
In case you enabled revoke(), you get atomic_inc_unless_negative() and
atomic_dec_return() in fast-paths. Slow-paths (in case the file is revoked)
are heavy, but that should be just fine. I mean, most device release paths
use several synchronize_rcu() calls and more. Some atomic_t magic for
revoke() should be just fine.
* "struct kactive" uses atomic_t heavily to implement active device references.
I have no idea whether it makes sense to do it that way. Shout at me if the
slow-paths should be changed to use a spin-lock..
I'd also appreciate mem-barrier reviews there. I'm not entirely sure I got
them all right.
* Makeing revokable_device->active_files RCU protected would be *really* nice.
I didn't succeed in doing so, though. The hard part is revoke_device() which
somehow needs to iterate the whole list of devices and cannot allow parallel
modifications. But it needs to drop the spin-lock for ->release(). My current
way of moving between two lists works fine without RCU, but not with RCU.
But maybe the current spin-lock protection is just fine. I mean, there's not
much traversal going on, anyway, and modifications are always protected by
the lock.
* This evdev-patches in this series rely on some non-mainstream trivial
cleanups. In case anyone cares, see:
"[PATCH v2] Input: evdev - drop redundant list-locking"
* ...
* I probably have a lot more notes that I forgot again. Any comments welcome. I
just wanted to get this out so people know I'm working on it, and maybe
people can point out whether this is going into the right direction.


Anyway, regardless of revoke(2), frevoke(2) and stuff, this series already makes
writing device-drivers a lot easier. I included two patches that convert one of
the most trivial drivers (and one of the few drivers that actually does proper
unplug handling) to use the new infrastucture (Input: evdev). I have similar
patches for ALSA and DRM, but they require far more cleanups before, so I didn't
include them for now. Note that most drivers are horribly racy regarding
unplugging (ALSA replaces f_op with a dummy and hopes no-one's inside a f_op so
far) or don't care at all (DRM just destroys the device on PCI unplug). Most of
this works fine so far, because we did our best to reduce races to a minimum.
However, it really looks ugly and it would make many developer lives easier if
we had synchronous shutdown.

After adding enter_file()/leave_file() annotations, this series already works
fine on my machine.

Comments welcome!
David

David Herrmann (4):
kactive: introduce generic "active"-refcounts
vfs: add revoke() helpers
Input: evdev - switch to revoke helpers
Input: evdev - drop redundant client_list

drivers/input/evdev.c | 179 +++++++++++----------------
fs/Makefile | 2 +-
fs/file_table.c | 4 +-
fs/revoke.c | 194 ++++++++++++++++++++++++++++++
include/linux/fs.h | 2 +
include/linux/kactive.h | 269 +++++++++++++++++++++++++++++++++++++++++
include/linux/revoke.h | 124 +++++++++++++++++++
lib/Makefile | 2 +-
lib/kactive.c | 312 ++++++++++++++++++++++++++++++++++++++++++++++++
9 files changed, 974 insertions(+), 114 deletions(-)
create mode 100644 fs/revoke.c
create mode 100644 include/linux/kactive.h
create mode 100644 include/linux/revoke.h
create mode 100644 lib/kactive.c

--
2.0.4

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/