Re: [rfc][possible solution] RCU vfsmounts

From: Al Viro
Date: Tue Oct 01 2013 - 21:30:20 EST


On Mon, Sep 30, 2013 at 08:49:21PM +0100, Al Viro wrote:
> OK... AFAICS, we are not too far from being able to handle RCU pathwalk
> straying into fs in the middle of being shut down.
> * There are 5 methods that can be called:
> ->d_hash(...)
> ->d_compare(...)
> ->d_revalidate(..., LOOKUP_RCU | ...)
> ->d_manage(..., true)
> ->permission(..., MAY_NOT_BLOCK | MAY_EXEC)
> Filesystem needs to be able to survive those during shutdown. The stuff
> needed for that should _not_ be freed without synchronize_rcu() (or via
> call_rcu()); usually ->s_fs_info is involved (when anything is needed
> at all). In any case, we shouldn't allow rmmod without making sure that
> everything in RCU mode has run out, but most of the filesystems have
> rcu_barrier() in their exit_module anyway.

... and unregister_filesystem() has synchronize_rcu(), which leaves takes
care of everything other than callbacks in fs code fed to call_rcu().
So the things are even less painful.

> * __put_super() probably ought to delay actual freeing via
> call_rcu(); might not be strictly necessary, but probably a good idea
> anyway.
> * shrink_dcache_for_umount() ought to use d_walk(), a-la
> shrink_dcache_parent().
>
> Note that most of the filesystems don't have any of these methods or
> don't look at anything outside of inode/dentry involved in RCU case.
> Zoo:
>
> * adfs: has the name length limit in fs-private part of superblock; used
> by ->d_hash() and ->d_compare(). No other methods involved, synchronize_rcu()
> before doing kfree() in adfs_put_super() will suffice.
>
> * autofs4: wants fs-private part of superblock in ->d_manage().
> synchronize_rcu() in autofs4_kill_sb() would do it, or we could delay
> freeing that sucker via call_rcu() (in that case we want delayed
> freeing in __put_super() as well).
>
> * btrfs: wants btrfs_root_readonly(BTRFS_I(inode)->root) usable in
> ->permission(). Delayed freeing of struct btrfs_root, perhaps?
>
> * cifs: wants nls, refered to from fs-private part of superblock.
> ->permission() wants fs-private part of superblock as well. Just
> synchronize_rcu() before unload_nls() in cifs_umount()...
>
> * fat: same situation as with cifs
>
> * fuse: delayed freeing of struct fuse_conn? BTW, Miklos, just what is
> } else if (mask & (MAY_ACCESS | MAY_CHDIR)) {
> if (mask & MAY_NOT_BLOCK)
> return -ECHILD;
> about, when we never pass such combinations? Oh, well...
>
> * hpfs: similar to cifs and fat, only without use of nls (a homegrown table
> of some sort).
>
> * ncpfs: _probably_ similar to cifs et.al., but there might be dragons
>
> * procfs: delayed freeing of pid_namespace?
>
> * lustre: messy, haven't looked through that.

Extremely preliminary version is in vfs.git #experimental. No fs-specific
fixes mentioned above are in there; relevant stuff is in the end of
queue -
initialize namespace_sem statically
fs_is_visible only needs namespace_sem held shared
dup_mnt_ns(): get rid of pointless grabbing of vfsmount_lock
do_remount(): pull touch_mnt_namespace() up
fold mntfree() into mntput_no_expire()
fs/namespace.c: bury long-dead define
finish_automount() doesn't need vfsmount_lock for removal from expiry list
mnt_set_expiry() doesn't need vfsmount_lock
fold dup_mnt_ns() into its only surviving caller
namespace.c: get rid of mnt_ghosts
don't bother with vfsmount_lock in mounts_poll()
new helpers: lock_mount_hash/unlock_mount_hash
isofs: don't pass dentry to isofs_hash{i,}_common()
uninline destroy_super(), consolidate alloc_super()
split __lookup_mnt() in two functions
move taking vfsmount_lock down into prepend_path()
RCU'd vfsmounts
fs/dcache.c | 221 +++++++++++++----------------
fs/internal.h | 4 -
fs/isofs/inode.c | 12 +-
fs/mount.h | 20 +++-
fs/namei.c | 87 +++++------
fs/namespace.c | 386 +++++++++++++++++++++++++------------------------
fs/pnode.c | 13 +-
fs/proc_namespace.c | 8 +-
fs/super.c | 206 +++++++++++---------------
include/linux/mount.h | 2 +
include/linux/namei.h | 2 +-
11 files changed, 455 insertions(+), 506 deletions(-)

With fs fixes added it'll probably still end up with slightly negative
balance...

It's barely tested (i.e. no beating for races, etc. - boots and shuts down
without any apparent problems, but that's it) and the last commit almost
certainly needs a splitup. Everything prior to it can probably go into
-next and I hope to carve standalone pieces from it as well. Review and
comments are welcome; personally, I prefer to use git fetch for review,
but if somebody prefers posted mailbomb, yell and I'll send it...
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/