Re: [PATCHSET][RFC][CFT] parallel lookups
From: Andreas Dilger
Date: Fri Apr 15 2016 - 23:02:27 EST
On Apr 15, 2016, at 6:52 PM, Al Viro <viro@xxxxxxxxxxxxxxxxxx> wrote:
>
> The thing appears to be working. It's in vfs.git#work.lookups; the
> last 5 commits are the infrastructure (fs/namei.c and fs/dcache.c; no changes
> in fs/*/*) + actual switch to rwsem.
>
> 	The missing bits: down_write_killable() (a series introducing just
> that has been posted; for now I've replaced the mutex_lock_killable()
> calls with plain inode_lock() - they are not critical for any testing,
> and as soon as down_write_killable() gets in I'll switch those over),
> lockdep bits that might need corrections, and the fact that right now
> it's only for lookups.
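
A minimal sketch of what that replacement could look like once
down_write_killable() lands (inode_lock_killable() is a made-up name here,
not an existing helper; down_write_killable() returns 0 or -EINTR):

    /* hypothetical helper, assuming down_write_killable() goes in as
     * posted: returns 0 on success, -EINTR on a fatal signal */
    static inline int inode_lock_killable(struct inode *inode)
    {
            return down_write_killable(&inode->i_rwsem);
    }

    /* call site, replacing mutex_lock_killable(&inode->i_mutex): */
    if (inode_lock_killable(inode))
            return -EINTR;
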
>
> 	I'm going to add readdir to the mix; the primitive added in this
> series (d_alloc_parallel()) will need to be used in the dcache pre-seeding
> paths, the ncpfs use of dentry_update_name_case() will need to be changed
> to something less hacky, and syscalls calling iterate_dir() will need to
> switch to fdget_pos() (with FMODE_ATOMIC_POS set for directories as well
> as regular files). That last bit is needed for exclusion at the struct
> file level - there's a bunch of cases where we maintain data structures
> hanging off file->private_data, and those really need to be serialized.
> Besides, serializing ->f_pos updates is needed for sane semantics; right
> now we tend to use ->i_mutex for that, but it would be easier to go for
> the same mechanism as for regular files. With any luck we'll have working
> parallel readdir in addition to parallel lookups in this cycle as well.
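
A rough sketch of the getdents side of that plan (fdget_pos()/fdput_pos()
already exist and are used for regular files; the directory part would
mainly be extending the FMODE_ATOMIC_POS marking in do_dentry_open()):

    /* mark directories for ->f_pos serialization like regular files */
    if (S_ISREG(inode->i_mode) || S_ISDIR(inode->i_mode))
            f->f_mode |= FMODE_ATOMIC_POS;

    /* and in the getdents()-style syscalls: */
    struct fd f = fdget_pos(fd);            /* was fdget(fd) */
    if (!f.file)
            return -EBADF;
    error = iterate_dir(f.file, &buf.ctx);  /* ->f_pos now serialized */
    fdput_pos(f);                           /* was fdput(f) */
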
>
> The patchset is on top of switching getxattr to passing dentry and
> inode separately; that part will get changes (in particular, the stuff
> agruen has posted lately), but the lookups queue proper cares only about
> being able to move security_d_instantiate() to the point before dentry
> is attached to inode.
>
> 1/15: security_d_instantiate(): move to the point prior to attaching dentry
> to inode. Depends on the getxattr changes; allows doing the "attach to inode"
> and "add to dentry hash" parts without dropping ->d_lock in between.
>
> 2/15 -- 8/15: preparations - stuff similar to what went in during the last
> cycle; several places switched to lookup_one_len_unlocked(), a bunch of
> direct manipulations of ->i_mutex replaced with the inode_lock() family of
> helpers, etc.
>
> kernfs: use lookup_one_len_unlocked().
> configfs_detach_prep(): make sure that wait_mutex won't go away
> ocfs2: don't open-code inode_lock/inode_unlock
> orangefs: don't open-code inode_lock/inode_unlock
> reiserfs: open-code reiserfs_mutex_lock_safe() in reiserfs_unpack()
> reconnect_one(): use lookup_one_len_unlocked()
> ovl_lookup_real(): use lookup_one_len_unlocked()
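
The typical shape of those lookup_one_len_unlocked() conversions, for
anyone who hasn't looked at them (the helper takes the parent's lock
itself when it actually has to hit the filesystem):

    /* before: caller holds the parent directory's lock across the lookup */
    inode_lock(d_inode(parent));
    dentry = lookup_one_len(name, parent, len);
    inode_unlock(d_inode(parent));

    /* after: locking is the helper's problem, and a dcache hit can
     * avoid taking the directory lock entirely */
    dentry = lookup_one_len_unlocked(name, parent, len);
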
>
> 9/15: lookup_slow(): bugger off on IS_DEADDIR() from the very beginning
> open-code real_lookup() call in lookup_slow(), move IS_DEADDIR check upwards.
>
> 10/15: __d_add(): don't drop/regain ->d_lock
> that's what 1/15 had been for; might make sense to reorder closer to it.
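
In other words, that part of the series leaves d_add() looking roughly
like this (simplified from #work.lookups):

    void d_add(struct dentry *entry, struct inode *inode)
    {
            if (inode) {
                    /* 1/15 moved this hook before the attach */
                    security_d_instantiate(entry, inode);
                    spin_lock(&inode->i_lock);
            }
            /* attach to the inode and add to the hash under a single
             * ->d_lock acquisition, no drop/regain in between */
            __d_add(entry, inode);
    }
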
>
> 11/15 -- 14/15: actual machinery for parallel lookups. This stuff could've
> been a single commit, along with the actual switch to rwsem and shared lock
> in lookup_slow(), but it's easier to review if carved up like that. From the
> testing POV it's one chunk - it is bisect-safe, but the added code really
> comes into play only after we go for shared lock, which happens in 15/15.
> That's the core of the series.
>
> beginning of transition to parallel lookups - marking in-lookup dentries
> parallel lookups machinery, part 2
> parallel lookups machinery, part 3
> parallel lookups machinery, part 4 (and last)
>
> 15/15: parallel lookups: actual switch to rwsem
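
For those not following the git tree, the end state of lookup_slow() is
roughly the following (simplified, error handling trimmed):

    static struct dentry *lookup_slow(const struct qstr *name,
                                      struct dentry *dir, unsigned int flags)
    {
            struct dentry *dentry, *old;
            struct inode *inode = dir->d_inode;
            DECLARE_WAIT_QUEUE_HEAD_ONSTACK(wq);

            inode_lock_shared(inode);  /* lookups no longer exclude each other */
            if (unlikely(IS_DEADDIR(inode))) {  /* the 9/15 check, up front */
                    inode_unlock_shared(inode);
                    return ERR_PTR(-ENOENT);
            }
            dentry = d_alloc_parallel(dir, name, &wq);
            if (!IS_ERR(dentry) && d_in_lookup(dentry)) {
                    /* we own the in-lookup dentry; other walkers wait on wq */
                    old = inode->i_op->lookup(inode, dentry, flags);
                    d_lookup_done(dentry);  /* clear in-lookup state, wake waiters */
                    if (unlikely(old)) {
                            dput(dentry);
                            dentry = old;
                    }
            }
            /* else: someone else finished the same lookup; we got their dentry */
            inode_unlock_shared(inode);
            return dentry;
    }
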
>
> 	Note that filesystems would be free to switch some of their own
> uses of inode_lock() to grabbing it shared - it's really up to them. This
> series only changes the locking of directories, but the field has become
> an rwsem for all inodes. XFS folks in particular might be interested in
> using it...
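
i.e. nothing more than this in a read-mostly path, if the filesystem
decides the shared lock is sufficient there (fs_read_work() is a stand-in
for whatever fs-specific work is being protected):

    inode_lock_shared(inode);       /* keeps writers out, admits readers */
    error = fs_read_work(inode);
    inode_unlock_shared(inode);
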
Looks very interesting, and long awaited. How do you see the parallel
operations moving forward? Staying as lookup only, or moving on to parallel
modifications as well?
We've been carrying an out-of-tree patch for ext4 for several years to allow
parallel create/unlink of directory entries [*], as I've discussed a few
times with you in the past. It is still a bit heavyweight for doing read-only
lookups, but after this patch series it might finally be interesting to merge
into ext4, in the hope that the VFS might allow parallel directory changes in
the future. We can already do this on a Lustre server, and it would be nice
to be able to do so on the client as well, since the files may be on
different servers (hashed by name at the client to decide which server to
contact), and network latency during parallel file creates (one thread per
CPU core, which is getting into the low hundreds these days) is a much bigger
deal than for local filesystems.
The actual inode_*lock() handling would need to be delegated to the
filesystems: the VFS would use i_rwsem directly if the filesystem has no
->inode_lock() method, but call the filesystem's method when available, so
that the filesystem can do parallel create/unlink for files and directories
(though not rename, since that way lies insanity).
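
Something along these lines, entirely hypothetical since neither the
method nor the helper exists today:

    /* hypothetical: let the fs supply its own directory locking,
     * falling back to the plain i_rwsem write lock otherwise */
    static inline void vfs_dir_lock(struct inode *inode)
    {
            if (inode->i_op->inode_lock)
                    inode->i_op->inode_lock(inode);
            else
                    inode_lock(inode);
    }
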
We can discuss more at LSF, but now that you posted your patch series, I'm
curious. :-)
Cheers, Andreas
[*] The patch is not usable as-is, since it has no VFS interface and is only
up to date for 3.12; it's just linked here in case of interest. Essentially,
it creates a lock tree for the ext4 htree directory and opportunistically
locks only the leaf blocks of the directory, on the expectation that ~50-60
entries will be added to each leaf before it fills up and the code has to
back out and write-lock the parent htree block for a split. The larger the
directory, the more leaf and parent blocks can be locked concurrently, and
the better the scaling gets.
http://git.hpdd.intel.com/fs/lustre-release.git/blob/HEAD:/ldiskfs/kernel_patches/patches/sles12/ext4-pdirop.patch
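
In rough pseudocode, with all helper names made up (the real patch carries
its own htree_lock library), an insert looks like:

    htree_read_lock(index);                 /* shared on the interior blocks */
    leaf = find_target_leaf(dir, name);
    htree_write_lock(leaf);                 /* exclusive on the leaf only */
    if (leaf_has_room(leaf)) {
            insert_dirent(leaf, name);      /* common case: ~50-60 per leaf */
    } else {
            htree_unlock(leaf);             /* back out... */
            htree_write_lock(index);        /* ...and write-lock the parent */
            split_leaf_and_insert(dir, name);
    }
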
> I'll post the individual patches in followups. Again, this is also available
> in vfs.git #work.lookups (head at e2d622a right now). The thing survives
> LTP and xfstests without regressions, but more testing would certainly be
> appreciated. So would review, of course.