Re: [PATCH 01/10] VFS: support parallel updates in the one directory.

From: NeilBrown
Date: Sun Aug 28 2022 - 21:59:18 EST


On Sat, 27 Aug 2022, Al Viro wrote:
> On Fri, Aug 26, 2022 at 12:10:43PM +1000, NeilBrown wrote:
>
> > +/**
> > + * d_lock_update_nested - lock a dentry before updating
> > + * @dentry: the dentry to be locked
> > + * @base: the parent, or %NULL
> > + * @name: the name in that parent, or %NULL
> > + * @subclass: lockdep locking class.
> > + *
> > + * Lock a dentry in a directory on which a shared-lock may be held, and
> > + * on which parallel updates are permitted.
> > + * If the base and name are given, then on success the dentry will still
> > + * have that base and name - it will not have raced with rename.
> > + * On success, a positive dentry will still be hashed, ensuring there
> > + * was no race with unlink.
> > + * If they are not given, then on success the dentry will be negative,
> > + * which again ensures no race with rename, or unlink.
>
> I'm not sure it's a good idea to have that in one function, TBH.
> Looking at the callers, there are
> * lookup_hash_update()
> lookup_hash_update_len()
> nfsd shite
> filename_create_one()
> filename_create_one_len()
> nfsd shite
> filename_create()
> kern_path_create()
> user_path_create()
> do_mknodat()
> do_mkdirat()
> do_symlinkat()
> do_linkat()
> do_rmdir()
> do_unlinkat()
> * fuckloads of callers in lock_rename_lookup()
> * d_lock_update()
> atomic_open()
> lookup_open()
>
> Only the last two can get NULL base or name. Already interesting,
> isn't it? What's more, splitup between O_CREATE open() on one
> side and everything else that might create, remove or rename on
> the other looks really weird.

Well, O_CREATE is a bit weird.
But I can see that it would be cleaner to pass the dir/name into those
two calls that currently get NULL - then remove the handling of NULL.
Thanks.

>
> > + rcu_read_lock(); /* for d_same_name() */
> > + if (d_unhashed(dentry) && d_is_positive(dentry)) {
> > + /* name was unlinked while we waited */
> > + ret = false;
>
> BTW, what happens if somebody has ->lookup() returning a positive
> unhashed? Livelock on attempt to hit it with any directory-modifying
> syscall? Right now such behaviour is permitted; I don't know if
> anything in the tree actually does it, but at the very least it
> would need to be documented.
>
> Note that *negative* unhashed is not just permitted, it's
> actively used e.g. by autofs. That really needs to be well
> commented...

I hadn't thought that ->lookup() could return anything unhashed. I'll
look into that - thanks.

>
> > + } else if (base &&
> > + (dentry->d_parent != base ||
> > + dentry->d_name.hash != name->hash ||
> > + !d_same_name(dentry, base, name))) {
> > + /* dentry was renamed - possibly silly-rename */
> > + ret = false;
> > + } else if (!base && d_is_positive(dentry)) {
> > + ret = false;
> > + } else {
> > + dentry->d_flags |= DCACHE_PAR_UPDATE;
> > + }
>
> > + * Parent directory has inode locked exclusive, or possibly shared if wq
> > + * is given. In the later case the fs has explicitly allowed concurrent
> > + * updates in this directory. This is the one and only case
> > + * when ->lookup() may be called on a non in-lookup dentry.
>
> What Linus already said about wq... To add a reason he hadn't mentioned,
> the length of call chain one needs to track to figure out whether it's
> NULL or not is... excessive. And I don't mean just "greater than 0".
> We have places like that, and sometimes we have to, but it's never a good
> thing...
>
> > static struct dentry *__lookup_hash(const struct qstr *name,
> > - struct dentry *base, unsigned int flags)
> > + struct dentry *base, unsigned int flags,
> > + wait_queue_head_t *wq)
>
> > - dentry = d_alloc(base, name);
> > + if (wq)
> > + dentry = d_alloc_parallel(base, name, wq);
> > + else
> > + dentry = d_alloc(base, name);
> > if (unlikely(!dentry))
> > return ERR_PTR(-ENOMEM);
> > + if (IS_ERR(dentry))
> > + return dentry;
>
> BTW, considering the fact that we have 12 callers of d_alloc()
> (including this one) and 28 callers of its wrapper (d_alloc_name()), I
> would seriously consider converting d_alloc() from "NULL or new dentry"
> to "ERR_PTR(-ENOMEM) or new dentry". Especially since quite a few of
> those callers will be happier that way. Grep and you'll see... As a
> side benefit, if (unlikely(!dentry)) turns into if (IS_ERR(dentry)).
>
> > +static struct dentry *lookup_hash_update(
> > + const struct qstr *name,
> > + struct dentry *base, unsigned int flags,
> > + wait_queue_head_t *wq)
> > +{
> > + struct dentry *dentry;
> > + struct inode *dir = base->d_inode;
> > + int err;
> > +
> > + if (wq && IS_PAR_UPDATE(dir))
> > + inode_lock_shared_nested(dir, I_MUTEX_PARENT);
> > + else
> > + inode_lock_nested(dir, I_MUTEX_PARENT);
> > +
> > +retry:
> > + dentry = __lookup_hash(name, base, flags, wq);
> > + if (IS_ERR(dentry)) {
> > + err = PTR_ERR(dentry);
> > + goto out_err;
> > + }
> > + if (!d_lock_update_nested(dentry, base, name, I_MUTEX_PARENT)) {
> > + /*
> > + * Failed to get lock due to race with unlink or rename
> > + * - try again
> > + */
> > + d_lookup_done(dentry);
>
> When would we get out of __lookup_hash() with in-lookup dentry?
> Confused...

Whenever wq is passed in and ->lookup() decides, based on the flags, to do
nothing.
NFS does this for LOOKUP_CREATE|LOOKUP_EXCL and for LOOKUP_RENAME_TARGET

>
> > +struct dentry *lookup_hash_update_len(const char *name, int nlen,
> > + struct path *path, unsigned int flags,
>
> const struct path *, damnit...

Sorry....

>
> > + wait_queue_head_t *wq)
> > +{
> > + struct qstr this;
> > + int err = lookup_one_common(mnt_user_ns(path->mnt), name,
> > + path->dentry, nlen, &this);
> > + if (err)
> > + return ERR_PTR(err);
> > + return lookup_hash_update(&this, path->dentry, flags, wq);
> > +}
> > +EXPORT_SYMBOL(lookup_hash_update_len);
>
> Frankly, the calling conventions of the "..._one_len" family is something
> I've kept regretting for a long time. Oh, well...
>
> > +static void done_path_update(struct path *path, struct dentry *dentry,
> > + bool with_wq)
> > +{
> > + struct inode *dir = path->dentry->d_inode;
> > +
> > + d_lookup_done(dentry);
> > + d_unlock_update(dentry);
> > + if (IS_PAR_UPDATE(dir) && with_wq)
> > + inode_unlock_shared(dir);
> > + else
> > + inode_unlock(dir);
> > +}
>
> const struct path *, again...
>
> > @@ -3400,6 +3499,12 @@ static struct dentry *lookup_open(struct nameidata *nd, struct file *file,
> > dentry = res;
> > }
> > }
> > + /*
> > + * If dentry is negative and this is a create we need to get
> > + * DCACHE_PAR_UPDATE.
> > + */
> > + if ((open_flag & O_CREAT) && !dentry->d_inode)
> > + have_par_update = d_lock_update(dentry, NULL, NULL);
> >
> > /* Negative dentry, just create the file */
> > if (!dentry->d_inode && (open_flag & O_CREAT)) {
>
> Fold the above here, please. What's more, losing the flag would've
> made it much easier to follow...

Yes, that can certainly be tided up - thanks.

>
> > @@ -3419,9 +3524,13 @@ static struct dentry *lookup_open(struct nameidata *nd, struct file *file,
> > error = create_error;
> > goto out_dput;
> > }
> > + if (have_par_update)
> > + d_unlock_update(dentry);
> > return dentry;
> >
> > out_dput:
> > + if (have_par_update)
> > + d_unlock_update(dentry);
>
>
> > @@ -3821,27 +3962,28 @@ struct dentry *kern_path_create(int dfd, const char *pathname,
> > struct path *path, unsigned int lookup_flags)
>
> BTW, there's 9 callers of that sucker in the entire tree, along with
> 3 callers of user_path_create() and 14 callers of done_path_create().
> Not a big deal to add the wq in those, especially since it can be easily
> split into preparatory patch (with wq passed, but being unused).
>
> > -void done_path_create(struct path *path, struct dentry *dentry)
> > +void done_path_create_wq(struct path *path, struct dentry *dentry, bool with_wq)
>
> Why "with_wq" and not the wq itself?
>

I did that at first. However when I did silly rename in NFS I found
that I wanted to call the 'done' thing when I didn't still have the wq.
...which might mean I have a bug.

And the done_path_create_wq() doesn't do anything with the wq so ...

I'll look at that again - thanks.


> > - * The caller must hold dir->i_mutex.
> > + * The caller must either hold a write-lock on dir->i_rwsem, or
> > + * a have atomically set DCACHE_PAR_UPDATE, or both.
>
> ???

That's a hangover from an earlier version where I didn't set
DCACHE_PAR_UPDATE when we had an exclusive lock. Will fix.


>
> > + * If the filesystem permits (IS_PAR_UPDATE()), we take a shared lock on the
> > + * directory and set DCACHE_PAR_UPDATE to get exclusive access to the dentry.
>
> The latter happens unconditionally here, unless I'm misreading the code (as well
> as your cover letter). It does *NOT* happen on rename(), though, contrary to
> the same. And while your later patch adds it in lock_rename_lookup(), existing
> lock_rename() callers do not get that at all. Likely to be a problem...

Between the patch were DCACHE_PAR_UPDATE is added and the patch were
lock_rename_lookup() is added, all filesystems use exclusive locks and
none of them check DCACHE_PAR_UPDATE. So how can there be a problem?


>
> > --- a/include/linux/dcache.h
> > +++ b/include/linux/dcache.h
> > @@ -13,7 +13,9 @@
> > #include <linux/rcupdate.h>
> > #include <linux/lockref.h>
> > #include <linux/stringhash.h>
> > +#include <linux/sched.h>
>
> Bloody hell, man...
>

I wonder what that was for .... removed.


> > +static inline void d_unlock_update(struct dentry *dentry)
> > +{
> > + if (IS_ERR_OR_NULL(dentry))
> > + return;
>
> Do explain... When could we ever get NULL or ERR_PTR() passed to that?

Another hangover from earlier iterations - removed. Thanks.

>
>
> BTW, I would seriously look into splitting the "let's add a helper
> that combines locking parent with __lookup_hash()" into a preliminary
> patch. Would be easier to follow.
>
Will look into that.

Thanks a lot for the thorough review!

NeilBrown