Re: [PATCH v3 00/24] Convert vfs.txt to vfs.rst

From: Al Viro
Date: Tue Apr 02 2019 - 13:54:11 EST


On Tue, Apr 02, 2019 at 05:48:24PM +0100, Al Viro wrote:

> ->d_prune() must not drop/regain any of the locks held by caller.
> It must _not_ free anything attached to dentry - that belongs
> later in the shutdown sequence. If anything, I'm tempted to
> make it take const struct dentry * as argument, just to make
> that clear.
>
> No new (counted) references can be acquired by that point;
> lockless dcache lookup might find our dentry a match, but
> result of such lookup is not going to be legitimized - it's
> doomed to be thrown out as stale.
>
> It really makes more sense as part of struct dentry lifecycle
> description...
>
> [1] in theory, ->d_time might be changed by overlapping lockless
> call of ->d_revalidate(). Up to filesystem - VFS doesn't touch
> that field (and AFAICS only NFS uses it these days).

One addition: ->d_prune() can overlap only with
* lockless ->d_hash()/->d_compare()
* lockless ->d_revalidate()
* lockless ->d_manage()
So it must not destroy anything used by those without an RCU
delay. The same goes for ->d_release() - both the list of
the things it can overlap with and requirements re RCU delays.

In-tree ->d_prune() instances are fine and so's the majority
of ->d_release(). However, autofs ->d_release() has something
that looks like an RCU use-after-free waiting to happen:
static void autofs_dentry_release(struct dentry *de)
{
struct autofs_info *ino = autofs_dentry_ino(de);
struct autofs_sb_info *sbi = autofs_sbi(de->d_sb);

pr_debug("releasing %p\n", de);

if (!ino)
return;
...
autofs_free_ino(ino);
}
with autofs_free_ino() being straight kfree(). Which means
that the lockless case of autofs_d_manage() can run into
autofs_dentry_ino(dentry) getting freed right under it.

And there we do have this reachable:
int autofs_expire_wait(const struct path *path, int rcu_walk)
{
struct dentry *dentry = path->dentry;
struct autofs_sb_info *sbi = autofs_sbi(dentry->d_sb);
struct autofs_info *ino = autofs_dentry_ino(dentry);
int status;
int state;

/* Block on any pending expire */
if (!(ino->flags & AUTOFS_INF_WANT_EXPIRE))
return 0;
if (rcu_walk)
return -ECHILD;

the second check buggers off in lockless mode; the first one
can be done in lockless mode just fine, so AFAICS we do have
a problem there. Smells like we ought to make that kfree
in autofs_free_ino() RCU-delayed... Ian, have you, by any
chance, run into reports like that? Use-after-free or
oopsen in autofs_expire_wait() and friends, that is...

AFAICS, everything else is safe; however, looking through
those has turned up a fishy spot in ceph_d_revalidate():
} else if (d_really_is_positive(dentry) &&
ceph_snap(d_inode(dentry)) == CEPH_SNAPDIR) {
valid = 1;
Again, lockless ->d_revalidate() is called without anything
that would hold ->d_inode stable; the first part of the
condition does not guarantee that we won't run into
ceph_snap(NULL) and oops. Sure, compiler is almost certainly
not going to reload here, but we ought to use d_inode_rcu(dentry)
there.

Sigh...