Re: [PATCH v2] vfs: Don't exchange "short" filenames unconditionally.

From: Al Viro
Date: Tue Sep 30 2014 - 20:16:23 EST

On Mon, Sep 29, 2014 at 11:42:18AM -0700, Paul E. McKenney wrote:

> Assuming that incrementing the external name's reference count is
> atomic_add_unless, I could believe this part. Or if you have some
> locking that makes it impossible to increment the reference count
> in any case where there is any risk that it might be decremented
> to zero, I guess.
> Which might well be the pair of write_seqcount_begin() calls in __d_move(),
> now that I look more carefully. So if the name has to be in use somewhere
> before it can be copied, then a copy can only be created if there is at
> least one copy that is not currently being removed? If so, OK.

Huh? copy_name() does copy a _reference_, not the name itself. All the
copying involved is source-> = target-> And those
are simply unsigned char *.

write_seqcount_begin() is irrelevant here. Look: all callers of
__d_move(x, y) are holding references both to x and y. Contributing to
the refcount of dentries themselves, that is, not the names.

That gives exclusion between __d_move() and free_dentry() - the latter cannot
be called until dentry refcount reaches zero. RCU is completely irrelevant
here. In fact, no call chain leads to __d_move() under rcu_read_lock().
You must hold the target dentry hard, or it could simply be freed right
under you.

And __d_move() is taking ->d_lock on all dentries involved (in
addition to rename_lock serializing it system-wide).

What could possibly lead to refcount zero being observed on target of
__d_move()? The history of any dentry is this:
* it is created by __d_alloc(). Nobody can see it until __d_alloc()
returns. Dentry refcount (not to be confused with refcount of external
name) is 1.
* it passes through some (usually - zero) __d_move() calls.
Some - as the first argument, some - as the second one. All those
calls are serialized by global seqlock - callers must hold rename_lock.
And all of them are done by somebody who is holding a counting reference
to dentries in question.
* counting references to dentry might be taken and dropped;
eventually refcount reaches zero (under ->d_lock) and no further
counting references can be taken after that. See __dentry_kill() - the
first thing it does is poisoning the refcount, so that any future
attempt to increment it would fail. __dentry_kill() (still under ->d_lock
of dentry, ->d_lock of its parent and ->i_lock of its inode) removes
dentry from the tree, from hash and from the alias list of inode;
Then it drops the locks. At that point the only search structure dentry
might be found in is shrink list; if it's not on such list, free_dentry()
is called immediately, otherwise it's marked so that the code processing
the shrink list in question would, as soon as it gets to that sucker,
remove it from the shrink list and call the same free_dentry(). And that's
the only thing done to such dentry by somebody finding it via a shrink list.
* once free_dentry() has been reached, dentry can can be only seen
by RCU lookups, and after the grace period ends it gets physically freed.

free_dentry() isn't allowed to overlap __d_move(); to have that happen is
a serious dentry refcounting bug. No __d_move() is allowed _after_
free_dentry() has been entered, either. Again, it would take a refcounting
bug for dentries to have that happen - basically, double dput() somewhere.
If that happens, all bets are off, of course - if dentry gets unexpectedly
freed under somebody who has grabbed a reference to it and has not dropped
it yet, we are fucked.

Nothing outside of __d_move() is allowed to change -> RCU-critical
code is allowed to fetch and dereference it, and such code relies upon
a) freeing of name seen by somebody who'd done rcu_read_lock() being
delayed until after the matching rcu_read_unlock()
b) store of terminating NUL done by __d_alloc() (and never overwritten
afterwards) being seen by RCU-critical code that has found the pointer to
that name in dentry->

All other code accessing -> is required to hold one of the locks
that are held by __d_move() and its callers. Grabbing any of those leads
to smp_mb() on alpha, which serves as data dependency barrier there, so
we don't need explicit barrier there as we do in RCU-critical places - guarding
NUL will be seen.
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at
Please read the FAQ at