Re: [PATCH RESEND 1/1] vfs: Really check for inode ptr in lookup_fast

From: Paul E. McKenney
Date: Sat Nov 02 2019 - 13:22:33 EST


On Fri, Nov 01, 2019 at 11:46:22PM +0000, Al Viro wrote:
> On Wed, Oct 23, 2019 at 04:35:50PM +0530, Ritesh Harjani wrote:
>
> > > > What we have guaranteed is
> > > > * ->d_lock serializes ->d_flags/->d_inode changes
> > > > * ->d_seq is bumped before/after such changes
> > > > * positive dentry never changes ->d_inode as long as you hold
> > > > a reference (negative dentry *can* become positive right under you)
> > > >
> > > > So there are 3 classes of valid users: those holding ->d_lock, those
> > > > sampling and rechecking ->d_seq and those relying upon having observed
> > > > the sucker they've pinned to be positive.
> >
> > :) Thanks for simplifying like this. Agreed.
>
> FWIW, after fixing several ceph bugs, add to that the following:
> * all places that turn a negative dentry into positive one are
> holding its parent exclusive or dentry has not been observable for
> anybody else. It had been present in the parent's list of children
> (negative and unhashed) and it might have been present in in-lookup
> hashtable. However, nobody is going to grab a reference to it from there
> without having grabbed ->d_lock on it and observed the state after
> it became positive.
>
> Which means that holding a reference to dentry *and* holding its
> parent at least shared stabilizes both ->d_inode and type bits in
> ->d_flags. The situation with barriers is more subtle - *IF* we
> had sufficient barriers to have ->d_inode/type bits seen right
> after having gotten the reference, we are fine. The only change
> possible after that point is negative->positive transition and
> that gets taken care of by barriers provided by ->i_rwsem.
>
> If we'd obtained that reference by d_lookup() or __d_lookup(),
> we are fine - ->d_lock gives a barrier. The same goes for places
> that grab references during a tree traversal, provided that they
> hold ->d_lock around that (fs/autofs/expire.c stuff). The same goes
> for having it found in inode's aliases list (->i_lock).
>
> I really hope that the same applies to accesses to file_dentry(file);
> on anything except alpha that would be pretty much automatic and
> on alpha we get the things along the lines of
>
> f = fdt[n]
> mb
> d = f->f_path.dentry
> i = d->d_inode
> assert(i != NULL)
> vs.
> see that d->d_inode is non-NULL
> f->f_path.dentry = d
> mb
> fdt[n] = f

Ignoring the possibility of the more exotic compiler optimizations, if
the first task's load into f sees the value stored by the second task,
then the pair of memory barriers guarantee that the first task's load
into d will see the second task's store.

In fact, you could instead say this in recent kernels:

f = READ_ONCE(fdt[n]) // provides dependency ordering via mb on Alpha
mb
d = f->f_path.dentry
i = d->d_inode // But this is OK only if ->f_path.entry is
// constant throughout
assert(i != NULL)
vs.
see that d->d_inode is non-NULL
f->f_path.dentry = d
mb
fdt[n] = f

The result of the first task's load into i requires information outside
of the two code fragments.

Or am I missing your point?

> IOW, the barriers that make it safe to fetch the fields of struct file
> (rcu_dereference_raw() in __fcheck_files() vs. smp_store_release()
> in __fd_install() in the above) should *hopefully* take care of all
> stores visible by the time of do_dentry_open(). Sure, alpha cache
> coherency is insane, but AFAICS it's not _that_ insane.

Agreed, not -that- insane. ;-)

> Question to folks familiar with alpha memory model:
>
> A = 0, B = NULL, C = NULL
> CPU1:
> A = 1
>
> CPU2:
> r1 = A
> if (r1) {
> B = &A
> mb
> C = &B
> }
>
> CPU3:
> r2 = C;
> mb
> if (r2) { // &B
> r3 = *r2 // &A
> r4 = *r3 // 1
> assert(r4 == 1)
> }
>
> is the above safe on alpha?

Looks that way to me. LKMM agrees:

C viro

{
}

P0(int *a)
{
WRITE_ONCE(*a, 1);
}

P1(int *a, int **b, int ***c)
{
int r1;

r1 = READ_ONCE(*a);
if (r1) {
WRITE_ONCE(*b, a);
smp_mb();
WRITE_ONCE(*c, b);
}
}

P2(int ***c)
{
int **r2;
int *r3;
int r4;

r2 = READ_ONCE(*c);
smp_mb();
if (r2) {
r3 = READ_ONCE(*r2);
r4 = READ_ONCE(*r3);
}
}

exists (1:r1=1 /\ ~2:r2=0 /\ 2:r4=0)

Which gets us this:

$ herd7 -conf linux-kernel.cfg /tmp/viro.litmus
Test viro Allowed
States 3
1:r1=0; 2:r2=0; 2:r4=0;
1:r1=1; 2:r2=0; 2:r4=0;
1:r1=1; 2:r2=b; 2:r4=1;
No
Witnesses
Positive: 0 Negative: 3
Condition exists (1:r1=1 /\ not (2:r2=0) /\ 2:r4=0)
Observation viro Never 0 3
Time viro 0.01
Hash=cabcc7f3122771a04dd21686b2d58124

The state "1:r1=1; 2:r2=b; 2:r4=0;" does not appear, as expected.

Thanx, Paul

> [snip]
>
> > We may also need similar guarantees with __d_clear_type_and_inode().
>
> Not really - pinned dentry can't go negative. In any case, with the
> audit I've done so far, I don't believe that blanket solutions like
> that are good idea - most of the places doing checks are safe as it is.
> The surface that needs to be taken care of is fairly small, actually;
> most of that is in fs/namei.c and fs/dcache.c.