Re: [PATCH RESEND 1/1] vfs: Really check for inode ptr in lookup_fast

From: Paul E. McKenney
Date: Sun Nov 03 2019 - 09:41:19 EST


On Sat, Nov 02, 2019 at 06:08:42PM +0000, Al Viro wrote:
> On Sat, Nov 02, 2019 at 10:22:29AM -0700, Paul E. McKenney wrote:
> > Ignoring the possibility of the more exotic compiler optimizations, if
> > the first task's load into f sees the value stored by the second task,
> > then the pair of memory barriers guarantee that the first task's load
> > into d will see the second task's store.
>
> The question was about the load into i being also safe.
>
> > In fact, you could instead say this in recent kernels:
> >
> > f = READ_ONCE(fdt[n]) // provides dependency ordering via mb on Alpha
> > mb
>
> Er... that mb comes from expanded READ_ONCE(), actually - the call chain
> is
> fdget_pos() -> __fdget_pos() -> __fdget() -> __fget_light() ->
> __fcheck_files(), either directly or via
> __fget() -> fcheck_files() -> __fcheck_files()
> rcu_dereference_raw() -> READ_ONCE() -> smp_read_barrier_depends()
> which yields mb on alpha.

Exactly. But yes, my "provides dependency ordering via mb on Alpha"
comment was not as clearly stated as it should have been. Something about
my having dispensed with proofreading in order to catch my plane... :-/

> > d = f->f_path.dentry
> > i = d->d_inode // But this is OK only if ->f_path.entry is
> > // constant throughout
>
> Yes, it is - once you hold a reference to a positive dentry, it can't
> be made negative by anybody else. See d_delete() for details; basically,
> if you have refcount > 1, dentry will be unhashed, but not made negative.

Very good, then. At least assuming that you are double-checking the ordering
(as opposed to trying to figure out what is wrong with the ordering). ;-)

> > The result of the first task's load into i requires information outside
> > of the two code fragments.
> >
> > Or am I missing your point?
>
> My point is that barriers sufficient to guarantee visibility of *f in
> the reader will also suffice to guarantee visibility of *f->f_path.dentry.
>
> On alpha it boils down to having load of d->d_inode when opening the
> file orders before the barrier prior to storing the reference to f
> in the descriptor table, so if it observes the store to d->d_inode done
> by the same CPU, that store is ordered before the barrier due to
> processor instruction order constraints and if it observes the store
> to d->d_inode done by some other CPU, that store is ordered before
> the load and before the barrier by transitivity. So in either case
> that store is ordered before the store into descriptor table.
> IOW, the reader that has enough barriers to guarantee seing ->f_path.dentry
> will be guaranteed to see ->f_path.dentry->d_inode.
>
> And yes, we will need some barriers added near some positivity checks in
> pathname resolution - that's what has started the entire thread. This
> part ("any place looking at file->f_path.dentry will have ->d_inode and
> mode bits of ->d_flags visible and stable") covers quite a few places
> that come up in the analysis...
>
> This morning catch, BTW:
>
> audit_get_nd(): don't unlock parent too early
>
> if the child has been negative and just went positive
> under us, we want coherent d_is_positive() and ->d_inode.
> Don't unlock the parent until we'd done that work...
>
> Signed-off-by: Al Viro <viro@xxxxxxxxxxxxxxxxxx>
>
> diff --git a/kernel/audit_watch.c b/kernel/audit_watch.c
> index 1f31c2f1e6fc..4508d5e0cf69 100644
> --- a/kernel/audit_watch.c
> +++ b/kernel/audit_watch.c
> @@ -351,12 +351,12 @@ static int audit_get_nd(struct audit_watch *watch, struct path *parent)
> struct dentry *d = kern_path_locked(watch->path, parent);
> if (IS_ERR(d))
> return PTR_ERR(d);
> - inode_unlock(d_backing_inode(parent->dentry));
> if (d_is_positive(d)) {
> /* update watch filter fields */
> watch->dev = d->d_sb->s_dev;
> watch->ino = d_backing_inode(d)->i_ino;
> }
> + inode_unlock(d_backing_inode(parent->dentry));
> dput(d);
> return 0;
> }
>
> For other fun bits and pieces see ceph bugs caught this week and crap in
> dget_parent() (not posted yet). The former had been ceph violating the
> "turning a previously observable dentry positive requires exclusive lock
> on parent" rule, the latter - genuine insufficient barriers in the fast
> path of dget_parent().
>
> It is converging to a reasonably small and understandable surface, actually,
> most of that being in core pathname resolution. Two big piles of nightmares
> left to review - overlayfs and (somewhat surprisingly) setxattr call chains,
> the latter due to IMA/EVM/LSM insanity...
>
> There's also some secondary stuff dropping out of that (e.g. ceph seeding
> dcache on readdir and blindly unhashing dentries it sees stale instead of
> doing d_invalidate() as it ought to - leads to fun results if you had
> something mounted on a subdirectory that got removed/recreated on server),
> but that's a separate pile of joy - doesn't affect this analysis, so
> it'll have to be dealt with later. It had been an interesting couple of
> weeks...

Looks like you are making a very productive (if perhaps somewhat painful)
tour through the code, good show!

Thanx, Paul