Re: [PATCH RESEND 1/1] vfs: Really check for inode ptr in lookup_fast

From: Ritesh Harjani
Date: Wed Oct 23 2019 - 07:06:02 EST




On 10/23/19 1:41 AM, Al Viro wrote:
On Tue, Oct 22, 2019 at 03:37:36PM +0100, Al Viro wrote:
On Tue, Oct 22, 2019 at 07:08:54PM +0530, Ritesh Harjani wrote:
I think we have still not taken this patch. Al?

or, for that matter, any callers of filename_lookup() assuming that the
lack of ENOENT means that the last call of walk_component() (from lookup_last())
has not failed with the same ENOENT and thus the result has been observed
positive.
You've picked the easiest one to hit, but on e.g. KVM setups you can have the
host thread representing the CPU where __d_set_inode_and_type() runs get
preempted (by the host kernel), leaving others with much wider window.

I had thought so about the other call sites, but as you said
maybe this was the easiest one to hit.
Then, I kind of followed your suggested fix in below link to fix at least this current crash.
https://patchwork.kernel.org/patch/10909881/


Sure, we can do that to all callers of d_is_negative/d_is_positive, but...
the same goes for any places that assumes that d_is_dir() implies that
the sucker is positive, etc.

What we have guaranteed is
* ->d_lock serializes ->d_flags/->d_inode changes
* ->d_seq is bumped before/after such changes
* positive dentry never changes ->d_inode as long as you hold
a reference (negative dentry *can* become positive right under you)

So there are 3 classes of valid users: those holding ->d_lock, those
sampling and rechecking ->d_seq and those relying upon having observed
the sucker they've pinned to be positive.

:) Thanks for simplifying like this. Agreed.




What you've been hitting is "we have it pinned, ->d_flags says it's
positive but we still observe the value of ->d_inode from before the
store to ->d_flags that has made it look positive".

Actually, your patch opens another problem there. Suppose you make
it d_really_is_positive() and hit the same race sans reordering.
Dentry is found by __d_lookup() and is negative. Right after we
return from __d_lookup() another thread makes it positive (a symlink)
- ->d_inode is set, d_really_is_positive() becomes true. OK, on we
go, pick the inode and move on. Right? ->d_flags is still not set
by the thread that made it positive. We return from lookup_fast()
and call step_into(). And get to
if (likely(!d_is_symlink(path->dentry)) ||
Which checks ->d_flags and sees the value from before the sucker
became positive. IOW, d_is_symlink() is false here. If that
was the last path component and we'd been told to follow links,
we will end up with positive dentry of a symlink coming out of
pathname resolution.


hmm. yes, looks like it, based on your above explanation.
So even though this could avoid crash, but still we may end up with
a bogus entry with current patch.



Similar fun happens if you have mkdir racing with lookup - ENOENT
is what should've happened if lookup comes first, success - if
mkdir does. This way we can hit ENOTDIR, due to false negative
from d_can_lookup().

IOW, d_really_is_negative() in lookup_fast() will paper over
one of oopsen, but it
* won't cover similar oopsen on other codepaths and
* will lead to bogus behaviour.

I'm not sure that blanket conversion of d_is_... to smp_load_acquire()
is the right solution; it might very well be that we need to do that
only on a small subset of call sites, lookup_fast() being one of
those. But we do want at least to be certain that something we'd
got from lookup_fast() in non-RCU mode already has ->d_flags visible.

We may also need similar guarantees with __d_clear_type_and_inode().

So do you think we should make use of ->d_seq for verifying this?
I see both __d_set_inode_and_type & __d_clear_type_and_inode() called
under ->d_seq_begin/->d_seq_end.

Then maybe we should use ->d_seq checking at those call sites.
We cannot unconditionally use ->d_seq checking in __d_entry_type(),
since we sometimes call this function inside ->d_seq_begin
(like in lookup_fast).



I'm going through the callers right now, will post a followup once
the things get cleaner...

Thanks for looking into this.