Re: [PATCH] eventfs: Have inodes have unique inode numbers
From: Steven Rostedt
Date: Sun Jan 28 2024 - 16:11:22 EST
On Sun, 28 Jan 2024 12:53:31 -0800
Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx> wrote:
> On Sun, 28 Jan 2024 at 12:15, Steven Rostedt <rostedt@xxxxxxxxxxx> wrote:
> >
> > I have to understand how the dentry lookup works. Basically, when the
> > ei gets deleted, it can't be freed until all dentries it references
> > (including its children) are no longer being accessed. Does that lookup
> > get called only when a dentry with the name doesn't already exist?
>
> Dentry lookup gets called with the parent inode locked for reading, so
> a lookup can happen in parallel with readdir and other dentry lookup.
>
> BUT.
>
> Each dentry is also "self-serializing", so you will never see a lookup
> on the same name (in the same directory) concurrently.
The above is what I wanted to know.
>
> The implementation is very much internal to the VFS layer, and it's
> all kinds of nasty, with a new potential lookup waiting for the old
> one, verifying that the old one is still usable, and maybe repeating
> it all until we find a successful previous lookup or we're the only
> dentry remaining.
>
> It's nasty code that is very much in the "Al Viro" camp, but the point
> is that any normal filesystem should treat lookups as being concurrent
> with non-creation events, but not concurrently multiples.
>
> There *is* some common code with "atomic_open()", where filesystems
> that implement that then want to know if it's the *first* lookup, or a
> use of a previously looked up dentry, and they'll use the
> "d_in_lookup()" thing to determine that. So this whole "keep track of
> which dentries are *currently* being looked up is actually exposed,
> but any normal filesystem should never care.
>
> But if you think you have that issue (tracefs does not), you really
> want to talk to Al extensively.
>
> > That is, can I assume that eventfs_root_lookup() is only called when
> > the VFS file system could not find an existing dentry and it has to
> > create one?
>
> Correct. For any _particular_ name, you should think of lookup as serialized.
>
> > If that's the case, then I can remove the ei->dentry and just add a ref
> > counter that it was accessed. Then the final dput() should call
> > eventfs_set_ei_status_free() (I hate that name and need to change it),
> > and if the ei->is_freed is set, it can free the ei.
>
> Note that the final 'dput()' will happen *after* the dentry has been
> removed, so what can happen is
>
> lookup("name", d1);
> ... lookup successful, dentry is used ..
> ... dentry at some point has no more users ..
> ... memory pressure prunes unused dentries ..
> ... dentry gets unhashed and is no longer visible ..
> lookup("name", d2);
> ... new dentry is created ..
> final dput(d1);
> .. old dentry - that wasn't accessible any more is freed ..
Actually I was mistaken. I'm looking at the final iput() not dput().
>
> and this is actually one of the *reasons* that virtual filesystems
> must not try to cache dentry pointers in their internal data
> structures. Because note how the fuilesystem saw the new lookup(d2) of
> the same name *before* it saw the >d_release(d1) of the old dentry.
>
> And the above is fundamental: we obviously cannot call '->d_release()'
> until the old dentry is all dead and buried (lockref_mark_dead() etc),
> so pretty much by definition you'll have that ordering being possible.
>
> It's extremely unlikely, of course. I'll be you'll never hit it in testing.
>
> So if if you associate some internal data structure with a dentry,
> just *what* happens when you haven't been told abotu the old dentry
> being dead when the new one happens?
>
> See why I say that it's fundamentally wrong for a filesystem to try to
> track dentries? All the operations that can use a dentry will get one
> passed down to them by the VFS layer. The filesystem has no business
> trying to remember some dentry from a previous operation, and the
> filesystem *will* get it wrong.
>
> But also note how refcounting works fine. In fact, refcounting is
> pretty much the *only* thing that works fine. So what you *should* do
> is
>
> - at lookup(), when you save your filesystem data in "->d_fsdata",
> you increment a refcount
>
> - at ->d_release(), you decrement a refcount
>
> and now you're fine. Yes, when the above (very very unlikely)
> situation happens, you'll temporarily have a refcount incremented
> twice, but that's kind of the *point* of refcounts.
>
> Side note: this is pretty much true of any kernel data structure. If
> you have a kernel data structure that isn't just used within one
> thread, it must be refcounted. But it'as *doubly* true when you save
> references to something that the VFS maintains, because you DO NOT
> CONTROL the lifetime of that entity.
>
> > The eventfs_remove_dir() can free the ei (after SRCU) if it has no
> > references, otherwise it needs to wait for the final dput() to do the
> > free.
>
> Honestly, you should just *always* do refcounting. No "free after RCU
> delay" as an alternative. Just refcount it.
>
> Now, the RCU delay may be needed if the lookup of said structure
> happens under RCU, but no, saying "I use SRCU to make sure the
> lifetime is at least X" is just broken.
>
> The refcount is what gives the lifetime. Any form of RCU-delaying
> should then be purely about non-refcounting RCU lookups that may
> happen as the thing is dying (and said lookup should *look* at the
> refcount and say "oh, this is dead, I'm not returning this".
>
> > I think the ei->dentry was required for the dir wrapper logic that we
> > removed.
>
> I think all of this was due to the bogus readdir that created dentries
> willy-nilly and without the required serialization.
>
> And that was all horribly broken. It wasn't even the above kind of
> "really subtle race that you'll never hit in practice" broken. It was
> just absolutely broken with readdir and lookup racing on the same name
> and creating an unholy dentry mess.
Hmm, if I understand the above, I could get rid of keeping around
dentry and even remove the eventfs_set_ei_status_free().
I can try something to see if it works.
-- Steve