Re: [PATCH v3 04/19] VFS: use wait_var_event for waiting in d_alloc_parallel()
From: NeilBrown
Date: Fri May 01 2026 - 06:46:52 EST
On Fri, 01 May 2026, Al Viro wrote:
> On Fri, May 01, 2026 at 11:39:59AM +1000, NeilBrown wrote:
>
> > I dislike them too. I doubt I can find solutions that either of us
> > like, but they should be relatively short-lived. Once we push the
> > locking down in the the inode_operations the filesystem will be in a
> > position to hold the lock only when it actually needs it (if at all).
>
> ... or we'll end up with hard-to-formulate constraints on what a filesystem
> may do with its internal locking to use the APIs provided by fs/{dcache,namei}.c
> safely.
>
> Note that e.g. "->iterate_shared() wants to know the synthetic inumbers
> a concurrent stat(2) would inject into dcache" (which is the original
> reason for dcache preseeding in that thing) is not uncommon. In procfs
> you are lucky to have no mkdir() and friends; the same is not true in
> general and we'd better have a sane answer to "what could a filesystem
> like that do with its internal locking". Or that thing will get blindly
> copied, with predictable results.
You say "not uncommon" but I can only find it in procfs and overlayfs.
You only need this if there is a backing store that doesn't provide
anything usable as inode numbers.
If there is no backing store, then you'll populate with
d_make_persistent() and provide inode numbers at that time.
If the backing store provides inode numbers - you use them.
For procfs the backing store is various kernel data structures (e.g. the
process table). For overlayfs the backing store is a blend of 2 or more
other filesystems.
In either case, the backing store has its own locking protocol to manage
create/remove/iterate, so the filesystem likely doesn't need to provide
one.
For procfs I imagine that once we push parent parent-locking down in to
the filesystem, we'll simply remove it all from procfs. Then there will
be no problem calling d_alloc_parallel() in iterate_shared as the thread
which owns the in-lookup dentry will never try to lock the parent.
For overlayfs I *think* there is no need for parent locking in ->lookup
so it is possible the same approach will work, though more careful
analysis will be needed.
The current code to drop and retake the lock is only transitional code
to get from here to there is manageable steps.
>
> > I'm confident that dropping the lock is safe. If there was some way to
> > tell the VFS that the lock has already been dropped, then we wouldn't
> > need to reclaim it, but I cannot see a clean way to do that.
>
> FWIW, I'm more concerned about ->iterate_shared() - d_add_ci() is garbage
> that isn't used on a sanely configured kernel; ls -lR is not going away,
> no matter what, and exclusion requirements are going to be a lot more
> interesting for that one anyway. It might be worth teaching iterate_dir()
> that in such-and-such conditions it ought to save position, drop the lock,
> do a lookup on name stashed in dir_context, retake the lock and call back into
> ->iterate_shared() from saved position. With helper callable by ->iterate_shared()
> instances if they run into failing d_alloc_trylock() in a situation where they
> can't just shrug and move on... Not sure.
>
> What kind of exclusion do you have in mind for foo_iterate_shared() in the
> long run? Assuming that filesystem has directory-modifying operations, as
> well as lookups, and its inumbers are synthetic.
I think the VFS will still be taking a shared lock on the parent when
calling ->iterate_shared. This is the easiest way to block rmdir while
a readdir is happening.
>
> BTW, do you have AFS and CIFS counterparts of your stuff from back in 2022
> that killed d_rehash() uses in fs/nfs? I would love to kill d_rehash();
> exfat use is an easily removable junk, but fs/afs and fs/smb/client ones
> are trickier and the reasons why it needed to be killed in fs/nfs apply
> to those as well.
>
For fs/afs:
https://github.com/neilbrown/linux/commit/7d074238b7674245308bb482d29f8b2eb10a6598
https://github.com/neilbrown/linux/commit/a1ab4d991e1cbf02b7ae493798067db75d88ef48
or on lore.kernel.org
20260312214330.3885211-20-neilb@xxxxxxxxxxx
20260312214330.3885211-21-neilb@xxxxxxxxxxx
for fs/smb/client
https://github.com/neilbrown/linux/commit/1df67fbccd424aaaf03a4dfeda07cd8f9e3bc682
https://github.com/neilbrown/linux/commit/3c4f603881db16984b0b44e38a634d9517b9be84
or
20260312214330.3885211-26-neilb@xxxxxxxxxxx
20260312214330.3885211-27-neilb@xxxxxxxxxxx
I don't suppose you have any memory of
Commit: a00be0e ("cifs: don't use ->d_time")
from 10 years ago. Maybe you didn't even see it.
NeilBrown