Re: [PATCH 01/10] VFS: support parallel updates in the one directory.
From: Linus Torvalds
Date: Fri Aug 26 2022 - 15:07:26 EST
On Thu, Aug 25, 2022 at 7:16 PM NeilBrown <neilb@xxxxxxx> wrote:
>
> If a filesystem supports parallel modification in directories, it sets
> FS_PAR_DIR_UPDATE on the file_system_type. lookup_open() and the new
> lookup_hash_update() notice the flag and take a shared lock on the
> directory, and rely on a lock-bit in d_flags, much like parallel lookup
> relies on DCACHE_PAR_LOOKUP.
Ugh.
I absolutely believe in the DCACHE_PAR_LOOKUP model, and in "parallel
updates" being important, but I *despise* locking code like this
+ if (wq && IS_PAR_UPDATE(dir))
+ inode_lock_shared_nested(dir, I_MUTEX_PARENT);
+ else
+ inode_lock_nested(dir, I_MUTEX_PARENT);
and I really really hope there's some better model for this.
That "wq" test in particular is just absolutely disgusting. So now it
doesn't just depend on whether the directory supports parallel
updates, now the caller can choose whether to do the parallel thing or
not, and that's how "create" is different from "rename".
And that last difference is, I think, the one that makes me go "No. HELL NO".
Instead of it being up to the filesystem to say "I can do parallel
creates, but I need to serialize renames", this whole thing has been
set up to be about the caller making that decision.
That's just feels horribly horribly wrong.
Yes, I realize that to you that's just a "later patches will do
renames", but what if it really is a filesystem issue where the
filesystem can easily handle new names, but needs something else for
renames because it has various cross-directory issues, perhaps?
So I feel this is fundamentally wrong, and this ugliness is a small
effect of that wrongness.
I think we should strive for
(a) make that 'wq' and DCACHE_PAR_LOOKUP bit be unconditional
(b) aim for the inode lock being taken *after* the _lookup_hash(),
since the VFS layer side has to be able to handle the concurrency on
the dcache side anyway
(c) at that point, the serialization really ends up being about the
call into the filesystem, and aim to simply move the
inode_lock{_shared]_nested() into the filesystem so that there's no
need for a flag and related conditional locking at all.
Because right now I think the main reason we cannot move the lock into
the filesystem is literally that we've made the lock cover not just
the filesystem part, but the "lookup and create dentry" part too.
But once you have that "DCACHE_PAR_LOOKUP" bit and the
d_alloc_parallel() logic to serialize a _particular_ dentry being
created (as opposed to serializing all the sleeping ops to that
directly), I really think we should strive to move the locking - that
no longer helps the VFS dcache layer - closer to the filesystem call
and eventually into it.
Please? I think these patches are "mostly going in the right
direction", but at the same time I feel like there's some serious
mis-design going on.
Linus