Re: [PATCH 00/19 v7?] RFC: Allow concurrent and async changes in a directory
From: Christian Brauner
Date: Thu Feb 06 2025 - 09:37:09 EST
On Thu, Feb 06, 2025 at 04:42:37PM +1100, NeilBrown wrote:
> This is my latest attempt at removing the requirement for an exclusive
> lock on a directory which performing updates in this. This version,
> inspired by Dave Chinner, goes a step further and allow async updates.
>
> The inode operation still requires the inode lock, at least a shared
> lock, but may return -EINPROGRES and then continue asynchronously
> without needing any ongoing lock on the directory.
>
> An exclusive lock on the dentry is held across the entire operation.
>
> This change requires various extra checks. rmdir must ensure there is
> no async creation still happening. rename between directories must
> ensure non of the relevant ancestors are undergoing async rename. There
> may be or checks that I need to consider - mounting?
Mounting takes an exclusive lock on the target inode in do_lock_mount()
and finish_automount(). As long as dont_mount() can't happen
asynchronously in vfs_rmdir(), vfs_unlink() or vfs_rename() it should be
fine.
>
> One other important change since my previous posting is that I've
> dropped the idea of taking a separate exclusive lock on the directory
> when the fs doesn't support shared locking. This cannot work as it
> doeesn't prevent lookups and filesystems don't expect a lookup while
> they are changing a directory. So instead we need to choose between
> exclusive or shared for the inode on a case-by-case basis.
Which is possibly fine if we do it similar to what I suggested in the
series. As it stands with the separate methods it's a no-go for me. But
that's a solvable problem, I think.
> To make this choice we divide all ops into four groups: create, remove,
> rename, open/create. If an inode has no operations in the group that
> require an exclusive lock, then a flag is set on the inode so that
> various code knows that a shared lock is sufficient. If the flag is not
> set, an exclusive lock is obtained.
>
> I've also added rename handling and converted NFS to use all _async ops.
>
> The motivation for this comes from the general increase in scale of
> systems. We can support very large directories and many-core systems
> and applications that choose to use large directories can hit
> unnecessary contention.
>
> NFS can easily hit this when used over a high-latency link.
> Lustre already has code to allow concurrent directory updates in the
> back-end filesystem (ldiskfs - a slightly modified ext4).
> Lustre developers believe this would also benefit the client-side
> filesystem with large core counts.
>
> The idea behind the async support is to eventually connect this to
> io_uring so that one process can launch several concurrent directory
> operations. I have not looked deeply into io_uring and cannot be
> certain that the interface I've provided will be able to be used. I
> would welcome any advice on that matter, though I hope to find time to
> explore myself. For now if any _async op returns -EINPROGRESS we simply
> wait for the callback to indicate completion.
>
> Test status: only light testing. It doesn't easily blow up, but lockdep
> complains that repeated calls to d_update_wait() are bad, even though
> it has balanced acquire and release calls. Weird?
>
> Thanks,
> NeilBrown
>
> [PATCH 01/19] VFS: introduce vfs_mkdir_return()
> [PATCH 02/19] VFS: use global wait-queue table for d_alloc_parallel()
> [PATCH 03/19] VFS: use d_alloc_parallel() in lookup_one_qstr_excl()
> [PATCH 04/19] VFS: change kern_path_locked() and
> [PATCH 05/19] VFS: add common error checks to lookup_one_qstr()
> [PATCH 06/19] VFS: repack DENTRY_ flags.
> [PATCH 07/19] VFS: repack LOOKUP_ bit flags.
> [PATCH 08/19] VFS: introduce lookup_and_lock() and friends
> [PATCH 09/19] VFS: add _async versions of the various directory
> [PATCH 10/19] VFS: introduce inode flags to report locking needs for
> [PATCH 11/19] VFS: Add ability to exclusively lock a dentry and use
> [PATCH 12/19] VFS: enhance d_splice_alias to accommodate shared-lock
> [PATCH 13/19] VFS: lock dentry for ->revalidate to avoid races with
> [PATCH 14/19] VFS: Ensure no async updates happening in directory
> [PATCH 15/19] VFS: Change lookup_and_lock() to use shared lock when
> [PATCH 16/19] VFS: add lookup_and_lock_rename()
> [PATCH 17/19] nfsd: use lookup_and_lock_one() and
> [PATCH 18/19] nfs: change mkdir inode_operation to mkdir_async
> [PATCH 19/19] nfs: switch to _async for all directory ops.