Re: [PATCH 09/19] VFS: add _async versions of the various directory modifying inode_operations
From: NeilBrown
Date: Thu Feb 06 2025 - 20:46:31 EST
On Fri, 07 Feb 2025, Christian Brauner wrote:
> On Thu, Feb 06, 2025 at 04:42:46PM +1100, NeilBrown wrote:
> > These "_async" versions of various inode operations are only guaranteed
> > a shared lock on the directory but if the directory isn't exclusively
> > locked then they are guaranteed an exclusive lock on the dentry within
> > the directory (which will be implemented in a later patch).
> >
> > This will allow a graceful transition from exclusive to shared locking
> > for directory updates, and even to async updates which can complete with
> > no lock on the directory - only on the dentry.
> >
> > mkdir_async is a bit different as it optionally returns a new dentry
> > for cases when the filesystem is not able to use the original dentry.
> > This allows vfs_mkdir_return() to avoid the need for an extra lookup.
> >
> > Signed-off-by: NeilBrown <neilb@xxxxxxx>
> > ---
> > Documentation/filesystems/locking.rst | 51 ++++++++-
> > Documentation/filesystems/porting.rst | 10 ++
> > Documentation/filesystems/vfs.rst | 24 +++++
> > fs/namei.c | 142 +++++++++++++++++++++-----
> > include/linux/fs.h | 24 +++++
> > 5 files changed, 223 insertions(+), 28 deletions(-)
> >
> > diff --git a/Documentation/filesystems/locking.rst b/Documentation/filesystems/locking.rst
> > index d20a32b77b60..adeead366332 100644
> > --- a/Documentation/filesystems/locking.rst
> > +++ b/Documentation/filesystems/locking.rst
> > @@ -62,15 +62,24 @@ inode_operations
> > prototypes::
> >
> > int (*create) (struct mnt_idmap *, struct inode *,struct dentry *,umode_t, bool);
> > + int (*create_async) (struct mnt_idmap *, struct inode *,struct dentry *,umode_t, bool, struct dirop_ret *);
>
> If we end up doing this then imho the correct thing to do would be to
> extend the existing operations. Yes, that's more work I know as I've
> done that multiple times myself and it's a bit more annoying churn but
> we shouldn't just keep adding new methods without a good reason.
>
> I assume that you've done that mostly so that you wouldn't be held up by
> menial work for the prototype. That's obviously fine. But for the final
> thing we should just fixup everyone.
I did it this way because it follows a pattern I've seen before.
readdir -> iterate -> iterate_shared
ioctl -> unlocked_ioctl
There are three changes happening here:
1/ add "struct dirop_ret *ret" to the end of each function. That could
certainly be done across all filesystems in one patch
2/ change mkdir to return a dentry. The might be doable in a single
patch if NFS is the only filesystem affected. The change is
sufficiently intrusive that maintainers would want to review it
carefully and might want to land it through their own tree. But I
suspect there are other filesystems that would be affected and I think
it would be prohibitive to try to land this sort of change to multiple
filesystems in a single patch.
3/ change these functions to work with only a shared lock on the
directory. I could try to do something a bit like what Linus
did in
Commit 3e3271549670 ("vfs: get rid of old '->iterate' directory operation")
but the circumstances are quite different and the excuses he used
there don't apply. Also I would need to add an i_rwsem to every
inode which is unlikely to go down well with the maintainers. I'm
sure the active ones would want to manage that change themselves.
So while I hope we get to the point of discarding all the non-async
operations in a little less than the 7 years that it took to get rid of
->iterate, I don't see how to make the change without introducing new
inode_operations.
>
> > struct dentry * (*lookup) (struct inode *,struct dentry *, unsigned int);
> > int (*link) (struct dentry *,struct inode *,struct dentry *);
> > + int (*link_async) (struct dentry *,struct inode *,struct dentry *, struct dirop_ret *);
> > int (*unlink) (struct inode *,struct dentry *);
> > + int (*unlink_async) (struct inode *,struct dentry *, struct dirop_ret *);
> > int (*symlink) (struct mnt_idmap *, struct inode *,struct dentry *,const char *);
> > + int (*symlink_async) (struct mnt_idmap *, struct inode *,struct dentry *,const char *m , struct dirop_ret *);
> > int (*mkdir) (struct mnt_idmap *, struct inode *,struct dentry *,umode_t);
> > + struct dentry * (*mkdir_async) (struct mnt_idmap *, struct inode *,struct dentry *,umode_t, struct dirop_ret *);
> > int (*rmdir) (struct inode *,struct dentry *);
> > + int (*rmdir_async) (struct inode *,struct dentry *, struct dirop_ret *);
> > int (*mknod) (struct mnt_idmap *, struct inode *,struct dentry *,umode_t,dev_t);
> > + int (*mknod_async) (struct mnt_idmap *, struct inode *,struct dentry *,umode_t,dev_t, struct dirop_ret *);
> > int (*rename) (struct mnt_idmap *, struct inode *, struct dentry *,
> > struct inode *, struct dentry *, unsigned int);
> > + int (*rename_async) (struct mnt_idmap *, struct inode *, struct dentry *,
> > + struct inode *, struct dentry *, unsigned int, struct dirop_ret *);
> > int (*readlink) (struct dentry *, char __user *,int);
> > const char *(*get_link) (struct dentry *, struct inode *, struct delayed_call *);
> > void (*truncate) (struct inode *);
> > @@ -84,6 +93,9 @@ prototypes::
> > int (*atomic_open)(struct inode *, struct dentry *,
> > struct file *, unsigned open_flag,
> > umode_t create_mode);
> > + int (*atomic_open_async)(struct inode *, struct dentry *,
> > + struct file *, unsigned open_flag,
> > + umode_t create_mode, struct dirop_ret *);
> > int (*tmpfile) (struct mnt_idmap *, struct inode *,
> > struct file *, umode_t);
> > int (*fileattr_set)(struct mnt_idmap *idmap,
> > @@ -95,18 +107,33 @@ prototypes::
> > locking rules:
> > all may block
> >
> > +All directory-modifying operations are called with an exclusive lock on
> > +the target dentry or dentries using DCACHE_PAR_LOOKUP. This allows the
> > +shared lock on i_rwsem for the _async ops to be safe. The lock on
> > +i_rwsem may be dropped as soon as the op returns, though if it returns
> > +-EINPROGRESS the lock using DCACHE_PAR_UPDATE will not be dropped until
> > +the callback is called.
> > +
> > ============== ==================================================
> > ops i_rwsem(inode)
> > ============== ==================================================
> > lookup: shared
> > create: exclusive
> > +create_async: shared
> > link: exclusive (both)
> > +link_async: exclusive on source, shared on target
> > mknod: exclusive
> > +mknod_async: shared
> > symlink: exclusive
> > +symlink_async: shared
> > mkdir: exclusive
> > +mkdir_async: shared
> > unlink: exclusive (both)
> > +unlink_async: exclusive on object, shared on directory/name
> > rmdir: exclusive (both)(see below)
> > +rmdir_async: exclusive on object, shared on directory/name (see below)
> > rename: exclusive (both parents, some children) (see below)
> > +rename_async: shared (both parents) exclusive (some children) (see below)
> > readlink: no
> > get_link: no
> > setattr: exclusive
> > @@ -118,6 +145,7 @@ listxattr: no
> > fiemap: no
> > update_time: no
> > atomic_open: shared (exclusive if O_CREAT is set in open flags)
> > +atomic_open_async: shared (if O_CREAT is not set, then may not have exclusive lock on name)
> > tmpfile: no
> > fileattr_get: no or exclusive
> > fileattr_set: exclusive
> > @@ -125,8 +153,10 @@ get_offset_ctx no
> > ============== ==================================================
> >
> >
> > - Additionally, ->rmdir(), ->unlink() and ->rename() have ->i_rwsem
> > - exclusive on victim.
> > + Additionally, ->rmdir(), ->unlink() and ->rename(), as well as _async
> > + versions, have ->i_rwsem exclusive on victim. This exclusive lock
> > + may be dropped when the op completes even if the async operation is
> > + continuing.
> > cross-directory ->rename() has (per-superblock) ->s_vfs_rename_sem.
> > ->unlink() and ->rename() have ->i_rwsem exclusive on all non-directories
> > involved.
> > @@ -135,6 +165,23 @@ get_offset_ctx no
> > See Documentation/filesystems/directory-locking.rst for more detailed discussion
> > of the locking scheme for directory operations.
> >
> > +The _async operations will be passed a (non-NULL) struct dirop_ret pointer::
> > +
> > + struct dirop_ret {
> > + union {
> > + int err;
> > + struct dentry *dentry;
> > + };
> > + void (*done_cb)(struct dirop_ret*);
> > + };
> > +
> > +They may return -EINPROGRESS (or ERR_PTR(-EINPROGRESS)) in which case
> > +the op will continue asynchronously. When it completes the result,
> > +which must NOT be -EINPROGRESS, is stored in err or dentry (as
> > +appropriate) and the done_cb() function is called. Callers can only
> > +make use of the asynchrony when they determine that no lock need be held
> > +on i_rwsem.
> > +
> > xattr_handler operations
> > ========================
> >
> > diff --git a/Documentation/filesystems/porting.rst b/Documentation/filesystems/porting.rst
> > index 1639e78e3146..a736c9f30d9d 100644
> > --- a/Documentation/filesystems/porting.rst
> > +++ b/Documentation/filesystems/porting.rst
> > @@ -1157,3 +1157,13 @@ in normal case it points into the pathname being looked up.
> > NOTE: if you need something like full path from the root of filesystem,
> > you are still on your own - this assists with simple cases, but it's not
> > magic.
> > +
> > +---
> > +
> > +**recommended**
> > +
> > +create_async, link_async, unlink_async, rmdir_async, mknod_async,
> > +rename_async, atomic_open_async can be provided instead of the
> > +corresponding inode_operations with the "_async" suffix. Multiple
> > +_async operations can be performed in a given directory concurrently,
> > +but never on the same name.
> > diff --git a/Documentation/filesystems/vfs.rst b/Documentation/filesystems/vfs.rst
> > index 31eea688609a..e18655054e6c 100644
> > --- a/Documentation/filesystems/vfs.rst
> > +++ b/Documentation/filesystems/vfs.rst
> > @@ -491,15 +491,24 @@ As of kernel 2.6.22, the following members are defined:
> >
> > struct inode_operations {
> > int (*create) (struct mnt_idmap *, struct inode *,struct dentry *, umode_t, bool);
> > + int (*create_async) (struct mnt_idmap *, struct inode *,struct dentry *, umode_t, bool, struct dirop_ret *);
> > struct dentry * (*lookup) (struct inode *,struct dentry *, unsigned int);
> > int (*link) (struct dentry *,struct inode *,struct dentry *);
> > + int (*link_async) (struct dentry *,struct inode *,struct dentry *, struct dirop_ret *);
> > int (*unlink) (struct inode *,struct dentry *);
> > + int (*unlink_async) (struct inode *,struct dentry *, struct dirop_ret *);
> > int (*symlink) (struct mnt_idmap *, struct inode *,struct dentry *,const char *);
> > + int (*symlink_async) (struct mnt_idmap *, struct inode *,struct dentry *,const char *, struct dirop_ret *);
> > int (*mkdir) (struct mnt_idmap *, struct inode *,struct dentry *,umode_t);
> > + struct dentry * (*mkdir_async) (struct mnt_idmap *, struct inode *,struct dentry *,umode_t, struct dirop_ret *);
> > int (*rmdir) (struct inode *,struct dentry *);
> > + int (*rmdir_async) (struct inode *,struct dentry *, struct dirop_ret *);
> > int (*mknod) (struct mnt_idmap *, struct inode *,struct dentry *,umode_t,dev_t);
> > + int (*mknod_async) (struct mnt_idmap *, struct inode *,struct dentry *,umode_t,dev_t, struct dirop_ret *);
> > int (*rename) (struct mnt_idmap *, struct inode *, struct dentry *,
> > struct inode *, struct dentry *, unsigned int);
> > + int (*rename_async) (struct mnt_idmap *, struct inode *, struct dentry *,
> > + struct inode *, struct dentry *, unsigned int, struct dirop_ret *);
> > int (*readlink) (struct dentry *, char __user *,int);
> > const char *(*get_link) (struct dentry *, struct inode *,
> > struct delayed_call *);
> > @@ -511,6 +520,8 @@ As of kernel 2.6.22, the following members are defined:
> > void (*update_time)(struct inode *, struct timespec *, int);
> > int (*atomic_open)(struct inode *, struct dentry *, struct file *,
> > unsigned open_flag, umode_t create_mode);
> > + int (*atomic_open_async)(struct inode *, struct dentry *, struct file *,
> > + unsigned open_flag, umode_t create_mode, struct dirop_ret *);
> > int (*tmpfile) (struct mnt_idmap *, struct inode *, struct file *, umode_t);
> > struct posix_acl * (*get_acl)(struct mnt_idmap *, struct dentry *, int);
> > int (*set_acl)(struct mnt_idmap *, struct dentry *, struct posix_acl *, int);
> > @@ -524,6 +535,7 @@ Again, all methods are called without any locks being held, unless
> > otherwise noted.
> >
> > ``create``
> > +``create_async``
> > called by the open(2) and creat(2) system calls. Only required
> > if you want to support regular files. The dentry you get should
> > not have an inode (i.e. it should be a negative dentry). Here
> > @@ -546,29 +558,39 @@ otherwise noted.
> > directory inode semaphore held
> >
> > ``link``
> > +``link_async``
> > called by the link(2) system call. Only required if you want to
> > support hard links. You will probably need to call
> > d_instantiate() just as you would in the create() method
> >
> > ``unlink``
> > +``unlink_async``
> > called by the unlink(2) system call. Only required if you want
> > to support deleting inodes
> >
> > ``symlink``
> > +``symlink_async``
> > called by the symlink(2) system call. Only required if you want
> > to support symlinks. You will probably need to call
> > d_instantiate() just as you would in the create() method
> >
> > ``mkdir``
> > +``mkdir_async``
> > called by the mkdir(2) system call. Only required if you want
> > to support creating subdirectories. You will probably need to
> > call d_instantiate() just as you would in the create() method
> >
> > + mkdir_async can return an alternate dentry, much like lookup.
> > + In this case the original dentry will still be negative and will
> > + be unhashed.
> > +
> > ``rmdir``
> > +``rmdir_async``
> > called by the rmdir(2) system call. Only required if you want
> > to support deleting subdirectories
> >
> > ``mknod``
> > +``mknod_async``
> > called by the mknod(2) system call to create a device (char,
> > block) inode or a named pipe (FIFO) or socket. Only required if
> > you want to support creating these types of inodes. You will
> > @@ -576,6 +598,7 @@ otherwise noted.
> > create() method
> >
> > ``rename``
> > +``rename_async``
> > called by the rename(2) system call to rename the object to have
> > the parent and name given by the second inode and dentry.
> >
> > @@ -647,6 +670,7 @@ otherwise noted.
> > itself and call mark_inode_dirty_sync.
> >
> > ``atomic_open``
> > +``atomic_open_async``
> > called on the last component of an open. Using this optional
> > method the filesystem can look up, possibly create and open the
> > file in one atomic operation. If it wants to leave actual
> > diff --git a/fs/namei.c b/fs/namei.c
> > index 3c0feca081a2..eadde9de73bf 100644
> > --- a/fs/namei.c
> > +++ b/fs/namei.c
> > @@ -123,6 +123,41 @@
> > * PATH_MAX includes the nul terminator --RR.
> > */
> >
> > +static void dirop_done_cb(struct dirop_ret *dret)
> > +{
> > + wake_up_var(dret);
> > +}
> > +
> > +#define DO_DIROP(dir, op, ...) \
> > + ({ \
> > + struct dirop_ret dret; \
> > + int ret; \
> > + dret.err = -EINPROGRESS; \
> > + dret.done_cb = dirop_done_cb; \
> > + ret = (dir)->i_op->op(__VA_ARGS__, &dret); \
> > + if (ret == -EINPROGRESS) { \
> > + wait_var_event(&dret, \
> > + dret.err != -EINPROGRESS); \
> > + ret = dret.err; \
> > + } \
> > + ret; \
> > + })
> > +
> > +#define DO_DE_DIROP(dir, op, ...) \
> > + ({ \
> > + struct dirop_ret dret; \
> > + struct dentry *ret; \
> > + dret.dentry = ERR_PTR(-EINPROGRESS); \
> > + dret.done_cb = dirop_done_cb; \
> > + ret = (dir)->i_op->op(__VA_ARGS__, &dret); \
> > + if (ret == ERR_PTR(-EINPROGRESS)) { \
> > + wait_var_event(&dret, \
> > + dret.dentry != ERR_PTR(-EINPROGRESS)); \
> > + ret = dret.dentry; \
> > + } \
> > + ret; \
> > + })
>
> We should also try to avoid these ugly wrappers. That'll be easier if we
> don't have multiple methods as well.
I don't think that the multiple methods make a whole lot of difference
here. The same code would be in the function, it is just indented one
more level when there are multiple functions. But if you would prefer
all the duplication of boiler-plate I can do it that way.
Thanks,
NeilBrown