Re: [RESEND PATCH 4/5] kernfs: Replace per-fs rwsem with hashed rwsems.

From: Imran Khan
Date: Wed Jan 04 2023 - 04:12:14 EST


Hello Tejun

On 14/11/2022 4:30 am, Imran Khan wrote:

[...]
>
> Below is reasoning and data for my experiments with other approaches.
>
> Since hashed kernfs_rwsem approach has been encountering problems in addressing
> some corner cases, I am thinking if some alternative approach can be taken here
> to keep kernfs_rwsem global, but replace its usage at some places with
> alternative global/hashed rwsems.
>
> For example from the current kernfs code we can see that most of the contention
> towards kernfs_rwsem is observed in down_read/up_read emanating from
> kernfs_iop_permission and kernfs_dop_revalidate:
>
> - 39.16% 38.98% showgids [kernel.kallsyms] [k] down_read
> 38.98% __libc_start_main
> __open_nocancel
> entry_SYSCALL_64_after_hwframe
> do_syscall_64
> sys_open
> do_sys_open
> do_filp_open
> - path_openat
> - 36.54% link_path_walk
> - 20.23% inode_permission
> - __inode_permission
> - 20.22% kernfs_iop_permission
> down_read
> - 15.06% walk_component
> lookup_fast
> d_revalidate.part.24
> kernfs_dop_revalidate
> down_read
> - 1.25% kernfs_iop_get_link
> down_read
> - 1.25% may_open
> inode_permission
> __inode_permission
> kernfs_iop_permission
> down_read
> - 1.20% lookup_fast
> d_revalidate.part.24
> kernfs_dop_revalidate
> down_read
> - 28.96% 28.83% showgids [kernel.kallsyms] [k] up_read
> 28.83% __libc_start_main
> __open_nocancel
> entry_SYSCALL_64_after_hwframe
> do_syscall_64
> sys_open
> do_sys_open
> do_filp_open
> - path_openat
> - 28.42% link_path_walk
> - 18.09% inode_permission
> - __inode_permission
> - 18.07% kernfs_iop_permission
> up_read
> - 9.08% walk_component
> lookup_fast
> - d_revalidate.part.24
> - 9.08% kernfs_dop_revalidate
> up_read
> - 1.25% kernfs_iop_get_link
>
> In the above snippet down_read/up_read of kernfs_rwsem is taking ~68% of CPU. We
> also know that cache line bouncing for kernfs_rwsem is major contributor towards
> this overhead because as per [2], changing kernfs_rwsem to a per-cpu
> kernfs_rwsem reduced this to a large extent.
>
> Now kernfs_iop_permission is taking kernfs_rwsem to access inode attributes
> which are not accessed in kernfs_dop_revalidate (the other path contending for
> kernfs_rwsem). So if we use a separate rwsem for protecting inode attributes we
> can see that contention towards kernfs_rwsem greatly reduces. For example using
> a global rwsem (kernfs_iattr_rwsem) to protect inode attributes as per following
> patch:
>
> diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c
> index 6acd9c3d4cff..f185427131f9 100644
> --- a/fs/kernfs/dir.c
> +++ b/fs/kernfs/dir.c
> @@ -757,11 +757,13 @@ int kernfs_add_one(struct kernfs_node *kn)
> goto out_unlock;
>
> /* Update timestamps on the parent */
> + down_write(&root->kernfs_iattr_rwsem);
> ps_iattr = parent->iattr;
> if (ps_iattr) {
> ktime_get_real_ts64(&ps_iattr->ia_ctime);
> ps_iattr->ia_mtime = ps_iattr->ia_ctime;
> }
> + up_write(&root->kernfs_iattr_rwsem);
>
> up_write(&root->kernfs_rwsem);
>
> @@ -1442,10 +1444,12 @@ static void __kernfs_remove(struct kernfs_node *kn)
> pos->parent ? pos->parent->iattr : NULL;
>
> /* update timestamps on the parent */
> + down_write(&root->kernfs_iattr_rwsem);
> if (ps_iattr) {
> ktime_get_real_ts64(&ps_iattr->ia_ctime);
> ps_iattr->ia_mtime = ps_iattr->ia_ctime;
> }
> + up_write(&root->kernfs_iattr_rwsem);
>
> kernfs_put(pos);
> }
> diff --git a/fs/kernfs/inode.c b/fs/kernfs/inode.c
> index 74f3453f4639..1b8bffc6d2d3 100644
> --- a/fs/kernfs/inode.c
> +++ b/fs/kernfs/inode.c
> @@ -101,9 +101,9 @@ int kernfs_setattr(struct kernfs_node *kn, const struct
> iattr *iattr)
> int ret;
> struct kernfs_root *root = kernfs_root(kn);
>
> - down_write(&root->kernfs_rwsem);
> + down_write(&root->kernfs_iattr_rwsem);
> ret = __kernfs_setattr(kn, iattr);
> - up_write(&root->kernfs_rwsem);
> + up_write(&root->kernfs_iattr_rwsem);
> return ret;
> }
>
> @@ -119,7 +119,7 @@ int kernfs_iop_setattr(struct user_namespace *mnt_userns,
> struct dentry *dentry,
> return -EINVAL;
>
> root = kernfs_root(kn);
> - down_write(&root->kernfs_rwsem);
> + down_write(&root->kernfs_iattr_rwsem);
> error = setattr_prepare(&init_user_ns, dentry, iattr);
> if (error)
> goto out;
> @@ -132,7 +132,7 @@ int kernfs_iop_setattr(struct user_namespace *mnt_userns,
> struct dentry *dentry,
> setattr_copy(&init_user_ns, inode, iattr);
>
> out:
> - up_write(&root->kernfs_rwsem);
> + up_write(&root->kernfs_iattr_rwsem);
> return error;
> }
> @@ -189,10 +189,10 @@ int kernfs_iop_getattr(struct user_namespace *mnt_userns,
> struct kernfs_node *kn = inode->i_private;
> struct kernfs_root *root = kernfs_root(kn);
>
> - down_read(&root->kernfs_rwsem);
> + down_read(&root->kernfs_iattr_rwsem);
> kernfs_refresh_inode(kn, inode);
> generic_fillattr(&init_user_ns, inode, stat);
> - up_read(&root->kernfs_rwsem);
> + up_read(&root->kernfs_iattr_rwsem);
>
> return 0;
> }
>
> @@ -285,10 +285,10 @@ int kernfs_iop_permission(struct user_namespace *mnt_userns,
> kn = inode->i_private;
> root = kernfs_root(kn);
>
> - down_read(&root->kernfs_rwsem);
> + down_read(&root->kernfs_iattr_rwsem);
> kernfs_refresh_inode(kn, inode);
> ret = generic_permission(&init_user_ns, inode, mask);
> - up_read(&root->kernfs_rwsem);
> + up_read(&root->kernfs_iattr_rwsem);
>
> return ret;
> }
>
> diff --git a/fs/kernfs/kernfs-internal.h b/fs/kernfs/kernfs-internal.h
> index fc5821effd97..4620b74f44b0 100644
> --- a/fs/kernfs/kernfs-internal.h
> +++ b/fs/kernfs/kernfs-internal.h
> @@ -47,6 +47,7 @@ struct kernfs_root {
>
> wait_queue_head_t deactivate_waitq;
> struct rw_semaphore kernfs_rwsem;
> + struct rw_semaphore kernfs_iattr_rwsem;
> };
>
>
>
> greatly reduces the CPU usage seen earlier in down_read/up_read:
>
>
> - 13.08% 13.02% showgids [kernel.kallsyms] [k] down_read
> 13.02% __libc_start_main
> __open_nocancel
> entry_SYSCALL_64_after_hwframe
> do_syscall_64
> sys_open
> do_sys_open
> do_filp_open
> - path_openat
> - 12.18% link_path_walk
> - 9.44% inode_permission
> - __inode_permission
> - 9.43% kernfs_iop_permission
> down_read
> - 2.53% walk_component
> lookup_fast
> d_revalidate.part.24
> kernfs_dop_revalidate
> down_read
> + 0.62% may_open
> - 13.03% 12.97% showgids [kernel.kallsyms] [k] up_read
> 12.97% __libc_start_main
> __open_nocancel
> entry_SYSCALL_64_after_hwframe
> do_syscall_64
> sys_open
> do_sys_open
> do_filp_open
> - path_openat
> - 12.86% link_path_walk
> - 11.55% inode_permission
> - __inode_permission
> - 11.54% kernfs_iop_permission
> up_read
> - 1.06% walk_component
> lookup_fast
> - d_revalidate.part.24
> - 1.06% kernfs_dop_revalidate
>
> As can be seen down_read/up_read CPU usage is ~26% (compared to ~68% of default
> case).
>
> Further using a hashed rwsem for protecting inode attributes as per following patch:
>
> diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c
> index 6acd9c3d4cff..dfc0d2167d86 100644
> --- a/fs/kernfs/dir.c
> +++ b/fs/kernfs/dir.c
> @@ -734,6 +734,7 @@ int kernfs_add_one(struct kernfs_node *kn)
> struct kernfs_iattrs *ps_iattr;
> bool has_ns;
> int ret;
> + struct rw_semaphore *rwsem;
>
> down_write(&root->kernfs_rwsem);
>
> @@ -757,11 +758,13 @@ int kernfs_add_one(struct kernfs_node *kn)
> goto out_unlock;
>
> /* Update timestamps on the parent */
> + rwsem = kernfs_iattr_down_write(kn);
> ps_iattr = parent->iattr;
> if (ps_iattr) {
> ktime_get_real_ts64(&ps_iattr->ia_ctime);
> ps_iattr->ia_mtime = ps_iattr->ia_ctime;
> }
> + kernfs_iattr_up_write(rwsem, kn);
>
> up_write(&root->kernfs_rwsem);
>
> @@ -1443,8 +1446,10 @@ static void __kernfs_remove(struct kernfs_node *kn)
>
> /* update timestamps on the parent */
> if (ps_iattr) {
> + struct rw_semaphore *rwsem =
> kernfs_iattr_down_write(pos->parent);
> ktime_get_real_ts64(&ps_iattr->ia_ctime);
> ps_iattr->ia_mtime = ps_iattr->ia_ctime;
> + kernfs_iattr_up_write(rwsem, kn);
> }
>
> kernfs_put(pos);
> diff --git a/fs/kernfs/inode.c b/fs/kernfs/inode.c
> index 74f3453f4639..2b3cd5a9464f 100644
> --- a/fs/kernfs/inode.c
> +++ b/fs/kernfs/inode.c
> @@ -99,11 +99,12 @@ int __kernfs_setattr(struct kernfs_node *kn, const struct
> iattr *iattr)
> int kernfs_setattr(struct kernfs_node *kn, const struct iattr *iattr)
> {
> int ret;
> + struct rw_semaphore *rwsem;
> struct kernfs_root *root = kernfs_root(kn);
>
> - down_write(&root->kernfs_rwsem);
> + rwsem = kernfs_iattr_down_write(kn);
> ret = __kernfs_setattr(kn, iattr);
> - up_write(&root->kernfs_rwsem);
> + kernfs_iattr_up_write(rwsem, kn);
> return ret;
> }
>
> @@ -114,12 +115,13 @@ int kernfs_iop_setattr(struct user_namespace *mnt_userns,
> struct dentry *dentry,
> struct kernfs_node *kn = inode->i_private;
> struct kernfs_root *root;
> int error;
> + struct rw_semaphore *rwsem;
>
> if (!kn)
> return -EINVAL;
>
> root = kernfs_root(kn);
> - down_write(&root->kernfs_rwsem);
> + rwsem = kernfs_iattr_down_write(kn);
> error = setattr_prepare(&init_user_ns, dentry, iattr);
> if (error)
> goto out;
> @@ -132,7 +134,7 @@ int kernfs_iop_setattr(struct user_namespace *mnt_userns,
> struct dentry *dentry,
> setattr_copy(&init_user_ns, inode, iattr);
>
> out:
> - up_write(&root->kernfs_rwsem);
> + kernfs_iattr_up_write(rwsem, kn);
> return error;
> }
>
> @@ -188,11 +190,12 @@ int kernfs_iop_getattr(struct user_namespace *mnt_userns,
> struct inode *inode = d_inode(path->dentry);
> struct kernfs_node *kn = inode->i_private;
> struct kernfs_root *root = kernfs_root(kn);
> + struct rw_semaphore *rwsem;
>
> - down_read(&root->kernfs_rwsem);
> + rwsem = kernfs_iattr_down_read(kn);
> kernfs_refresh_inode(kn, inode);
> generic_fillattr(&init_user_ns, inode, stat);
> - up_read(&root->kernfs_rwsem);
> + kernfs_iattr_up_read(rwsem, kn);
>
> return 0;
> }
> @@ -278,6 +281,7 @@ int kernfs_iop_permission(struct user_namespace *mnt_userns,
> struct kernfs_node *kn;
> struct kernfs_root *root;
> int ret;
> + struct rw_semaphore *rwsem;
>
> if (mask & MAY_NOT_BLOCK)
> return -ECHILD;
> @@ -285,10 +289,10 @@ int kernfs_iop_permission(struct user_namespace *mnt_userns,
> kn = inode->i_private;
> root = kernfs_root(kn);
>
> - down_read(&root->kernfs_rwsem);
> + rwsem = kernfs_iattr_down_read(kn);
> kernfs_refresh_inode(kn, inode);
> ret = generic_permission(&init_user_ns, inode, mask);
> - up_read(&root->kernfs_rwsem);
> + kernfs_iattr_up_read(rwsem, kn);
>
> return ret;
> }
> diff --git a/fs/kernfs/kernfs-internal.h b/fs/kernfs/kernfs-internal.h
> index fc5821effd97..bd1ecd126395 100644
> --- a/fs/kernfs/kernfs-internal.h
> +++ b/fs/kernfs/kernfs-internal.h
> @@ -169,4 +169,53 @@ extern const struct inode_operations kernfs_symlink_iops;
> * kernfs locks
> */
> extern struct kernfs_global_locks *kernfs_locks;
> +
> +static inline struct rw_semaphore *kernfs_iattr_rwsem_ptr(struct kernfs_node *kn)
> +{
> + int idx = hash_ptr(kn, NR_KERNFS_LOCK_BITS);
> +
> + return &kernfs_locks->iattr_rwsem[idx];
> +}
> +
> +static inline struct rw_semaphore *kernfs_iattr_down_write(struct kernfs_node *kn)
> +{
> + struct rw_semaphore *rwsem;
> +
> + kernfs_get(kn);
> +
> + rwsem = kernfs_iattr_rwsem_ptr(kn);
> +
> + down_write(rwsem);
> +
> + return rwsem;
> +}
> +
> +static inline void kernfs_iattr_up_write(struct rw_semaphore *rwsem,
> + struct kernfs_node *kn)
> +{
> + up_write(rwsem);
> +
> + kernfs_put(kn);
> +}
> +
> +
> +static inline struct rw_semaphore *kernfs_iattr_down_read(struct kernfs_node *kn)
> +{
> + struct rw_semaphore *rwsem;
> +
> + kernfs_get(kn);
> +
> + rwsem = kernfs_iattr_rwsem_ptr(kn);
> +
> + down_read(rwsem);
> +
> + return rwsem;
> +}
> +
> +static inline void kernfs_iattr_up_read(struct rw_semaphore *rwsem,
> + struct kernfs_node *kn)
> +{
> + up_read(rwsem);
> +
> + kernfs_put(kn);
> +}
> #endif /* __KERNFS_INTERNAL_H */
> diff --git a/fs/kernfs/mount.c b/fs/kernfs/mount.c
> index d0859f72d2d6..f282e5d65d04 100644
> --- a/fs/kernfs/mount.c
> +++ b/fs/kernfs/mount.c
> @@ -392,8 +392,10 @@ static void __init kernfs_mutex_init(void)
> {
> int count;
>
> - for (count = 0; count < NR_KERNFS_LOCKS; count++)
> + for (count = 0; count < NR_KERNFS_LOCKS; count++) {
> mutex_init(&kernfs_locks->open_file_mutex[count]);
> + init_rwsem(&kernfs_locks->iattr_rwsem[count]);
> + }
> }
>
> static void __init kernfs_lock_init(void)
> diff --git a/include/linux/kernfs.h b/include/linux/kernfs.h
> index 73f5c120def8..fcbf5e7c921c 100644
> --- a/include/linux/kernfs.h
> +++ b/include/linux/kernfs.h
> @@ -89,6 +89,7 @@ struct kernfs_iattrs;
> */
> struct kernfs_global_locks {
> struct mutex open_file_mutex[NR_KERNFS_LOCKS];
> + struct rw_semaphore iattr_rwsem[NR_KERNFS_LOCKS];
> };
>
> enum kernfs_node_type {
>
>
> gives further improvement in CPU usage:
>
> - 8.26% 8.22% showgids [kernel.kallsyms] [k] down_read
> 8.19% __libc_start_main
> __open_nocancel
> entry_SYSCALL_64_after_hwframe
> do_syscall_64
> sys_open
> do_sys_open
> do_filp_open
> - path_openat
> - 7.59% link_path_walk
> - 6.66% walk_component
> lookup_fast
> - d_revalidate.part.24
> - 6.66% kernfs_dop_revalidate
> down_read
> - 0.71% kernfs_iop_get_link
> down_read
> - 0.58% lookup_fast
> d_revalidate.part.24
> kernfs_dop_revalidate
> down_read
> - 7.44% 7.41% showgids [kernel.kallsyms] [k] up_read
> 7.39% __libc_start_main
> __open_nocancel
> entry_SYSCALL_64_after_hwframe
> do_syscall_64
> sys_open
> do_sys_open
> do_filp_open
> - path_openat
> - 7.36% link_path_walk
> - 6.45% walk_component
> lookup_fast
> d_revalidate.part.24
> kernfs_dop_revalidate
> up_read
>
> In above snippet CPU usage in down_read/up_read has gone down to ~16%
>
> So do you think that rather than replacing global kernfs_rwsem with a hashed one
> , any of the above mentioned 2 patches (1. Use a global rwsem for protecting
> inode attributes or 2. Use a hashed rwsem for protecting inode attributes)
> can be used. These patches are not breaking code paths involving multiple nodes
> that currently use global kernfs_rwsem.
> With hashed kernfs_rwsem I guess there will always be a risk of some corner case
> getting missed.
>
> If any of these approaches are acceptable, I can send the patch for review along
> with other changes of this series
>
Could you please share your feedback about the above mentioned 2 approaches to
reduce contention around global kernfs_rwsem ? Also the first 2 patches of this
series [1] are not dealing specifically with with kernfs_rwsem, so could you
please share your feedback about those 2 as well.

Thanks,
-- Imran

[1]: https://lore.kernel.org/lkml/20220810111017.2267160-1-imran.f.khan@xxxxxxxxxx/