Re: [RFC PATCH] kernfs: release kernfs_mutex before the inode allocation

From: Wenwu Hou

Date: Mon Jun 29 2026 - 23:28:02 EST

On Tue, Nov 16, 2021 at 08:49:46PM +0100, Greg Kroah-Hartman wrote:
> On Tue, Nov 16, 2021 at 11:43:17AM -0800, Minchan Kim wrote:
> > The kernfs implementation has coarse lock granularity (kernfs_rwsem), so
> > every kernfs-based fs (e.g., sysfs, cgroup, dmabuf) competes for the
> > same lock. Thus, if one userspace task goes to sleep while holding the
> > lock for a long time, the rest have to wait for it. One example is a
> > holder entering direct reclaim with the lock held because it needs a
> > memory allocation. Let's fix it with the common technique of releasing
> > the lock and then allocating the memory. Fortunately, kernfs appears to
> > have a refcount, so I hope it's fine.
> >
> > Signed-off-by: Minchan Kim <minchan@xxxxxxxxxx>
> > ---
> > fs/kernfs/dir.c | 14 +++++++++++---
> > fs/kernfs/inode.c | 2 +-
> > fs/kernfs/kernfs-internal.h | 1 +
> > 3 files changed, 13 insertions(+), 4 deletions(-)
>
> What workload hits this lock to cause it to be noticeable?
>
> There was a bunch of recent work in this area to make this much more
> fine-grained, and the theoretical benchmarks that people created (adding
> 10s of thousands of scsi disks at boot time) have gotten better.

Hi,

As of 2026, kernfs_rwsem has been split into per-fs locks, but the problem
still exists.

In a k8s cluster, many processes read /sys, for example:

Container runtime: runc
Prometheus Exporter: node_exporter

If a tenant's container processes read /sys and sleep on the OOM (or direct
reclaim) path with the kernfs lock held, both runc and node_exporter will be
blocked, which causes the cluster worker node to become unschedulable.

For example, the nodejs runtime reads
/sys/devices/system/cpu/cpu%u/cpufreq/scaling_cur_freq [1] and sleeps on the
memcg OOM path during the lookup:

Jun 22 21:43:17 kernel: task:node state:D stack:0 pid:2621530 ppid:1062663 flags:0x00004002
Jun 22 21:43:17 kernel: Call Trace:
Jun 22 21:43:17 kernel: <TASK>
Jun 22 21:43:17 kernel: __schedule+0x278/0x740
Jun 22 21:43:17 kernel: schedule+0x5a/0xd0
Jun 22 21:43:17 kernel: schedule_preempt_disabled+0x11/0x20
Jun 22 21:43:17 kernel: __mutex_lock.constprop.0+0x376/0x680
Jun 22 21:43:17 kernel: ? kvm_sched_clock_read+0xd/0x20
Jun 22 21:43:17 kernel: mem_cgroup_out_of_memory+0x53/0x150
Jun 22 21:43:17 kernel: try_charge_memcg+0x682/0x7d0
Jun 22 21:43:17 kernel: ? memcg_list_lru_alloc+0xa7/0x330
Jun 22 21:43:17 kernel: obj_cgroup_charge+0x70/0x170
Jun 22 21:43:17 kernel: slab_pre_alloc_hook.constprop.0+0xb6/0x1d0
Jun 22 21:43:17 kernel: ? alloc_inode+0x59/0xc0
Jun 22 21:43:17 kernel: kmem_cache_alloc_lru+0x54/0x2c0
Jun 22 21:43:17 kernel: alloc_inode+0x59/0xc0
Jun 22 21:43:17 kernel: iget_locked+0xe3/0x220
Jun 22 21:43:17 kernel: kernfs_get_inode+0x18/0x110
Jun 22 21:43:17 kernel: kernfs_iop_lookup+0x74/0xd0
Jun 22 21:43:17 kernel: __lookup_slow+0x82/0x130
Jun 22 21:43:17 kernel: walk_component+0xdb/0x150
Jun 22 21:43:17 kernel: link_path_walk.part.0.constprop.0+0x240/0x380
Jun 22 21:43:17 kernel: ? path_init+0x293/0x3d0
Jun 22 21:43:17 kernel: path_openat+0x85/0x2a0
Jun 22 21:43:17 kernel: ? seq_printf+0x8e/0xb0
Jun 22 21:43:17 kernel: do_filp_open+0xb4/0x160
Jun 22 21:43:17 kernel: ? __check_object_size.part.0+0x5e/0x130
Jun 22 21:43:17 kernel: do_sys_openat2+0x91/0xc0
Jun 22 21:43:17 kernel: __x64_sys_openat+0x53/0xa0
Jun 22 21:43:17 kernel: do_syscall_64+0x35/0x80
Jun 22 21:43:17 kernel: entry_SYSCALL_64_after_hwframe+0x78/0xe2
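
The user-space side of the trace above is nothing more than an ordinary open
and read of a sysfs attribute. A minimal sketch in C follows. It is not the
libuv code from [1]; the function names read_sysfs_u64() and
read_scaling_cur_freq() are hypothetical and exist only for illustration.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Read one unsigned integer from a sysfs-style attribute file.
 * Returns 0 on success, -1 on failure. The open behind fopen() is what
 * makes the kernel walk the path and look up the kernfs node.
 * (Hypothetical helper, not taken from libuv.)
 */
int read_sysfs_u64(const char *path, unsigned long long *out)
{
    FILE *fp;
    int ok;

    assert(path != NULL && out != NULL);

    fp = fopen(path, "r");
    if (!fp)
        return -1;

    ok = (fscanf(fp, "%llu", out) == 1);
    fclose(fp);
    return ok ? 0 : -1;
}

/* Build the per-CPU path quoted above and read the value (in kHz). */
int read_scaling_cur_freq(unsigned int cpu, unsigned long long *khz)
{
    char path[128];

    snprintf(path, sizeof(path),
             "/sys/devices/system/cpu/cpu%u/cpufreq/scaling_cur_freq", cpu);
    return read_sysfs_u64(path, khz);
}
```

Nothing in this read is unusual, so any unprivileged tenant process can end
up in the lookup path shown in the trace.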

A writer is also waiting on the rwsem, so later readers queue behind it:

Jun 22 21:46:17 kernel: task:kworker/u232:0 state:D stack:0 pid:2886179 ppid:2 flags:0x00004000
Jun 22 21:46:17 kernel: Workqueue: netns cleanup_net
Jun 22 21:46:17 kernel: Call Trace:
Jun 22 21:46:17 kernel: <TASK>
Jun 22 21:46:17 kernel: __schedule+0x278/0x740
Jun 22 21:46:17 kernel: ? psi_group_change+0x226/0x3d0
Jun 22 21:46:17 kernel: schedule+0x5a/0xd0
Jun 22 21:46:17 kernel: schedule_preempt_disabled+0x11/0x20
Jun 22 21:46:17 kernel: rwsem_down_write_slowpath+0x1e2/0x4f0
Jun 22 21:46:17 kernel: down_write+0x57/0x60
Jun 22 21:46:17 kernel: kernfs_remove_by_name_ns+0x38/0xc0
Jun 22 21:46:17 kernel: remove_files+0x2b/0x70
Jun 22 21:46:17 kernel: sysfs_remove_group+0x38/0x80
Jun 22 21:46:17 kernel: sysfs_remove_groups+0x29/0x50
Jun 22 21:46:17 kernel: ib_free_port_attrs+0x92/0x170 [ib_core]
Jun 22 21:46:17 kernel: rdma_dev_exit_net+0x117/0x1d0 [ib_core]
Jun 22 21:46:17 kernel: ops_exit_list+0x30/0x70
Jun 22 21:46:17 kernel: cleanup_net+0x273/0x430
Jun 22 21:46:17 kernel: process_one_work+0x18a/0x3a0
Jun 22 21:46:17 kernel: worker_thread+0x277/0x3a0
Jun 22 21:46:17 kernel: ? __pfx_worker_thread+0x10/0x10
Jun 22 21:46:17 kernel: kthread+0xe1/0x110
Jun 22 21:46:17 kernel: ? __pfx_kthread+0x10/0x10
Jun 22 21:46:17 kernel: ret_from_fork+0x2d/0x50
Jun 22 21:46:17 kernel: ? __pfx_kthread+0x10/0x10
Jun 22 21:46:17 kernel: ret_from_fork_asm+0x1b/0x30
Jun 22 21:46:17 kernel: </TASK>

In our real-world case, thousands of nodejs processes hit the cgroup memory
limit and sleep on the OOM path, runc gets blocked, and consequently the node
remains in an unschedulable state for many hours.

Link:
- [1] https://github.com/nodejs/node/blob/v22.19.0/deps/uv/src/unix/linux.c#L1889

> But in that work, no one could find a real benchmark or use case that
> anyone could even notice this type of thing. What do you have that
> shows this?
>
> thanks,
>
> greg k-h