Re: [RFC PATCH] kernfs: release kernfs_mutex before the inode allocation

From: Ian Kent

Date: Tue Jun 30 2026 - 10:35:55 EST


On 30/6/26 11:26, Wenwu Hou wrote:
On Tue, Nov 16, 2021 at 08:49:46PM +0100, Greg Kroah-Hartman wrote:
On Tue, Nov 16, 2021 at 11:43:17AM -0800, Minchan Kim wrote:
The kernfs implementation has big lock granularity(kernfs_rwsem) so
every kernfs-based(e.g., sysfs, cgroup, dmabuf) fs are able to compete
the lock. Thus, if one of userspace goes the sleep under holding
the lock for a long time, rest of them should wait it. A example is
the holder goes direct reclaim with the lock since it needs memory
allocation. Let's fix it at common technique that release the lock
and then allocate the memory. Fortunately, kernfs looks like have
an refcount so I hope it's fine.

Signed-off-by: Minchan Kim <minchan@xxxxxxxxxx>
---
fs/kernfs/dir.c | 14 +++++++++++---
fs/kernfs/inode.c | 2 +-
fs/kernfs/kernfs-internal.h | 1 +
3 files changed, 13 insertions(+), 4 deletions(-)
What workload hits this lock to cause it to be noticeable?

Hi Greg, ;)


Sadly, perhaps by coincidence, I have a case that looks almost identical
to this so I'm pretty interested in the discussion. This scenario didn't
immediately occur to me looking at the vmcores but, yes, I think this is
a serious (potential) problem. But I don't have enough information to put
together a causal scenario either, which is a problem.


The systems that show this sort of problem are extremely heavy kernfs
users (obviously), so sleeping, even under a read lock, can be a problem,
again obviously.


I'm not sure about dropping and re-taking the lock. At first sight it
looks like it would be ok, but I don't like this sort of break in
execution continuity. A refactor that avoids memory allocations under the
lock does seem like a very sensible thing to do, though. I don't yet have
a feel for how far to go with it either, but given this code path is very
hot on the systems for which I get these problem reports, the inode lookup
op might be enough.



There was a bunch of recent work in this area to make this much more
fine-grained, and the theoretical benchmarks that people created (adding
10s of thousands of scsi disks at boot time) have gotten better.
Hi,

By 2026, the kernfs_rwsem has been split into per-fs. But the problem still
exists.

Yes, but for my case I'm not sure how much of that has made it into the
kernels people with these systems are using. I have done a fair bit to add
these improvements but change is slow in larger environments.



In a k8s cluster environment, there are a lot of processes reading /sys, for example:

Container runtime: runc
Prometheus Exporter: node_exporter

Indeed, these and a few other friends too tend to show up again and again.



If a tenant's container processes read /sys and sleep on the OOM (or direct
reclaim) path, both runc and node_exporter will be blocked, which will cause
the cluster worker node to become unschedulable.

In the case I saw it looked like there was a directory remove going on and
there were about 7M negative dentries, roughly double the positives, so
I've been chasing info to work out if the number of directory children was
large enough to cause this.


Perhaps, as demand grows over time, we will see other scenarios besides
the allocation delays you have seen.



For example, the nodejs runtime reads
/sys/devices/system/cpu/cpu%u/cpufreq/scaling_cur_freq [1]

Jun 22 21:43:17 kernel: task:node state:D stack:0 pid:2621530 ppid:1062663 flags:0x00004002
Jun 22 21:43:17 kernel: Call Trace:
Jun 22 21:43:17 kernel: <TASK>
Jun 22 21:43:17 kernel: __schedule+0x278/0x740
Jun 22 21:43:17 kernel: schedule+0x5a/0xd0
Jun 22 21:43:17 kernel: schedule_preempt_disabled+0x11/0x20
Jun 22 21:43:17 kernel: __mutex_lock.constprop.0+0x376/0x680
Jun 22 21:43:17 kernel: ? kvm_sched_clock_read+0xd/0x20
Jun 22 21:43:17 kernel: mem_cgroup_out_of_memory+0x53/0x150
Jun 22 21:43:17 kernel: try_charge_memcg+0x682/0x7d0
Jun 22 21:43:17 kernel: ? memcg_list_lru_alloc+0xa7/0x330
Jun 22 21:43:17 kernel: obj_cgroup_charge+0x70/0x170
Jun 22 21:43:17 kernel: slab_pre_alloc_hook.constprop.0+0xb6/0x1d0
Jun 22 21:43:17 kernel: ? alloc_inode+0x59/0xc0
Jun 22 21:43:17 kernel: kmem_cache_alloc_lru+0x54/0x2c0
Jun 22 21:43:17 kernel: alloc_inode+0x59/0xc0
Jun 22 21:43:17 kernel: iget_locked+0xe3/0x220
Jun 22 21:43:17 kernel: kernfs_get_inode+0x18/0x110
Jun 22 21:43:17 kernel: kernfs_iop_lookup+0x74/0xd0
Jun 22 21:43:17 kernel: __lookup_slow+0x82/0x130
Jun 22 21:43:17 kernel: walk_component+0xdb/0x150
Jun 22 21:43:17 kernel: link_path_walk.part.0.constprop.0+0x240/0x380
Jun 22 21:43:17 kernel: ? path_init+0x293/0x3d0
Jun 22 21:43:17 kernel: path_openat+0x85/0x2a0
Jun 22 21:43:17 kernel: ? seq_printf+0x8e/0xb0
Jun 22 21:43:17 kernel: do_filp_open+0xb4/0x160
Jun 22 21:43:17 kernel: ? __check_object_size.part.0+0x5e/0x130
Jun 22 21:43:17 kernel: do_sys_openat2+0x91/0xc0
Jun 22 21:43:17 kernel: __x64_sys_openat+0x53/0xa0
Jun 22 21:43:17 kernel: do_syscall_64+0x35/0x80
Jun 22 21:43:17 kernel: entry_SYSCALL_64_after_hwframe+0x78/0xe2

A rwsem writer is involved:

Jun 22 21:46:17 kernel: task:kworker/u232:0 state:D stack:0 pid:2886179 ppid:2 flags:0x00004000
Jun 22 21:46:17 kernel: Workqueue: netns cleanup_net
Jun 22 21:46:17 kernel: Call Trace:
Jun 22 21:46:17 kernel: <TASK>
Jun 22 21:46:17 kernel: __schedule+0x278/0x740
Jun 22 21:46:17 kernel: ? psi_group_change+0x226/0x3d0
Jun 22 21:46:17 kernel: schedule+0x5a/0xd0
Jun 22 21:46:17 kernel: schedule_preempt_disabled+0x11/0x20
Jun 22 21:46:17 kernel: rwsem_down_write_slowpath+0x1e2/0x4f0
Jun 22 21:46:17 kernel: down_write+0x57/0x60
Jun 22 21:46:17 kernel: kernfs_remove_by_name_ns+0x38/0xc0
Jun 22 21:46:17 kernel: remove_files+0x2b/0x70
Jun 22 21:46:17 kernel: sysfs_remove_group+0x38/0x80
Jun 22 21:46:17 kernel: sysfs_remove_groups+0x29/0x50
Jun 22 21:46:17 kernel: ib_free_port_attrs+0x92/0x170 [ib_core]
Jun 22 21:46:17 kernel: rdma_dev_exit_net+0x117/0x1d0 [ib_core]
Jun 22 21:46:17 kernel: ops_exit_list+0x30/0x70
Jun 22 21:46:17 kernel: cleanup_net+0x273/0x430
Jun 22 21:46:17 kernel: process_one_work+0x18a/0x3a0
Jun 22 21:46:17 kernel: worker_thread+0x277/0x3a0
Jun 22 21:46:17 kernel: ? __pfx_worker_thread+0x10/0x10
Jun 22 21:46:17 kernel: kthread+0xe1/0x110
Jun 22 21:46:17 kernel: ? __pfx_kthread+0x10/0x10
Jun 22 21:46:17 kernel: ret_from_fork+0x2d/0x50
Jun 22 21:46:17 kernel: ? __pfx_kthread+0x10/0x10
Jun 22 21:46:17 kernel: ret_from_fork_asm+0x1b/0x30
Jun 22 21:46:17 kernel: </TASK>

This looks very much like the scenario I've seen recently. That remove
looked like a sub-tree removal to me at the time, so the question I have
is: what's your dentry-state like?
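For anyone wanting to collect this, a minimal sketch of reading /proc/sys/fs/dentry-state follows. The field order is the one given in the kernel's sysctl documentation; the struct and function names (`struct dentry_state`, `parse_dentry_state()`, `read_dentry_state()`) are made up for the example. Note that nr_negative only counts unused negative dentries, and as far as I know older kernels that lack the field simply report 0 in that position.

```c
#include <stdio.h>

/* Field order follows /proc/sys/fs/dentry-state; the names here are local. */
struct dentry_state {
	long nr_dentry;		/* total dentries allocated */
	long nr_unused;		/* unused dentries, reclaimable */
	long age_limit;		/* age in seconds */
	long want_pages;	/* pages requested by the system */
	long nr_negative;	/* unused dentries that are also negative */
	long dummy;		/* reserved */
};

/* Parses one line of six whitespace-separated numbers; 0 on success. */
int parse_dentry_state(const char *line, struct dentry_state *ds)
{
	int n = sscanf(line, "%ld %ld %ld %ld %ld %ld",
		       &ds->nr_dentry, &ds->nr_unused, &ds->age_limit,
		       &ds->want_pages, &ds->nr_negative, &ds->dummy);

	return n == 6 ? 0 : -1;
}

/* Reads and parses the live file; 0 on success, -1 on any failure. */
int read_dentry_state(struct dentry_state *ds)
{
	char buf[256];
	int ret = -1;
	FILE *f = fopen("/proc/sys/fs/dentry-state", "r");

	if (!f)
		return -1;
	if (fgets(buf, sizeof(buf), f))
		ret = parse_dentry_state(buf, ds);
	fclose(f);
	return ret;
}
```

Sampling this a few times around the stall, and comparing nr_negative with nr_dentry, would give a rough idea of whether negative dentries dominate the way they did in my case.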


If I could get our customer to do so I'd be asking for a locking status
report, but I wouldn't want to use a debug kernel build, just a build with
lock reporting enabled ... that's just not going to happen.




In our real-world case, thousands of nodejs processes hit the cgroup memory
limit and sleep on the OOM path, runc gets blocked, and consequently the node
remains in an unschedulable state for many hours.

Right, that does sound like a different workload to my case so maybe
you're not alone, ;)




Link:
- [1] https://github.com/nodejs/node/blob/v22.19.0/deps/uv/src/unix/linux.c#L1889

But in that work, no one could find a real benchmark or use case that
anyone could even notice this type of thing. What do you have that
shows this?

I wish I could be more helpful but I too am guilty of not being able to
get enough info and have no reproducer either.


Ian