Re: [PATCH v2 5/5] locking/rwsem: Remove reader optimistic spinning

From: David Woodhouse

Date: Tue Mar 24 2026 - 10:41:38 EST


On Tue, 2025-05-27 at 09:08 +0000, Krcka, Tomas wrote:
> Hi Waiman,
>
> I recently discovered that this patch (commit 617f3ef95177
> ("locking/rwsem: Remove reader optimistic spinning")) causes up to a 50% performance drop in
> a real-life scenario where multiple processes perform operations (reads and writes) on sysfs.
> Reverting the patch restores the performance, so my suggestion would be to revert it
> from mainline as well.
>
> I initially noticed the degradation in a workload where multiple processes access the
> same sysfs node, seeing up to a 50% increase in the time needed to complete the test.
> After investigation, I traced the root cause back to two related changes:
> first, when kernfs switched from mutex to rwsem (commit 7ba0273b2f34 ("kernfs: switch kernfs to use an rwsem")),
> and ultimately to the removal of reader optimistic spinning.
>
> The lock contention tracing shows a clear pattern: a process accessing a kernfs_dentry_node and taking the semaphore for read is
> now forced into the slow path, because the lock is held by another process that operates on the same node and needs the semaphore for write.
> (See below ftrace for the exact operations.)
>
> This contrasts with the previous behavior where optimistic spinning prevented such
> situations.
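>
> For completeness: the traces below were captured via the lock contention
> tracepoints, roughly like this (a sketch, assuming tracefs is mounted at
> /sys/kernel/tracing and run as root; the stacktrace trigger is what produces
> the stack dumps further down):
>
> """
> cd /sys/kernel/tracing
> echo 1 > events/lock/contention_begin/enable
> echo 1 > events/lock/contention_end/enable
> echo 1 > events/sched/sched_switch/enable
> echo stacktrace > events/lock/contention_begin/trigger
> cat trace_pipe > /tmp/contention.log &
> # ... run the workload, then disable the events again ...
> """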
>
> I have confirmed this behavior across multiple kernel versions (6.14.4, 6.8.12, 6.6.80), as well as when backporting
> the mentioned commits to older versions (specifically v5.10).
> While the real-world impact was observed on AArch64, I've successfully reproduced
> the core issue using our test case on both AArch64 (192 vCPUs) and x86 Ice Lake (128 vCPUs) systems.
>
> While I identified this through sysfs (kernfs) operations, I believe this regression
> could affect other subsystems using reader-writer semaphores with similar access
> patterns.
>
> ftrace with the commit showing this pattern:
>
> """"
> userspace_bench-6796    [000] .....  2328.023515: contention_begin: 00000000ca66c48e (flags=READ)
> ^^^ -- this reader is now waiting for the lock, and all subsequent threads will wait behind it
>
> userspace_bench-6796    [000] d....  2328.023518: sched_switch: prev_comm=userspace_bench prev_pid=6796 prev_prio=120 prev_state=D ==> next_comm=userspace_bench next_pid=6798 next_prio=120
> userspace_bench-6806    [009] d....  2328.023524: sched_switch: prev_comm=userspace_bench prev_pid=6806 prev_prio=120 prev_state=R+ ==> next_comm=migration/9 next_pid=70 next_prio=0
> userspace_bench-6804    [004] d....  2328.023532: contention_begin: 00000000ca66c48e (flags=WRITE)
> userspace_bench-6805    [005] d....  2328.023532: contention_begin: 00000000ca66c48e (flags=WRITE)
> userspace_bench-6804    [004] d....  2328.023533: sched_switch: prev_comm=userspace_bench prev_pid=6804 prev_prio=120 prev_state=D ==> next_comm=swapper/4 next_pid=0 next_prio=120
> userspace_bench-6797    [001] .....  2328.023534: contention_begin: 00000000ca66c48e (flags=READ)
>
> ... [cut] ....
>
> userspace_bench-6807    [007] .....  2328.023661: contention_begin: 00000000ca66c48e (flags=READ)
> userspace_bench-6807    [007] d....  2328.023666: sched_switch: prev_comm=userspace_bench prev_pid=6807 prev_prio=120 prev_state=D ==> next_comm=swapper/7 next_pid=0 next_prio=120
> userspace_bench-6813    [013] .....  2328.023669: contention_begin: 00000000ca66c48e (flags=READ)
> userspace_bench-6815    [015] .....  2328.023673: contention_begin: 00000000ca66c48e (flags=READ)
> userspace_bench-6813    [013] d....  2328.023674: sched_switch: prev_comm=userspace_bench prev_pid=6813 prev_prio=120 prev_state=D ==> next_comm=swapper/13 next_pid=0 next_prio=120
> userspace_bench-6815    [015] d....  2328.023675: sched_switch: prev_comm=userspace_bench prev_pid=6815 prev_prio=120 prev_state=D ==> next_comm=swapper/15 next_pid=0 next_prio=120
> userspace_bench-6803    [003] .....  2328.026170: contention_begin: 00000000ca66c48e (flags=READ)
> userspace_bench-6803    [003] d....  2328.026171: sched_switch: prev_comm=userspace_bench prev_pid=6803 prev_prio=120 prev_state=D ==> next_comm=swapper/3 next_pid=0 next_prio=120
> userspace_bench-6798    [000] d....  2328.027162: sched_switch: prev_comm=userspace_bench prev_pid=6798 prev_prio=120 prev_state=R ==> next_comm=userspace_bench next_pid=6800 next_prio=120
> userspace_bench-6799    [001] d....  2328.027162: sched_switch: prev_comm=userspace_bench prev_pid=6799 prev_prio=120 prev_state=R ==> next_comm=userspace_bench next_pid=6801 next_prio=120
> userspace_bench-6800    [000] .....  2328.027165: contention_begin: 00000000ca66c48e (flags=READ)
> userspace_bench-6801    [001] .....  2328.027166: contention_begin: 00000000ca66c48e (flags=READ)
> userspace_bench-6800    [000] d....  2328.027167: sched_switch: prev_comm=userspace_bench prev_pid=6800 prev_prio=120 prev_state=D ==> next_comm=userspace_bench next_pid=6796 next_prio=120
> userspace_bench-6801    [001] d....  2328.027167: sched_switch: prev_comm=userspace_bench prev_pid=6801 prev_prio=120 prev_state=D ==> next_comm=userspace_bench next_pid=6799 next_prio=120
> userspace_bench-6796    [000] .....  2328.027168: contention_end: 00000000ca66c48e (ret=0)
> ^^^^ -- here the reader finally got the lock; it took ~3.6ms from contention_begin to contention_end
> """"
> Without the commit we don't see any of the above waiting; the processes only spend time in optimistic spinning.
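>
> (For reference, the behavior removed by the commit was roughly the following --
> a simplified pseudocode sketch, not the actual kernel code, with illustrative
> helper names: instead of queueing and sleeping immediately, a contended reader
> would keep spinning as long as the lock owner was still running on a CPU.)
>
> """
> /* simplified sketch; owner_on_cpu() and try_read_lock() are illustrative names */
> bool reader_optimistic_spin(struct rw_semaphore *sem)
> {
>         while (owner_on_cpu(sem) && !need_resched()) {
>                 if (try_read_lock(sem))
>                         return true;    /* got the lock without sleeping */
>                 cpu_relax();
>         }
>         return false;                   /* give up, take the sleeping slowpath */
> }
> """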
>
>
> In our situation the writer is doing this operation:
> """
> userspace_bench-6800     [031] .....     0.251700: contention_begin: 000000007a4d517c (flags=WRITE)
> userspace_bench-6800     [031] .....     0.251700: <stack trace>
> => __traceiter_contention_begin
> => rwsem_down_write_slowpath
> => down_write
> => kernfs_activate
> => kernfs_add_one
> => __kernfs_create_file
> => sysfs_add_file_mode_ns
> => internal_create_group
> => internal_create_groups.part.6
> => sysfs_create_groups
> """
>
> and the reader is doing this -- both are taking the same semaphore:
> """
> userspace_bench-6801     [095] .....     0.251700: contention_begin: 000000007a4d517c (flags=READ)
> userspace_bench-6801     [095] .....     0.251700: <stack trace>
> => __traceiter_contention_begin
> => rwsem_down_read_slowpath
> => down_read
> => kernfs_dop_revalidate
> => lookup_fast
> => walk_component
> => link_path_walk.part.74
> => path_openat
> => do_filp_open
> => do_sys_openat2
> => do_sys_open
> """
>
> ----
>
> To help investigate this issue, I've created a minimal reproduction case:
> 1. Test repository: https://github.com/tomaskrcka/sysfs_bench
> 2. The test consists of:
>   - A kernel module that creates a sysfs interface and handles file operations
>   - A userspace application that spawns writers (matching CPU core count) and readers
>
> Using the test case on kernel 6.14.4, I collected the following measurements
> (100 samples each):
>
> On AArch64 c8g (192 vCPUs):
> - Without revert:
>  * Avg: 3.50s (min: 3.39s, max: 3.63s, p99: 3.59s)
> - With revert:
>  * Avg: 2.70s (min: 2.65s, max: 3.10s, p99: 2.83s)
>  * ~23% improvement
>
> On x86 Ice Lake m6i (128 vCPUs):
> - Without revert:
>  * Avg: 6.71s (min: 6.61s, max: 7.60s, p99: 6.82s)
> - With revert:
>  * Avg: 6.28s (min: 5.89s, max: 7.52s, p99: 6.65s)
>  * ~6% improvement
>
> Could you take a look at this and let me know your thoughts?
>
> I’m happy to help with further investigation.

Any word on this? We're still carrying a patch to revert it because of
the regression (as well as patches to revert the change of kernfs to
rwsem because that was a massive performance regression too).
