On 2025/1/30 00:53, Rik van Riel wrote:
On Wed, 29 Jan 2025 08:36:12 -0800
"Paul E. McKenney" <paulmck@xxxxxxxxxx> wrote:
On Wed, Jan 29, 2025 at 11:14:29AM -0500, Rik van Riel wrote:
Paul, does this look like it could do the trick,
or do we need something else to make RCU freeing
happy again?
I don't claim to fully understand the issue, but this would prevent
any RCU grace periods starting subsequently from completing. It would
not prevent RCU callbacks from being invoked for RCU grace periods that
started earlier.
So it won't prevent RCU callbacks from being invoked.
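
Just to make sure I understand the distinction above: a read-side critical
section only delays grace-period completion, it does not block callback
invocation. A minimal sketch of my understanding (the helper name is made
up; it assumes callbacks are invoked from RCU_SOFTIRQ rather than offloaded
rcuo kthreads):

#include <linux/rcupdate.h>
#include <linux/bottom_half.h>

/* Hypothetical helper, for illustration only. */
static void rcu_semantics_demo(void)
{
        rcu_read_lock();
        /*
         * This reader only delays completion of grace periods that
         * have not finished yet.  It does not stop RCU_SOFTIRQ from
         * running rcu_do_batch() on this CPU and invoking callbacks
         * whose grace periods ended earlier, because softirqs remain
         * enabled here.
         */
        rcu_read_unlock();

        /*
         * Disabling softirqs, by contrast, does keep those callbacks
         * from being invoked on this CPU inside this window.
         */
        local_bh_disable();
        /* ... no RCU_SOFTIRQ (and thus no rcu_do_batch()) here ... */
        local_bh_enable();
}

If that is right, it matches why the patch below blocks softirqs with
local_bh_disable() while the PCP lock is held.
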
That makes things clear! I guess we need a different approach.
Qi, does the patch below resolve the regression for you?
---8<---
From 5de4fa686fca15678a7e0a186852f921166854a3 Mon Sep 17 00:00:00 2001
From: Rik van Riel <riel@xxxxxxxxxxx>
Date: Wed, 29 Jan 2025 10:51:51 -0500
Subject: [PATCH 2/2] mm,rcu: prevent RCU callbacks from running with pcp lock
held
Enabling MMU_GATHER_RCU_TABLE_FREE can create contention on the
zone->lock. This turns out to be because in some configurations
RCU callbacks are called when IRQs are re-enabled inside
rmqueue_bulk, while the CPU is still holding the per-cpu pages lock.
That results in the RCU callbacks being unable to grab the
PCP lock, and taking the slow path with the zone->lock for
each item freed.
Speed things up by blocking RCU callbacks while holding the
PCP lock.
Signed-off-by: Rik van Riel <riel@xxxxxxxxxxx>
Suggested-by: Paul McKenney <paulmck@xxxxxxxxxx>
Reported-by: Qi Zheng <zhengqi.arch@xxxxxxxxxxxxx>
---
mm/page_alloc.c | 10 +++++++---
1 file changed, 7 insertions(+), 3 deletions(-)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 6e469c7ef9a4..73e334f403fd 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -94,11 +94,15 @@ static DEFINE_MUTEX(pcp_batch_high_lock);
#if defined(CONFIG_SMP) || defined(CONFIG_PREEMPT_RT)
/*
- * On SMP, spin_trylock is sufficient protection.
+ * On SMP, spin_trylock is sufficient protection against recursion.
* On PREEMPT_RT, spin_trylock is equivalent on both SMP and UP.
+ *
+ * Block softirq execution to prevent RCU frees from running in softirq
+ * context while this CPU holds the PCP lock, which could result in a whole
+ * bunch of frees contending on the zone->lock.
*/
-#define pcp_trylock_prepare(flags) do { } while (0)
-#define pcp_trylock_finish(flag) do { } while (0)
+#define pcp_trylock_prepare(flags) local_bh_disable()
+#define pcp_trylock_finish(flag) local_bh_enable()
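
For what it is worth, my mental model of the fallback path the commit
message above describes, as a rough sketch with made-up lock names (the
real PCP lock is a per-CPU lock in mm/page_alloc.c; this is not the
actual code):

#include <linux/spinlock.h>

static DEFINE_SPINLOCK(demo_pcp_lock);    /* stand-in for the per-CPU PCP lock */
static DEFINE_SPINLOCK(demo_zone_lock);   /* stand-in for zone->lock */

/* Hypothetical free path, for illustration only. */
static void demo_free_one_page(void)
{
        unsigned long flags;

        if (spin_trylock(&demo_pcp_lock)) {
                /* Fast path: stash the page on the per-CPU free list. */
                spin_unlock(&demo_pcp_lock);
                return;
        }

        /*
         * The trylock fails when this CPU already holds the PCP lock,
         * e.g. when an RCU callback freeing page tables runs in the
         * IRQ-enabled window inside rmqueue_bulk().  Each page freed
         * that way then goes straight back to the buddy allocator
         * under zone->lock.
         */
        spin_lock_irqsave(&demo_zone_lock, flags);
        /* ... hand the page back to the buddy allocator ... */
        spin_unlock_irqrestore(&demo_zone_lock, flags);
}

That per-page zone->lock traffic would also be consistent with
_raw_spin_unlock_irqrestore dominating the profile further down.
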
I just tested this, and it doesn't seem to improve much:
root@debian:~# stress-ng --timeout 60 --times --verify --metrics --no-rand-seed --mmapaddr 64
stress-ng: info: [671] dispatching hogs: 64 mmapaddr
stress-ng: info: [671] successful run completed in 60.07s (1 min, 0.07 secs)
stress-ng: info: [671] stressor       bogo ops real time  usr time  sys time   bogo ops/s   bogo ops/s
stress-ng: info: [671]                           (secs)    (secs)    (secs)   (real time) (usr+sys time)
stress-ng: info: [671] mmapaddr       19803127     60.01    235.20   1146.76    330007.29     14329.74
stress-ng: info: [671] for a 60.07s run time:
stress-ng: info: [671] 1441.59s available CPU time
stress-ng: info: [671] 235.57s user time ( 16.34%)
stress-ng: info: [671] 1147.20s system time ( 79.58%)
stress-ng: info: [671] 1382.77s total time ( 95.92%)
stress-ng: info: [671] load average: 41.42 11.91 4.10
The _raw_spin_unlock_irqrestore hotspot still exists:
15.87% [kernel] [k] _raw_spin_unlock_irqrestore
9.18% [kernel] [k] clear_page_rep
7.03% [kernel] [k] do_syscall_64
3.67% [kernel] [k] _raw_spin_lock
3.28% [kernel] [k] __slab_free
2.03% [kernel] [k] rcu_cblist_dequeue
1.98% [kernel] [k] flush_tlb_mm_range
1.88% [kernel] [k] lruvec_stat_mod_folio.part.131
1.85% [kernel] [k] get_page_from_freelist
1.64% [kernel] [k] kmem_cache_alloc_noprof
1.61% [kernel] [k] tlb_remove_table_rcu
1.39% [kernel] [k] mtree_range_walk
1.36% [kernel] [k] __alloc_frozen_pages_noprof
1.27% [kernel] [k] pmd_install
1.24% [kernel] [k] memcpy_orig
1.23% [kernel] [k] __call_rcu_common.constprop.77
1.17% [kernel] [k] free_pgd_range
1.15% [kernel] [k] pte_alloc_one
The call stack is as follows:
bpftrace -e 'k:_raw_spin_unlock_irqrestore {@[kstack,comm]=count();} interval:s:1 {exit();}'
@[
_raw_spin_unlock_irqrestore+5
hrtimer_interrupt+289
__sysvec_apic_timer_interrupt+85
sysvec_apic_timer_interrupt+108
asm_sysvec_apic_timer_interrupt+26
tlb_remove_table_rcu+48
rcu_do_batch+424
rcu_core+401
handle_softirqs+204
irq_exit_rcu+208
sysvec_apic_timer_interrupt+61
asm_sysvec_apic_timer_interrupt+26
, stress-ng-mmapa]: 8
tlb_remove_table_rcu() is called only rarely, so I guess the PCP
cache is basically empty at that point, resulting in the following
call stack: