Re: [linus:master] [x86] 4817f70c25: stress-ng.mmapaddr.ops_per_sec 63.0% regression

From: Qi Zheng
Date: Wed Jan 29 2025 - 12:34:18 EST

On 2025/1/30 00:53, Rik van Riel wrote:
> On Wed, 29 Jan 2025 08:36:12 -0800
> "Paul E. McKenney" <paulmck@xxxxxxxxxx> wrote:
>> On Wed, Jan 29, 2025 at 11:14:29AM -0500, Rik van Riel wrote:
>>>
>>> Paul, does this look like it could do the trick,
>>> or do we need something else to make RCU freeing
>>> happy again?
>>
>> I don't claim to fully understand the issue, but this would prevent
>> any RCU grace periods starting subsequently from completing. It would
>> not prevent RCU callbacks from being invoked for RCU grace periods that
>> started earlier.
>>
>> So it won't prevent RCU callbacks from being invoked.
>
> That makes things clear! I guess we need a different approach.
>
> Qi, does the patch below resolve the regression for you?

> ---8<---
>
> From 5de4fa686fca15678a7e0a186852f921166854a3 Mon Sep 17 00:00:00 2001
> From: Rik van Riel <riel@xxxxxxxxxxx>
> Date: Wed, 29 Jan 2025 10:51:51 -0500
> Subject: [PATCH 2/2] mm,rcu: prevent RCU callbacks from running with pcp lock
>  held
>
> Enabling MMU_GATHER_RCU_TABLE_FREE can create contention on the
> zone->lock. This turns out to be because in some configurations
> RCU callbacks are called when IRQs are re-enabled inside
> rmqueue_bulk, while the CPU is still holding the per-cpu pages lock.
>
> That results in the RCU callbacks being unable to grab the
> PCP lock, and taking the slow path with the zone->lock for
> each item freed.
>
> Speed things up by blocking RCU callbacks while holding the
> PCP lock.
>
> Signed-off-by: Rik van Riel <riel@xxxxxxxxxxx>
> Suggested-by: Paul McKenney <paulmck@xxxxxxxxxx>
> Reported-by: Qi Zheng <zhengqi.arch@xxxxxxxxxxxxx>
> ---
>  mm/page_alloc.c | 10 +++++++---
>  1 file changed, 7 insertions(+), 3 deletions(-)
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 6e469c7ef9a4..73e334f403fd 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -94,11 +94,15 @@ static DEFINE_MUTEX(pcp_batch_high_lock);
>  #if defined(CONFIG_SMP) || defined(CONFIG_PREEMPT_RT)
>  /*
> - * On SMP, spin_trylock is sufficient protection.
> + * On SMP, spin_trylock is sufficient protection against recursion.
>   * On PREEMPT_RT, spin_trylock is equivalent on both SMP and UP.
> + *
> + * Block softirq execution to prevent RCU frees from running in softirq
> + * context while this CPU holds the PCP lock, which could result in a whole
> + * bunch of frees contending on the zone->lock.
>   */
> -#define pcp_trylock_prepare(flags) do { } while (0)
> -#define pcp_trylock_finish(flag) do { } while (0)
> +#define pcp_trylock_prepare(flags) local_bh_disable()
> +#define pcp_trylock_finish(flag) local_bh_enable()

I just tested this, and it doesn't seem to improve much:

root@debian:~# stress-ng --timeout 60 --times --verify --metrics --no-rand-seed --mmapaddr 64
stress-ng: info: [671] dispatching hogs: 64 mmapaddr
stress-ng: info: [671] successful run completed in 60.07s (1 min, 0.07 secs)
stress-ng: info: [671] stressor bogo ops real time usr time sys time bogo ops/s bogo ops/s
stress-ng: info: [671] (secs) (secs) (secs) (real time) (usr+sys time)
stress-ng: info: [671] mmapaddr 19803127 60.01 235.20 1146.76 330007.29 14329.74
stress-ng: info: [671] for a 60.07s run time:
stress-ng: info: [671] 1441.59s available CPU time
stress-ng: info: [671] 235.57s user time ( 16.34%)
stress-ng: info: [671] 1147.20s system time ( 79.58%)
stress-ng: info: [671] 1382.77s total time ( 95.92%)
stress-ng: info: [671] load average: 41.42 11.91 4.10

The _raw_spin_unlock_irqrestore hotspot still exists:

15.87% [kernel] [k] _raw_spin_unlock_irqrestore
9.18% [kernel] [k] clear_page_rep
7.03% [kernel] [k] do_syscall_64
3.67% [kernel] [k] _raw_spin_lock
3.28% [kernel] [k] __slab_free
2.03% [kernel] [k] rcu_cblist_dequeue
1.98% [kernel] [k] flush_tlb_mm_range
1.88% [kernel] [k] lruvec_stat_mod_folio.part.131
1.85% [kernel] [k] get_page_from_freelist
1.64% [kernel] [k] kmem_cache_alloc_noprof
1.61% [kernel] [k] tlb_remove_table_rcu
1.39% [kernel] [k] mtree_range_walk
1.36% [kernel] [k] __alloc_frozen_pages_noprof
1.27% [kernel] [k] pmd_install
1.24% [kernel] [k] memcpy_orig
1.23% [kernel] [k] __call_rcu_common.constprop.77
1.17% [kernel] [k] free_pgd_range
1.15% [kernel] [k] pte_alloc_one

The call stacks are as follows:

bpftrace -e 'k:_raw_spin_unlock_irqrestore {@[kstack,comm]=count();} interval:s:1 {exit();}'

@[
_raw_spin_unlock_irqrestore+5
hrtimer_interrupt+289
__sysvec_apic_timer_interrupt+85
sysvec_apic_timer_interrupt+108
asm_sysvec_apic_timer_interrupt+26
tlb_remove_table_rcu+48
rcu_do_batch+424
rcu_core+401
handle_softirqs+204
irq_exit_rcu+208
sysvec_apic_timer_interrupt+61
asm_sysvec_apic_timer_interrupt+26
, stress-ng-mmapa]: 8

tlb_remove_table_rcu() shows up only rarely here, so I guess the
PCP lists are basically empty at this point, resulting in call
stacks like the following:

@[
_raw_spin_unlock_irqrestore+5
__put_partials+218
kmem_cache_free+860
rcu_do_batch+424
rcu_core+401
handle_softirqs+204
do_softirq.part.23+59
__local_bh_enable_ip+91
get_page_from_freelist+399
__alloc_frozen_pages_noprof+364
alloc_pages_mpol+123
alloc_pages_noprof+14
get_free_pages_noprof+17
__x64_sys_mincore+141
do_syscall_64+98
entry_SYSCALL_64_after_hwframe+118
, stress-ng-mmapa]: 776
@[
_raw_spin_unlock_irqrestore+5
get_page_from_freelist+2044
__alloc_frozen_pages_noprof+364
alloc_pages_mpol+123
alloc_pages_noprof+14
pte_alloc_one+30
__pte_alloc+42
move_page_tables+2285
move_vma+472
__do_sys_mremap+1759
do_syscall_64+98
entry_SYSCALL_64_after_hwframe+118
, stress-ng-mmapa]: 1214
@[
_raw_spin_unlock_irqrestore+5
get_page_from_freelist+2044
__alloc_frozen_pages_noprof+364
alloc_pages_mpol+123
alloc_pages_noprof+14
get_free_pages_noprof+17
tlb_remove_table+82
free_pgd_range+655
free_pgtables+601
vms_clear_ptes.part.39+255
vms_complete_munmap_vmas+311
do_vmi_align_munmap+419
do_vmi_munmap+195
move_vma+802
__do_sys_mremap+1759
do_syscall_64+98
entry_SYSCALL_64_after_hwframe+118
, stress-ng-mmapa]: 1631
@[
_raw_spin_unlock_irqrestore+5
get_page_from_freelist+2044
__alloc_frozen_pages_noprof+364
alloc_pages_mpol+123
alloc_pages_noprof+14
get_free_pages_noprof+17
tlb_remove_table+82
free_pgd_range+655
free_pgtables+601
vms_clear_ptes.part.39+255
vms_complete_munmap_vmas+311
do_vmi_align_munmap+419
do_vmi_munmap+195
__vm_munmap+177
__x64_sys_munmap+27
do_syscall_64+98
entry_SYSCALL_64_after_hwframe+118
, stress-ng-mmapa]: 1672
@[
_raw_spin_unlock_irqrestore+5
get_page_from_freelist+2044
__alloc_frozen_pages_noprof+364
alloc_pages_mpol+123
alloc_pages_noprof+14
__pmd_alloc+52
__handle_mm_fault+1265
handle_mm_fault+195
__get_user_pages+690
populate_vma_page_range+127
__mm_populate+159
vm_mmap_pgoff+329
do_syscall_64+98
entry_SYSCALL_64_after_hwframe+118
, stress-ng-mmapa]: 2042
@[
_raw_spin_unlock_irqrestore+5
get_partial_node.part.102+378
___slab_alloc.part.103+1180
__slab_alloc.isra.104+34
kmem_cache_alloc_noprof+192
mas_alloc_nodes+358
mas_store_gfp+183
do_vmi_align_munmap+398
do_vmi_munmap+195
__vm_munmap+177
__x64_sys_munmap+27
do_syscall_64+98
entry_SYSCALL_64_after_hwframe+118
, stress-ng-mmapa]: 2219
@[
_raw_spin_unlock_irqrestore+5
get_page_from_freelist+2044
__alloc_frozen_pages_noprof+364
alloc_pages_mpol+123
alloc_pages_noprof+14
pte_alloc_one+30
__pte_alloc+42
do_pte_missing+2493
__handle_mm_fault+1914
handle_mm_fault+195
__get_user_pages+690
populate_vma_page_range+127
__mm_populate+159
vm_mmap_pgoff+329
do_syscall_64+98
entry_SYSCALL_64_after_hwframe+118
, stress-ng-mmapa]: 2657
@[
_raw_spin_unlock_irqrestore+5
get_page_from_freelist+2044
__alloc_frozen_pages_noprof+364
alloc_pages_mpol+123
alloc_pages_noprof+14
get_free_pages_noprof+17
__x64_sys_mincore+141
do_syscall_64+98
entry_SYSCALL_64_after_hwframe+118
, stress-ng-mmapa]: 5734

>  #else
>  /* UP spin_trylock always succeeds so disable IRQs to prevent re-entrancy. */