Re: [linus:master] [x86] 4817f70c25: stress-ng.mmapaddr.ops_per_sec 63.0% regression

From: Qi Zheng
Date: Wed Jan 29 2025 - 12:55:36 EST




On 2025/1/30 01:33, Qi Zheng wrote:


On 2025/1/30 00:53, Rik van Riel wrote:
On Wed, 29 Jan 2025 08:36:12 -0800
"Paul E. McKenney" <paulmck@xxxxxxxxxx> wrote:
On Wed, Jan 29, 2025 at 11:14:29AM -0500, Rik van Riel wrote:

Paul, does this look like it could do the trick,
or do we need something else to make RCU freeing
happy again?

I don't claim to fully understand the issue, but this would prevent
any RCU grace periods starting subsequently from completing.  It would
not prevent RCU callbacks from being invoked for RCU grace periods that
started earlier.

So it won't prevent RCU callbacks from being invoked.

That makes things clear! I guess we need a different approach.

Qi, does the patch below resolve the regression for you?

---8<---

 From 5de4fa686fca15678a7e0a186852f921166854a3 Mon Sep 17 00:00:00 2001
From: Rik van Riel <riel@xxxxxxxxxxx>
Date: Wed, 29 Jan 2025 10:51:51 -0500
Subject: [PATCH 2/2] mm,rcu: prevent RCU callbacks from running with pcp lock
  held

Enabling MMU_GATHER_RCU_TABLE_FREE can create contention on the
zone->lock.  This turns out to be because in some configurations
RCU callbacks are called when IRQs are re-enabled inside
rmqueue_bulk, while the CPU is still holding the per-cpu pages lock.

That results in the RCU callbacks being unable to grab the
PCP lock, and taking the slow path with the zone->lock for
each item freed.

Speed things up by blocking RCU callbacks while holding the
PCP lock.

Signed-off-by: Rik van Riel <riel@xxxxxxxxxxx>
Suggested-by: Paul McKenney <paulmck@xxxxxxxxxx>
Reported-by: Qi Zheng <zhengqi.arch@xxxxxxxxxxxxx>
---
  mm/page_alloc.c | 10 +++++++---
  1 file changed, 7 insertions(+), 3 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 6e469c7ef9a4..73e334f403fd 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -94,11 +94,15 @@ static DEFINE_MUTEX(pcp_batch_high_lock);
 #if defined(CONFIG_SMP) || defined(CONFIG_PREEMPT_RT)
 /*
- * On SMP, spin_trylock is sufficient protection.
+ * On SMP, spin_trylock is sufficient protection against recursion.
  * On PREEMPT_RT, spin_trylock is equivalent on both SMP and UP.
+ *
+ * Block softirq execution to prevent RCU frees from running in softirq
+ * context while this CPU holds the PCP lock, which could result in a whole
+ * bunch of frees contending on the zone->lock.
  */
-#define pcp_trylock_prepare(flags)	do { } while (0)
-#define pcp_trylock_finish(flag)	do { } while (0)
+#define pcp_trylock_prepare(flags)	local_bh_disable()
+#define pcp_trylock_finish(flag)	local_bh_enable()
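
For reference, the way these macros are used at the PCP call sites in
mm/page_alloc.c is roughly the following (a simplified sketch from memory,
not verbatim upstream code; helper names such as free_unref_page_commit()
and free_one_page() may differ between kernel versions). With this patch,
pcp_trylock_prepare() becomes local_bh_disable(), so the RCU softirq cannot
run its callbacks on this CPU anywhere between prepare and finish:

	unsigned long __maybe_unused UP_flags;
	struct per_cpu_pages *pcp;

	pcp_trylock_prepare(UP_flags);		/* with this patch: local_bh_disable() */
	pcp = pcp_spin_trylock(zone->per_cpu_pageset);
	if (pcp) {
		/* Fast path: put the page on the per-cpu list. */
		free_unref_page_commit(zone, pcp, page, migratetype, order);
		pcp_spin_unlock(pcp);
	} else {
		/* Trylock failed: slow path, which takes zone->lock. */
		free_one_page(zone, page, pfn, order, FPI_NONE);
	}
	pcp_trylock_finish(UP_flags);		/* with this patch: local_bh_enable() */

The idea is that an RCU free running in softirq can then never land on a
CPU that is currently holding its PCP lock, so it should not be forced
onto the zone->lock slow path.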

I just tested this, and it doesn't seem to improve much:

root@debian:~# stress-ng --timeout 60 --times --verify --metrics --no-rand-seed --mmapaddr 64
stress-ng: info:  [671] dispatching hogs: 64 mmapaddr
stress-ng: info:  [671] successful run completed in 60.07s (1 min, 0.07 secs)
stress-ng: info:  [671] stressor       bogo ops real time  usr time  sys time   bogo ops/s   bogo ops/s
stress-ng: info:  [671]                           (secs)    (secs)    (secs)   (real time) (usr+sys time)
stress-ng: info:  [671] mmapaddr       19803127     60.01    235.20   1146.76    330007.29     14329.74
stress-ng: info:  [671] for a 60.07s run time:
stress-ng: info:  [671]    1441.59s available CPU time
stress-ng: info:  [671]     235.57s user time   ( 16.34%)
stress-ng: info:  [671]    1147.20s system time ( 79.58%)
stress-ng: info:  [671]    1382.77s total time  ( 95.92%)
stress-ng: info:  [671] load average: 41.42 11.91 4.10

The _raw_spin_unlock_irqrestore hotspot still exists:

  15.87%  [kernel]  [k] _raw_spin_unlock_irqrestore
   9.18%  [kernel]  [k] clear_page_rep
   7.03%  [kernel]  [k] do_syscall_64
   3.67%  [kernel]  [k] _raw_spin_lock
   3.28%  [kernel]  [k] __slab_free
   2.03%  [kernel]  [k] rcu_cblist_dequeue
   1.98%  [kernel]  [k] flush_tlb_mm_range
   1.88%  [kernel]  [k] lruvec_stat_mod_folio.part.131
   1.85%  [kernel]  [k] get_page_from_freelist
   1.64%  [kernel]  [k] kmem_cache_alloc_noprof
   1.61%  [kernel]  [k] tlb_remove_table_rcu
   1.39%  [kernel]  [k] mtree_range_walk
   1.36%  [kernel]  [k] __alloc_frozen_pages_noprof
   1.27%  [kernel]  [k] pmd_install
   1.24%  [kernel]  [k] memcpy_orig
   1.23%  [kernel]  [k] __call_rcu_common.constprop.77
   1.17%  [kernel]  [k] free_pgd_range
   1.15%  [kernel]  [k] pte_alloc_one

The call stack is as follows:

bpftrace -e 'k:_raw_spin_unlock_irqrestore {@[kstack,comm]=count();} interval:s:1 {exit();}'

@[
    _raw_spin_unlock_irqrestore+5
    hrtimer_interrupt+289
    __sysvec_apic_timer_interrupt+85
    sysvec_apic_timer_interrupt+108
    asm_sysvec_apic_timer_interrupt+26
    tlb_remove_table_rcu+48
    rcu_do_batch+424
    rcu_core+401
    handle_softirqs+204
    irq_exit_rcu+208
    sysvec_apic_timer_interrupt+61
    asm_sysvec_apic_timer_interrupt+26
, stress-ng-mmapa]: 8

tlb_remove_table_rcu() is called very rarely, so I guess the PCP cache
is basically empty at this time, resulting in the following call stack:


But I think this may just be an extreme test scenario, since my test
machine has no other load at the moment. Under a normal workload, page
table pages should only occupy a small part of the PCP cache, so the
delayed freeing (sketched below) should not have much impact on it.
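
For reference, by delayed freeing I mean the RCU-deferred path in
mm/mmu_gather.c, which from memory looks roughly like the following
(a paraphrase, not a verbatim copy of the upstream code): the page table
pages are collected into an mmu_table_batch, handed to call_rcu(), and
only freed later from RCU callback (softirq) context:

/* Rough paraphrase of the CONFIG_MMU_GATHER_RCU_TABLE_FREE free path. */
static void __tlb_remove_table_free(struct mmu_table_batch *batch)
{
	int i;

	/* Each batched page table page is freed from softirq context. */
	for (i = 0; i < batch->nr; i++)
		__tlb_remove_table(batch->tables[i]);

	free_page((unsigned long)batch);
}

static void tlb_remove_table_rcu(struct rcu_head *head)
{
	__tlb_remove_table_free(container_of(head, struct mmu_table_batch, rcu));
}

So each page table page still goes through the normal free path, just
later and from softirq context, which should only hurt when the PCP
fast path happens to be unavailable at that moment.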

Thanks!