Re: [linus:master] [x86] 4817f70c25: stress-ng.mmapaddr.ops_per_sec 63.0% regression
From: Rik van Riel
Date: Wed Jan 29 2025 - 10:59:47 EST
On Wed, 29 Jan 2025 16:14:01 +0800
Qi Zheng <zhengqi.arch@xxxxxxxxxxxxx> wrote:
>
> It seems that the pcp lock is held when tlb_remove_table_rcu() runs, so
> the trylock fails, the PCP is bypassed, and free_one_page() is called
> directly, which leads to the zone->lock hot spot.
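The free path in mm/page_alloc.c only ever trylocks the PCP lock, so an
RCU callback that interrupts a context already holding it has no choice
but to fall back to the zone->lock. Roughly paraphrasing the logic in
free_unref_page() (exact names and signatures vary by kernel version):

	/* Paraphrased sketch of the free path, not the literal code. */
	pcp_trylock_prepare(UP_flags);
	pcp = pcp_spin_trylock(zone->per_cpu_pageset);
	if (pcp) {
		/* Fast path: queue the page on the per-cpu list. */
		free_unref_page_commit(zone, pcp, page, migratetype, order);
		pcp_spin_unlock(pcp);
	} else {
		/*
		 * Trylock failed - here because the interrupted context
		 * already holds the pcp lock - so free directly under
		 * the zone->lock.
		 */
		free_one_page(zone, page, pfn, order, migratetype, FPI_NONE);
	}
	pcp_trylock_finish(UP_flags);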
Below is a tentative fix for the issue. It is kind of a big hammer,
and maybe the RCU people have a better idea on how to solve this
problem, but it may be worth giving this a try to see if it helps
with the regression you identified.
---8<---
From 2b0302f821d1fc94c968ac533dcc62b9ffe00c38 Mon Sep 17 00:00:00 2001
From: Rik van Riel <riel@xxxxxxxxxxx>
Date: Wed, 29 Jan 2025 10:51:51 -0500
Subject: [PATCH 2/2] mm,rcu: prevent RCU callbacks from running with pcp lock
held
Enabling MMU_GATHER_RCU_TABLE_FREE can create contention on the
zone->lock. This turns out to be because, in some configurations,
RCU callbacks are run when IRQs are re-enabled inside
rmqueue_bulk, while the CPU is still holding the per-cpu pages
(PCP) lock. As a result, the RCU callbacks are unable to grab the
PCP lock and take the slow path, acquiring the zone->lock for
each page freed.
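An approximate call chain for the bad case (exact frames depend on
the configuration) looks like this:

  rmqueue_pcplist                        <- takes the pcp lock
    __rmqueue_pcplist
      rmqueue_bulk
        spin_lock_irqsave(&zone->lock)
        ...
        spin_unlock_irqrestore           <- IRQs re-enabled, the pending
          <interrupt>                       RCU softirq runs
            rcu_do_batch
              tlb_remove_table_rcu
                ... page freeing path ...
                  pcp_spin_trylock       <- fails, pcp lock already held
                  free_one_page          <- takes the zone->lock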
Speed things up by blocking RCU callbacks while holding the
PCP lock.
Signed-off-by: Rik van Riel <riel@xxxxxxxxxxx>
Reported-by: Qi Zheng <zhengqi.arch@xxxxxxxxxxxxx>
---
mm/page_alloc.c | 8 ++++++++
1 file changed, 8 insertions(+)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 6e469c7ef9a4..b3c4002ab0ab 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3036,6 +3036,13 @@ static struct page *rmqueue_pcplist(struct zone *preferred_zone,
return NULL;
}
+ /*
+ * Prevent RCU callbacks from being run from the spin_unlock_irqrestore
+ * inside rmqueue_bulk, while the pcp lock is held; that would result
+ * in each RCU free taking the zone->lock, which can be very slow.
+ */
+ rcu_read_lock();
+
/*
* On allocation, reduce the number of pages that are batch freed.
* See nr_pcp_free() where free_factor is increased for subsequent
@@ -3046,6 +3053,7 @@ static struct page *rmqueue_pcplist(struct zone *preferred_zone,
page = __rmqueue_pcplist(zone, order, migratetype, alloc_flags, pcp, list);
pcp_spin_unlock(pcp);
pcp_trylock_finish(UP_flags);
+ rcu_read_unlock();
if (page) {
__count_zid_vm_events(PGALLOC, page_zonenum(page), 1 << order);
zone_statistics(preferred_zone, zone, 1);
--
2.47.1