Re: [PATCH] mm/page_alloc: Occasionally relinquish zone lock in batch freeing

From: Shakeel Butt
Date: Tue Aug 19 2025 - 13:16:22 EST


On Tue, Aug 19, 2025 at 10:15:13AM +0100, Kiryl Shutsemau wrote:
> On Mon, Aug 18, 2025 at 11:58:03AM -0700, Joshua Hahn wrote:
> > While testing workloads with high sustained memory pressure on large machines
> > (1TB memory, 316 CPUs), we saw an unexpectedly high number of softlockups.
> > Further investigation showed that the lock in free_pcppages_bulk was being held
> > for a long time, even being held while 2k+ pages were being freed.
> >
> > Instead of holding the lock for the entirety of the freeing, check to see if
> > the zone lock is contended every pcp->batch pages. If there is contention,
> > relinquish the lock so that other processors have a chance to grab the lock
> > and perform critical work.
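
To make the behaviour described above concrete, here is a minimal userspace sketch of the batching idea. It is not kernel code and not the patch itself. Everything named toy_* is hypothetical: the `contended` flag stands in for spin_is_contended(&zone->lock), the `locked` flag stands in for the zone lock, and every page is assumed to be order-0, so a batch never overshoots.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical userspace stand-in for a zone; none of this is kernel API. */
struct toy_zone {
	bool locked;      /* models zone->lock being held */
	bool contended;   /* models spin_is_contended(&zone->lock) */
	int freed;        /* pages "freed" so far */
	int lock_breaks;  /* times the lock was dropped mid-drain */
};

static void toy_lock(struct toy_zone *z)
{
	assert(!z->locked);	/* catch double acquisition in the model */
	z->locked = true;
}

static void toy_unlock(struct toy_zone *z)
{
	assert(z->locked);	/* catch unlock without a matching lock */
	z->locked = false;
}

/*
 * Free `count` pages, checking for contention after every `batch` pages.
 * Returns how many times the lock was dropped and re-taken in this call.
 */
static int toy_free_bulk(struct toy_zone *z, int count, int batch)
{
	int breaks = 0;

	if (count <= 0 || batch <= 0)
		return 0;

	toy_lock(z);
	while (count > 0) {
		int n = count < batch ? count : batch;

		z->freed += n;	/* stand-in for the per-page freeing loop */
		count -= n;

		/* Only break the lock if work remains and someone is waiting. */
		if (count > 0 && z->contended) {
			toy_unlock(z);
			breaks++;
			toy_lock(z);
		}
	}
	toy_unlock(z);

	z->lock_breaks += breaks;
	return breaks;
}
```

In this model, draining 2048 pages with a batch of 63 while `contended` is set drops the lock 32 times (33 batches, no break after the last one); with `contended` clear the lock is held for the whole drain, which is the pre-patch behaviour.
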
>
> Hm. It doesn't necessarily have to be contention on the lock; it may just be
> that you are holding the lock for too long, so the CPU is not available to
> the scheduler.
>
> > In our fleet, we have seen that performing batched lock freeing has led to
> > significantly lower rates of softlockups, while incurring relatively small
> > regressions (relative to the workload and relative to the variation).
> >
> > The following are a few synthetic benchmarks:
> >
> > Test 1: Small machine (30G RAM, 36 CPUs)
> >
> > stress-ng --vm 30 --vm-bytes 1G -M -t 100
> > +----------------------+---------------+-----------+
> > | Metric               | Variation (%) | Delta (%) |
> > +----------------------+---------------+-----------+
> > | bogo ops             |        0.0076 |   -0.0183 |
> > | bogo ops/s (real)    |        0.0064 |   -0.0207 |
> > | bogo ops/s (usr+sys) |        0.3151 |   +0.4141 |
> > +----------------------+---------------+-----------+
> >
> > stress-ng --vm 20 --vm-bytes 3G -M -t 100
> > +----------------------+---------------+-----------+
> > | Metric               | Variation (%) | Delta (%) |
> > +----------------------+---------------+-----------+
> > | bogo ops             |        0.0295 |   -0.0078 |
> > | bogo ops/s (real)    |        0.0267 |   -0.0177 |
> > | bogo ops/s (usr+sys) |        1.7079 |   -0.0096 |
> > +----------------------+---------------+-----------+
> >
> > Test 2: Big machine (250G RAM, 176 CPUs)
> >
> > stress-ng --vm 50 --vm-bytes 5G -M -t 100
> > +----------------------+---------------+-----------+
> > | Metric               | Variation (%) | Delta (%) |
> > +----------------------+---------------+-----------+
> > | bogo ops             |        0.0362 |   -0.0187 |
> > | bogo ops/s (real)    |        0.0391 |   -0.0220 |
> > | bogo ops/s (usr+sys) |        2.9603 |   +1.3758 |
> > +----------------------+---------------+-----------+
> >
> > stress-ng --vm 10 --vm-bytes 30G -M -t 100
> > +----------------------+---------------+-----------+
> > | Metric               | Variation (%) | Delta (%) |
> > +----------------------+---------------+-----------+
> > | bogo ops             |        2.3130 |   -0.0754 |
> > | bogo ops/s (real)    |        3.3069 |   -0.8579 |
> > | bogo ops/s (usr+sys) |        4.0369 |   -1.1985 |
> > +----------------------+---------------+-----------+
> >
> > Suggested-by: Chris Mason <clm@xxxxxx>
> > Co-developed-by: Johannes Weiner <hannes@xxxxxxxxxxx>
> > Signed-off-by: Joshua Hahn <joshua.hahnjy@xxxxxxxxx>
> >
> > ---
> > mm/page_alloc.c | 15 ++++++++++++++-
> > 1 file changed, 14 insertions(+), 1 deletion(-)
> >
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index a8a84c3b5fe5..bd7a8da3e159 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -1238,6 +1238,8 @@ static void free_pcppages_bulk(struct zone *zone, int count,
> > * below while (list_empty(list)) loop.
> > */
> > count = min(pcp->count, count);
> > + if (!count)
> > + return;
> >
> > /* Ensure requested pindex is drained first. */
> > pindex = pindex - 1;
> > @@ -1247,6 +1249,7 @@ static void free_pcppages_bulk(struct zone *zone, int count,
> > while (count > 0) {
> > struct list_head *list;
> > int nr_pages;
> > + int batch = min(count, pcp->batch);
> >
> > /* Remove pages from lists in a round-robin fashion. */
> > do {
> > @@ -1267,12 +1270,22 @@ static void free_pcppages_bulk(struct zone *zone, int count,
> >
> > /* must delete to avoid corrupting pcp list */
> > list_del(&page->pcp_list);
> > + batch -= nr_pages;
> > count -= nr_pages;
> > pcp->count -= nr_pages;
> >
> > __free_one_page(page, pfn, zone, order, mt, FPI_NONE);
> > trace_mm_page_pcpu_drain(page, order, mt);
> > - } while (count > 0 && !list_empty(list));
> > + } while (batch > 0 && !list_empty(list));
> > +
> > + /*
> > + * Prevent starving the lock for other users; every pcp->batch
> > + * pages freed, relinquish the zone lock if it is contended.
> > + */
> > + if (count && spin_is_contended(&zone->lock)) {
>
> I would rather drop the count thing and do something like this:
>
> if (need_resched() || spin_needbreak(&zone->lock)) {
> spin_unlock_irqrestore(&zone->lock, flags);
> cond_resched();

Can this function be called from non-sleepable context?

> spin_lock_irqsave(&zone->lock, flags);
> }
>
> > + spin_unlock_irqrestore(&zone->lock, flags);
> > + spin_lock_irqsave(&zone->lock, flags);
> > + }
> > }
> >
> > spin_unlock_irqrestore(&zone->lock, flags);
> >
> > base-commit: 137a6423b60fe0785aada403679d3b086bb83062
> > --
> > 2.47.3
>
> --
> Kiryl Shutsemau / Kirill A. Shutemov