Re: [PATCH v3] mm: pcp: increase pcp->free_count threshold to trigger free_high

From: Huang, Ying
Date: Thu Apr 10 2025 - 22:17:13 EST


Hi, Nikhil,

Sorry for the late reply.

Nikhil Dhama <nikhil.dhama@xxxxxxx> writes:

> In the old pcp design, pcp->free_factor was incremented in nr_pcp_free(),
> which is invoked by free_pcppages_bulk(). So free_factor was increased
> by 1 only when we tried to reduce the size of the pcp list or flush for
> high order, and free_high used to trigger only for order > 0,
> order <= costly_order and pcp->free_factor > 0.
>
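
To make the old rule concrete for readers, here is a minimal model of the
v6.6-style condition as described above (a simplified userspace sketch, not
the actual kernel code; the struct is reduced to the single field discussed
here):

    #include <stdbool.h>

    #define PAGE_ALLOC_COSTLY_ORDER 3  /* same value as in the kernel */

    /* Reduced model of struct per_cpu_pages: only the field discussed here. */
    struct pcp_model {
            int free_factor;  /* bumped only when the pcp list is trimmed */
    };

    /* v6.6-style trigger: needs a prior list trim and a cheap, non-zero order. */
    static bool old_free_high(const struct pcp_model *pcp, unsigned int order)
    {
            return pcp->free_factor > 0 && order > 0 &&
                   order <= PAGE_ALLOC_COSTLY_ORDER;
    }
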
> For iperf3 I noticed that with the older design in kernel v6.6, the pcp
> list was drained mostly when pcp->count > high (more often when count
> went above 530), and most of the time pcp->free_factor was 0, triggering
> very few high order flushes.
>
> But this changed in the current design, introduced in commit 6ccdcb6d3a74
> ("mm, pcp: reduce detecting time of consecutive high order page freeing"),
> where pcp->free_factor was replaced by pcp->free_count to keep track of
> the number of pages freed contiguously. In this design, pcp->free_count is
> incremented on every deallocation, irrespective of whether the pcp list
> was reduced or not, and free_high now triggers once pcp->free_count goes
> above batch (which is 63) and there are two consecutive page frees without
> any allocation in between.

The design changed because pcp->high can become much higher than it could
before. That made it much harder to trigger free_high, which caused some
performance regressions too.
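
For comparison, the condition this patch modifies (introduced by commit
6ccdcb6d3a74) can be modelled in the same simplified userspace form; the
flag bit values below are placeholders, and PCPF_FREE_HIGH_BATCH is a
zone-dependent tuning flag (see the kernel source for its exact meaning):

    #include <stdbool.h>

    #define PCPF_PREV_FREE_HIGH_ORDER  0x1  /* previous free was also high order */
    #define PCPF_FREE_HIGH_BATCH       0x2  /* placeholder bit for this model */

    /* Reduced model of struct per_cpu_pages: only the fields discussed here. */
    struct pcp_model {
            int count;        /* pages currently on the pcp lists */
            int free_count;   /* pages freed without an intervening allocation */
            unsigned int flags;
    };

    /* Current trigger (before this patch), mirroring the '-' line in the diff below. */
    static bool cur_free_high(const struct pcp_model *pcp, int batch)
    {
            return pcp->free_count >= batch &&
                   (pcp->flags & PCPF_PREV_FREE_HIGH_ORDER) &&
                   (!(pcp->flags & PCPF_FREE_HIGH_BATCH) ||
                    pcp->count >= batch);
    }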

> With this design, for iperf3, the pcp list is flushed more frequently
> because the free_high heuristic is triggered more often now. I observed
> that the high order pcp list is drained as soon as both count and
> free_count go above 63.
>
> Due to this more aggressive high order flushing, applications doing
> contiguous high order allocations have to go to the global list more
> frequently.
>
> On a 2-node AMD machine with 384 vCPUs on each node, connected via
> Mellanox ConnectX-7, I am seeing a ~30% performance reduction when
> scaling the number of iperf3 client/server pairs from 32 to 64.
>
> Though this new design reduced the time to detect consecutive high order
> frees, for applications which allocate high order pages more frequently
> it may flush the high order list prematurely. This motivates tuning how
> late or early we should flush high order lists.
>
> So, in this patch, we increase the pcp->free_count threshold that
> triggers free_high from "batch" to "batch + pcp->high_min / 2". This new
> threshold keeps high order pages on the pcp list for a longer duration,
> which can help applications that do high order allocations frequently.

IIUC, we restore the original behavior with "batch + pcp->high / 2" as
in my analysis in

https://lore.kernel.org/all/875xjmuiup.fsf@DESKTOP-5N7EMDA/

If you think my analysis is correct, can you add that to the patch
description too? That would make it easier for people to understand why
the code looks this way.
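
Just to illustrate the arithmetic with example numbers (batch = 63 as quoted
above; the high_min value below is made up, the real one depends on zone size
and tuning):

    #include <stdio.h>

    int main(void)
    {
            int batch = 63;
            int high_min = 512;  /* illustrative value only */

            int old_threshold = batch;                 /* before this patch */
            int new_threshold = batch + high_min / 2;  /* with this patch */

            printf("free_high needs free_count >= %d (before) vs >= %d (after)\n",
                   old_threshold, new_threshold);
            return 0;
    }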

> With this patch, iperf3 performance is restored, and the scores for
> other benchmarks on the same machine are as follows:
>
>                        iperf3   lmbench3        netperf        kbuild
>                                 (AF_UNIX)  (SCTP_STREAM_MANY)
>                       -------   ---------  -----------------  ------
> v6.6 vanilla (base)     100        100            100           100
> v6.12 vanilla            69        113             98.5          98.8
> v6.12 + this patch      100        110.3          100.2          99.3
>
>
> netperf-tcp:
>
>                         6.12                  6.12
>                       vanilla            this_patch
> Hmean     64        732.14 (  0.00%)     730.45 ( -0.23%)
> Hmean    128       1417.46 (  0.00%)    1419.44 (  0.14%)
> Hmean    256       2679.67 (  0.00%)    2676.45 ( -0.12%)
> Hmean   1024       8328.52 (  0.00%)    8339.34 (  0.13%)
> Hmean   2048      12716.98 (  0.00%)   12743.68 (  0.21%)
> Hmean   3312      15787.79 (  0.00%)   15887.25 (  0.63%)
> Hmean   4096      17311.91 (  0.00%)   17332.68 (  0.12%)
> Hmean   8192      20310.73 (  0.00%)   20465.09 (  0.76%)
>
> Fixes: 6ccdcb6d3a74 ("mm, pcp: reduce detecting time of consecutive high order page freeing")
>
> Signed-off-by: Nikhil Dhama <nikhil.dhama@xxxxxxx>
> Suggested-by: Huang Ying <ying.huang@xxxxxxxxxxxxxxxxx>
> Cc: Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx>
> Cc: Huang Ying <huang.ying.caritas@xxxxxxxxx>
> Cc: linux-mm@xxxxxxxxx
> Cc: linux-kernel@xxxxxxxxxxxxxxx
> Cc: Mel Gorman <mgorman@xxxxxxxxxxxxxxxxxxx>
>
> ---
> v1: https://lore.kernel.org/linux-mm/20250107091724.35287-1-nikhil.dhama@xxxxxxx/
> v2: https://lore.kernel.org/linux-mm/20250325171915.14384-1-nikhil.dhama@xxxxxxx/
>
> mm/page_alloc.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index b6958333054d..569dcf1f731f 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -2617,7 +2617,7 @@ static void free_unref_page_commit(struct zone *zone, struct per_cpu_pages *pcp,
>           * stops will be drained from vmstat refresh context.
>           */
>          if (order && order <= PAGE_ALLOC_COSTLY_ORDER) {
> -                free_high = (pcp->free_count >= batch &&
> +                free_high = (pcp->free_count >= (batch + pcp->high_min / 2) &&
>                               (pcp->flags & PCPF_PREV_FREE_HIGH_ORDER) &&
>                               (!(pcp->flags & PCPF_FREE_HIGH_BATCH) ||
>                                pcp->count >= READ_ONCE(batch)));

---
Best Regards,
Huang, Ying