Re: [PATCH -V2] mm: pcp: scale batch to reduce number of high order pcp flushes on deallocation

From: Raghavendra K T
Date: Tue Mar 25 2025 - 04:00:25 EST


On 3/19/2025 1:44 PM, Nikhil Dhama wrote:
[...]
And, do you run network related workloads on one machine? If so, please
try to run them on two machines instead, with clients and servers run on
different machines. At least, please use different sockets for clients
and servers. Because larger pcp->free_count will make it easier to
trigger free_high heuristics. If that is the case, please try to
optimize free_high heuristics directly too.

I agree with Ying Huang: the above change is not the best possible fix for
the issue. On further analysis, I found that the root cause of the issue is
the frequent pcp high-order flushes. During a 20-second iperf3 run
I observed on average 5 pcp high-order flushes on kernel v6.6, whereas on
v6.7 I observed about 170 pcp high-order flushes.
Tracing pcp->free_count, I found that with patch v1 (the patch I suggested
earlier) free_count goes negative, which reduces the number of
times the free_high heuristic is triggered and hence reduces the high-order
flushes.

As Ying Huang suggested, increasing the batch size used by the free_high
heuristic helps performance. I tried different scaling factors to find the
most suitable batch value for the free_high heuristic:


                  score   # free_high
---------------   -----   -----------
v6.6 (base)       100     4
v6.12 (batch*1)   69      170
batch*2           69      150
batch*4           74      101
batch*5           100     53
batch*6           100     36
batch*8           100     3

Scaling batch for the free_high heuristic by a factor of 5 restores the
performance.
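
For readers following the thread, the scaled condition can be modelled outside
the kernel. The following is only a minimal userspace sketch, not the kernel
code: struct pcp_model, free_high_model() and the scale parameter are
hypothetical names, the flag values are illustrative, and the order check
(order && order <= PAGE_ALLOC_COSTLY_ORDER) and READ_ONCE() are left out.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative flag bits; the real definitions live in the kernel headers. */
#define PCPF_PREV_FREE_HIGH_ORDER (1u << 0)
#define PCPF_FREE_HIGH_BATCH      (1u << 1)

/* Hypothetical model of the per-cpu page list fields used by the check. */
struct pcp_model {
	int free_count;      /* pages freed recently on this CPU */
	int count;           /* pages currently held on the pcp lists */
	unsigned int flags;  /* PCPF_* bits */
};

/*
 * Model of the free_high condition with the first comparison scaled.
 * scale == 1 corresponds to the unpatched check, scale == 5 to patch v2.
 */
static bool free_high_model(const struct pcp_model *pcp, int batch, int scale)
{
	assert(scale > 0);

	return pcp->free_count >= batch * scale &&
	       (pcp->flags & PCPF_PREV_FREE_HIGH_ORDER) &&
	       (!(pcp->flags & PCPF_FREE_HIGH_BATCH) ||
		pcp->count >= batch);
}
```

With batch = 63 and free_count = 200, the model returns true for scale 1 and
false for scale 5, so a larger scale makes the heuristic fire less often. It
does not by itself explain why 5 is the right factor.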

Hello Nikhil,

Thanks for looking further into this. But from a design standpoint,
how scaling the batch by a factor of 5 helps here is not clear (Andrew's
original question).

In any case, can you post the patch set in a new email so that the patch
below is not lost in the discussion thread?


On an AMD 2-node machine, scores for other benchmarks with patch v2
are as follows:

                      iperf3   lmbench3    netperf              kbuild
                               (AF_UNIX)   (SCTP_STREAM_MANY)
                      ------   ---------   ------------------   ------
v6.6 (base)           100      100         100                  100
v6.12                 69       113         98.5                 98.8
v6.12 with patch v2   100      112.5       100.1                99.6

For network workloads, clients and server are running on different
machines connected via a Mellanox ConnectX-7 NIC.

number of free_high:
                      iperf3   lmbench3    netperf              kbuild
                               (AF_UNIX)   (SCTP_STREAM_MANY)
                      ------   ---------   ------------------   ------
v6.6 (base)           5        12          6                    2
v6.12                 170      11          92                   2
v6.12 with patch v2   58       11          34                   2


Signed-off-by: Nikhil Dhama <nikhil.dhama@xxxxxxx>
Cc: Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx>
Cc: Ying Huang <huang.ying.caritas@xxxxxxxxx>
Cc: linux-mm@xxxxxxxxx
Cc: linux-kernel@xxxxxxxxxxxxxxx
Cc: Bharata B Rao <bharata@xxxxxxx>
Cc: Raghavendra <raghavendra.kodsarathimmappa@xxxxxxx>
---
mm/page_alloc.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index b6958333054d..326d5fbae353 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2617,7 +2617,7 @@ static void free_unref_page_commit(struct zone *zone, struct per_cpu_pages *pcp,
 	 * stops will be drained from vmstat refresh context.
 	 */
 	if (order && order <= PAGE_ALLOC_COSTLY_ORDER) {
-		free_high = (pcp->free_count >= batch &&
+		free_high = (pcp->free_count >= (batch*5) &&
 			     (pcp->flags & PCPF_PREV_FREE_HIGH_ORDER) &&
 			     (!(pcp->flags & PCPF_FREE_HIGH_BATCH) ||
 			      pcp->count >= READ_ONCE(batch)));