Re: [PATCH v3] mm: pcp: increase pcp->free_count threshold to trigger free_high
From: Huang, Ying
Date: Fri Apr 11 2025 - 02:16:04 EST
Raghavendra K T <raghavendra.kt@xxxxxxx> writes:
> On 4/11/2025 7:46 AM, Huang, Ying wrote:
>> Hi, Nikhil,
>> Sorry for late reply.
>> Nikhil Dhama <nikhil.dhama@xxxxxxx> writes:
>>
>>> In the old pcp design, pcp->free_factor got incremented in nr_pcp_free(),
>>> which is invoked by free_pcppages_bulk(). So free_factor was increased
>>> by 1 only when we tried to reduce the size of the pcp list or flush for
>>> high order, and free_high used to trigger only
>>> for order > 0 and order < costly_order and pcp->free_factor > 0.
>>>
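
(For reference, a much-simplified sketch of that old trigger, paraphrasing
the description above rather than quoting the literal mm/page_alloc.c code;
old_free_high() and costly_order are stand-in names here:)

    /* Old design (~v6.6), simplified paraphrase: pcp->free_factor was
     * bumped only when nr_pcp_free()/free_pcppages_bulk() actually
     * trimmed the pcp list, so it usually stayed 0 under mixed
     * alloc/free traffic. */
    static bool old_free_high(int free_factor, unsigned int order,
                              unsigned int costly_order)
    {
            return free_factor > 0 && order > 0 && order < costly_order;
    }
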
>>> For iperf3 I noticed that with the older design in kernel v6.6, the pcp
>>> list was drained mostly when pcp->count > high (more often when count went
>>> above 530), and most of the time pcp->free_factor was 0, triggering very
>>> few high-order flushes.
>>>
>>> But this changed in the current design, introduced in commit 6ccdcb6d3a74
>>> ("mm, pcp: reduce detecting time of consecutive high order page freeing"),
>>> where pcp->free_factor was changed to pcp->free_count to keep track of the
>>> number of pages freed contiguously. In this design, pcp->free_count is
>>> incremented on every deallocation, irrespective of whether the pcp list
>>> was reduced or not. The logic to trigger free_high is that pcp->free_count
>>> goes above batch (which is 63) and there are two contiguous page frees
>>> without any allocation in between.
>> The design changed because pcp->high can become much higher than it
>> could before. This made it much harder to trigger free_high, which
>> caused some performance regressions too.
>>
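
(Again paraphrasing the description above rather than the exact upstream
code; new_free_high() and prev_free_noalloc are stand-in names, the latter
for the "previous operation was also a page free, with no allocation in
between" state that the pcp tracks:)

    /* New design (post 6ccdcb6d3a74), simplified paraphrase: free_count
     * grows on every free, so free_high can fire as soon as it exceeds
     * batch (63 here) and frees arrive back to back without any
     * allocation in between. */
    static bool new_free_high(int free_count, int batch,
                              bool prev_free_noalloc)
    {
            return free_count > batch && prev_free_noalloc;
    }
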
>>> With this design, for iperf3, the pcp list is getting flushed more
>>> frequently because the free_high heuristic is triggered more often now. I
>>> observed that the high-order pcp list is drained as soon as both count and
>>> free_count go above 63.
>>>
>>> Due to this more aggressive high-order flushing, applications
>>> doing contiguous high-order allocations have to go to the global list
>>> more frequently.
>>>
>>> On a 2-node AMD machine with 384 vCPUs on each node,
>>> connected via Mellanox ConnectX-7, I am seeing a ~30% performance
>>> reduction when scaling the number of iperf3 client/server pairs from 32 to 64.
>>>
>>> Though this new design reduced the time to detect consecutive high-order
>>> page freeing, for applications which allocate high-order pages
>>> frequently it may flush the high-order list prematurely.
>>> This motivates tuning how late or early we should flush
>>> high-order lists.
>>>
>>> So, in this patch, we increase the pcp->free_count threshold for
>>> triggering free_high from "batch" to "batch + pcp->high_min / 2".
>>> This new threshold keeps high-order pages on the pcp list for a
>>> longer duration, which can help applications that do high-order
>>> allocations frequently.
>> IIUC, we restore the original behavior with "batch + pcp->high / 2",
>> as in my analysis in
>> https://lore.kernel.org/all/875xjmuiup.fsf@DESKTOP-5N7EMDA/
>> If you think my analysis is correct, can you add that to the patch
>> description too? That makes it easier for people to know why the code
>> looks this way.
>>
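
(Purely as an illustrative calculation of how much later the flush is
considered; the pcp->high_min value below is an assumed example, not a
measured one:)

    int batch      = 63;
    int high_min   = 512;                     /* assumed example value */
    int old_thresh = batch;                   /* free_high after  63 pages */
    int new_thresh = batch + high_min / 2;    /* free_high after 319 pages */

With those numbers, roughly 5x as many contiguously freed pages are needed
before a free_high flush is considered.
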
>
> Yes. This makes sense. Andrew has already included the patch in the mm tree.
>
> Nikhil,
>
> Could you please help with the updated write-up based on Ying's
> suggestion, assuming it works for Andrew?
Thanks!
Just send an updated version; Andrew will update the patch in the mm tree
unless it has already been merged into mm-stable.
---
Best Regards,
Huang, Ying