Nikhil Dhama <nikhil.dhama@xxxxxxx> writes:

In the old pcp design, pcp->free_factor got incremented in nr_pcp_free(),
which is invoked by free_pcppages_bulk(). So free_factor increased by 1
only when we tried to reduce the size of the pcp list or flush for high
order, and free_high used to trigger only for order > 0, order <
costly_order, and free_factor > 0. free_factor was also scaled down by a
factor of 2 on every successful allocation.
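To make this concrete, here is a minimal user-space model of the old
heuristic as just described. It is a sketch only: struct pcp_model, the
helper names, and COSTLY_ORDER (a stand-in for PAGE_ALLOC_COSTLY_ORDER)
are illustrative, not the real mm/page_alloc.c code.

  #include <stdbool.h>
  #include <stdio.h>

  #define COSTLY_ORDER 3  /* stand-in for PAGE_ALLOC_COSTLY_ORDER */

  struct pcp_model {
          int high;         /* flush watermark; ~530 in the iperf3 run */
          int free_factor;  /* bumped only when the list is trimmed */
  };

  /* Old design: free_factor grew by 1 only in the bulk-free (trim) path. */
  static void on_bulk_free(struct pcp_model *pcp)
  {
          pcp->free_factor++;
  }

  /* free_high fired only for 0 < order < COSTLY_ORDER and free_factor > 0. */
  static bool free_high_triggers(struct pcp_model *pcp, int order)
  {
          return order > 0 && order < COSTLY_ORDER && pcp->free_factor > 0;
  }

  /* Every successful allocation scaled free_factor down by 2. */
  static void on_alloc(struct pcp_model *pcp)
  {
          pcp->free_factor >>= 1;
  }

  int main(void)
  {
          struct pcp_model pcp = { .high = 530, .free_factor = 0 };

          /* Mixed alloc/free traffic keeps free_factor near 0, so
           * order-1 frees rarely trip free_high. */
          printf("%d\n", free_high_triggers(&pcp, 1)); /* 0 */
          on_bulk_free(&pcp);
          printf("%d\n", free_high_triggers(&pcp, 1)); /* 1 */
          on_alloc(&pcp);
          printf("%d\n", free_high_triggers(&pcp, 1)); /* 0 */
          return 0;
  }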
For iperf3, I noticed that with the older design in kernel v6.6 the pcp
list was drained mostly when pcp->count > high (typically when count went
above 530), and most of the time free_factor was 0, triggering very few
high-order flushes.
In the current design, free_factor has been replaced by free_count, which
tracks the number of pages freed contiguously, and with this design the
iperf3 pcp list is flushed more frequently because the free_high
heuristic triggers more often. free_count is incremented on every
deallocation, irrespective of whether the pcp list was reduced or not,
and free_high triggers when free_count goes above batch (which is 63) and
there are two contiguous frees without any allocation in between (plus
the cache-slice optimisation).
With this design, I observed that the high-order pcp list is drained as
soon as both count and free_count go above 63, and due to this more
aggressive high-order flushing, applications doing contiguous high-order
allocations have to go to the global list more frequently.
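For comparison, a similar standalone model of the described free_count
behavior. Again a sketch of the description above, with illustrative
names: the real code also has flag handling and the cache-slice
optimisation, and it decays free_count on allocation rather than just
resetting a flag as modelled here.

  #include <stdbool.h>
  #include <stdio.h>

  struct pcp_model {
          int count;            /* pages on the pcp list */
          int batch;            /* 63 in the reported configuration */
          int free_count;       /* pages freed contiguously */
          bool prev_free_high;  /* previous op was a high-order free */
  };

  /* New design: free_count grows on *every* free, trimmed list or not. */
  static bool on_free(struct pcp_model *pcp, int order)
  {
          bool free_high = order > 0 && pcp->free_count > pcp->batch &&
                           pcp->prev_free_high;

          pcp->count += 1 << order;
          pcp->free_count += 1 << order;
          pcp->prev_free_high = order > 0;
          return free_high;
  }

  int main(void)
  {
          struct pcp_model pcp = { .batch = 63 };
          int i;

          /* Back-to-back order-1 frees trip free_high as soon as
           * free_count passes batch: on the 33rd free here. */
          for (i = 1; i <= 64; i++) {
                  if (on_free(&pcp, 1)) {
                          printf("free_high trips at free #%d, count=%d\n",
                                 i, pcp.count);
                          break;
                  }
          }
          return 0;
  }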
On a 2-node AMD machine with 384 vCPUs on each node, connected via
Mellanox ConnectX-7, I am seeing a ~30% performance reduction when
scaling the number of iperf3 client/server pairs from 32 to 64.
So, although this new design reduces the time to detect contiguous
high-order freeing, for applications that allocate high-order pages
frequently it may flush the high-order list prematurely. This motivates
tuning how early or late we should flush high-order lists for the
free_high heuristic. I tried scaling batch, which delays the free_high
flushes:
                  score   # free_high
---------------   -----   -----------
v6.6 (base)         100             4
v6.12 (batch*1)      69           170
batch*2              69           150
batch*4              74           101
batch*5             100            53
batch*6             100            36
batch*8             100             3
Scaling batch for the free_high heuristic by a factor of 5 or above
restores the performance, as it reduces the number of high-order flushes.
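In terms of the model above, this tuning only moves the trigger
threshold; a hypothetical illustration (not the actual patch):

  /* Variant of on_free() above with the batch*5 tuning: only the
   * threshold in the trigger check changes. */
  static bool on_free_scaled(struct pcp_model *pcp, int order)
  {
          int threshold = 5 * pcp->batch;  /* was: pcp->batch */
          bool free_high = order > 0 && pcp->free_count > threshold &&
                           pcp->prev_free_high;

          pcp->count += 1 << order;
          pcp->free_count += 1 << order;
          pcp->prev_free_high = order > 0;
          return free_high;
  }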
On the 2-node AMD server with 384 vCPUs on each node, scores for other
benchmarks with patch v2, along with iperf3, are as follows:
Huang, Ying wrote:

Hi, Nikhil,

Em..., IIUC, this may disable the free_high optimizations. The free_high
optimization was introduced by Mel Gorman in commit f26b3fa04611
("mm/page_alloc: limit number of high-order pages on PCP during bulk
free"). So, this may trigger regressions for the workloads in that
commit. Can you try them too?
Nikhil Dhama <nikhil.dhama@xxxxxxx> writes:

Hi, I ran netperf-tcp as in commit f26b3fa04611 ("mm/page_alloc: limit
number of high-order pages on PCP during bulk free").
On a 2-node AMD server with 384 vCPUs, the results I observed are as
follows:
                           6.12                     6.12
                        vanilla   freehigh-heuristicsopt
Hmean 64        732.14 (  0.00%)       736.90 (  0.65%)
Hmean 128      1417.46 (  0.00%)      1421.54 (  0.29%)
Hmean 256      2679.67 (  0.00%)      2689.68 (  0.37%)
Hmean 1024     8328.52 (  0.00%)      8413.94 (  1.03%)
Hmean 2048    12716.98 (  0.00%)     12838.94 (  0.96%)
Hmean 3312    15787.79 (  0.00%)     15822.40 (  0.22%)
Hmean 4096    17311.91 (  0.00%)     17328.74 (  0.10%)
Hmean 8192    20310.73 (  0.00%)     20447.12 (  0.67%)
It is not regressing for netperf-tcp.
On 3/30/2025 12:22 PM, Huang, Ying wrote:

Thanks a lot for your data!

Thinking about this again: compared with the pcp->free_factor solution,
the pcp->free_count solution triggers the free_high heuristics earlier,
and this causes the performance regression in your workloads. So, it's
reasonable to raise the bar to trigger free_high, and it's also
reasonable to use a stricter threshold, as you have done in this patch.
However, "5 * batch" appears too magic and adapted to one type of
machine.
Let's step back and do some analysis. In the original pcp->free_factor
solution, free_high is triggered for contiguous freeing with size ranging
from "batch" to "pcp->high + batch", so the average trigger size is about
"batch + pcp->high / 2". In the pcp->free_count solution, by contrast,
free_high is triggered for contiguous freeing of size "batch". So, to
restore the original behavior, it seems that we can use the threshold
"batch + pcp->high_min / 2". Do you think that this is reasonable? If
so, can you give it a try?
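As a rough sanity check of why batch*5 worked empirically: plugging in
batch = 63 and, as an assumption, the ~530 pcp->high observed for the
iperf3 run above as a stand-in for pcp->high_min, the proposed threshold
lands close to 5x batch:

  #include <stdio.h>

  int main(void)
  {
          int batch = 63;       /* from the report above */
          int high_min = 530;   /* assumption: approx. the pcp->high
                                 * seen in the iperf3 run */
          int threshold = batch + high_min / 2;

          printf("threshold = %d (~%.1fx batch)\n",
                 threshold, (double)threshold / batch);
          /* Prints: threshold = 328 (~5.2x batch), in line with the
           * batch*5 scaling that restored performance above. */
          return 0;
  }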
Nikhil Dhama <nikhil.dhama@xxxxxxx> writes:

Hi,

I have tried your suggestion, setting the threshold to
"batch + pcp->high_min / 2". Scores for different benchmarks on the same
machine (2-node AMD server with 384 vCPUs on each node) are as follows:
                     iperf3   lmbench3             netperf   kbuild
                             (AF_UNIX)  (SCTP_STREAM_MANY)
                     ------  ---------  ------------------   ------
v6.6  vanilla (base)    100        100               100        100
v6.12 vanilla            69        113                98.5       98.8
v6.12 avg_threshold     100        110.3             100.2       99.3
And for netperf-tcp, the results are as follows:
                           6.12                      6.12
                        vanilla   avg_free_high_threshold
Hmean 64        732.14 (  0.00%)        730.45 ( -0.23%)
Hmean 128      1417.46 (  0.00%)       1419.44 (  0.14%)
Hmean 256      2679.67 (  0.00%)       2676.45 ( -0.12%)
Hmean 1024     8328.52 (  0.00%)       8339.34 (  0.13%)
Hmean 2048    12716.98 (  0.00%)      12743.68 (  0.21%)
Hmean 3312    15787.79 (  0.00%)      15887.25 (  0.63%)
Hmean 4096    17311.91 (  0.00%)      17332.68 (  0.12%)
Hmean 8192    20310.73 (  0.00%)      20465.09 (  0.76%)
On 4/3/2025 7:06 AM, Huang, Ying wrote:

Thanks a lot for the tests and results!

It looks good to me. Can you submit a formal patch?