Re: [PATCH] mm: reduce lock contention of pcp buffer refill
From: Vlastimil Babka
Date: Wed Feb 08 2023 - 10:11:35 EST
On 2/7/23 17:11, Alexander Halbuer wrote:
> On 2/3/23 00:25, Andrew Morton wrote:
>> On Wed, 1 Feb 2023 17:25:49 +0100 Alexander Halbuer <halbuer@xxxxxxxxxxxxxxxxxxx> wrote:
>>
>>> The `rmqueue_bulk` function batches the allocation of multiple elements to
>>> refill the per-CPU buffers into a single hold of the zone lock. Each
>>> element is allocated and checked using the `check_pcp_refill` function.
>>> The check touches every related struct page which is especially expensive
>>> for higher order allocations (huge pages). This patch reduces the time
>>> holding the lock by moving the check out of the critical section, similar
>>> to the `rmqueue_buddy` function, which allocates a single element.
>>> Measurements of parallel allocation-heavy workloads show a reduction of
>>> the average huge page allocation latency by 50 percent for two cores and
>>> nearly 90 percent for 24 cores.
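(For context, the change is roughly the following shape -- a simplified
sketch of rmqueue_bulk(), not the exact diff; locals and error paths are
elided:)

        spin_lock_irqsave(&zone->lock, flags);
        for (i = 0; i < count; ++i) {
                struct page *page = __rmqueue(zone, order, migratetype,
                                              alloc_flags);
                if (unlikely(!page))
                        break;
                /* no per-page state check under the lock anymore */
                list_add_tail(&page->pcp_list, list);
                allocated++;
        }
        spin_unlock_irqrestore(&zone->lock, flags);

        /* the check now runs outside the zone->lock critical section */
        list_for_each_entry_safe(page, tmp, list, pcp_list) {
                if (unlikely(check_pcp_refill(page, order))) {
                        list_del(&page->pcp_list);
                        allocated--;
                }
        }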
>> Sounds nice.
>>
>> Were you able to test how much benefit we get by simply removing the
>> check_new_pages() call from rmqueue_bulk()?
> I did some further investigations and measurements to quantify potential
> performance gains. Benchmarks ran on a machine with 24 physical cores
> fixed at 2.1 GHz. The results show significant performance gains with the
> patch applied in parallel scenarios. Eliminating the check reduces
> allocation latency further, especially for low core counts.
Are the benchmarks done via a kernel module calling bulk allocation, or from
userspace?
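(If it is a module, I would expect something like this hypothetical timing
loop -- my guess at the setup, not your actual benchmark code:)

        #include <linux/gfp.h>
        #include <linux/timekeeping.h>
        #include <linux/vmalloc.h>

        /* average per-allocation latency in ns for order-`order` pages */
        static u64 avg_alloc_ns(unsigned int order, unsigned long n)
        {
                struct page **pages = vmalloc(n * sizeof(*pages));
                unsigned long i;
                u64 start, elapsed;

                if (!pages)
                        return 0;

                /* time the allocations only; freeing happens untimed */
                start = ktime_get_ns();
                for (i = 0; i < n; i++)
                        pages[i] = alloc_pages(GFP_KERNEL, order);
                elapsed = ktime_get_ns() - start;

                for (i = 0; i < n; i++)
                        if (pages[i])
                                __free_pages(pages[i], order);
                vfree(pages);

                return elapsed / n;
        }

Running that loop in parallel (e.g. one kthread per CPU) would presumably
produce the contention the tables below measure.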
> The following tables show the average allocation latencies for huge pages
> and normal pages for three different configurations: the unpatched kernel,
> the patched kernel, and an additional version without the check.
>
> Huge pages
> +-------+---------+-------+----------+---------+----------+
> | Cores | Default | Patch | Patch | NoCheck | NoCheck |
> | | (ns) | (ns) | Diff | (ns) | Diff |
> +-------+---------+-------+----------+---------+----------+
> | 1 | 127 | 124 | (-2.4%) | 118 | (-7.1%) |
> | 2 | 140 | 140 | (-0.0%) | 134 | (-4.3%) |
> | 3 | 143 | 142 | (-0.7%) | 134 | (-6.3%) |
> | 4 | 178 | 159 | (-10.7%) | 156 | (-12.4%) |
> | 6 | 269 | 239 | (-11.2%) | 237 | (-11.9%) |
> | 8 | 363 | 321 | (-11.6%) | 319 | (-12.1%) |
> | 10 | 454 | 409 | (-9.9%) | 409 | (-9.9%) |
> | 12 | 545 | 494 | (-9.4%) | 488 | (-10.5%) |
> | 14 | 639 | 578 | (-9.5%) | 574 | (-10.2%) |
> | 16 | 735 | 660 | (-10.2%) | 653 | (-11.2%) |
> | 20 | 915 | 826 | (-9.7%) | 815 | (-10.9%) |
> | 24 | 1105 | 992 | (-10.2%) | 982 | (-11.1%) |
> +-------+---------+-------+----------+---------+----------+
>
> Normal pages
> +-------+---------+-------+----------+---------+----------+
> | Cores | Default | Patch | Patch | NoCheck | NoCheck |
> | | (ns) | (ns) | Diff | (ns) | Diff |
> +-------+---------+-------+----------+---------+----------+
> | 1 | 2790 | 2767 | (-0.8%) | 171 | (-93.9%) |
> | 2 | 6685 | 3484 | (-47.9%) | 519 | (-92.2%) |
> | 3 | 10501 | 3599 | (-65.7%) | 855 | (-91.9%) |
> | 4 | 14264 | 3635 | (-74.5%) | 1139 | (-92.0%) |
> | 6 | 21800 | 3551 | (-83.7%) | 1713 | (-92.1%) |
> | 8 | 29563 | 3570 | (-87.9%) | 2268 | (-92.3%) |
> | 10 | 37210 | 3845 | (-89.7%) | 2872 | (-92.3%) |
> | 12 | 44780 | 4452 | (-90.1%) | 3417 | (-92.4%) |
> | 14 | 52576 | 5100 | (-90.3%) | 4020 | (-92.4%) |
> | 16 | 60118 | 5785 | (-90.4%) | 4604 | (-92.3%) |
> | 20 | 75037 | 7270 | (-90.3%) | 6486 | (-91.4%) |
> | 24 | 90226 | 8712 | (-90.3%) | 7061 | (-92.2%) |
> +-------+---------+-------+----------+---------+----------+
Nice improvements. I assume the table headings are switched? Why would
Normal be more expensive than Huge?
>>
>> Vlastimil, I find this quite confusing:
>>
>> #ifdef CONFIG_DEBUG_VM
>> /*
>>  * With DEBUG_VM enabled, order-0 pages are checked for expected state when
>>  * being allocated from pcp lists. With debug_pagealloc also enabled, they are
>>  * also checked when pcp lists are refilled from the free lists.
>>  */
>> static inline bool check_pcp_refill(struct page *page, unsigned int order)
>> {
>>         if (debug_pagealloc_enabled_static())
>>                 return check_new_pages(page, order);
>>         else
>>                 return false;
>> }
>>
>> static inline bool check_new_pcp(struct page *page, unsigned int order)
>> {
>>         return check_new_pages(page, order);
>> }
>> #else
>> /*
>>  * With DEBUG_VM disabled, free order-0 pages are checked for expected state
>>  * when pcp lists are being refilled from the free lists. With debug_pagealloc
>>  * enabled, they are also checked when being allocated from the pcp lists.
>>  */
>> static inline bool check_pcp_refill(struct page *page, unsigned int order)
>> {
>>         return check_new_pages(page, order);
>> }
>>
>> static inline bool check_new_pcp(struct page *page, unsigned int order)
>> {
>>         if (debug_pagealloc_enabled_static())
>>                 return check_new_pages(page, order);
>>         else
>>                 return false;
>> }
>> #endif /* CONFIG_DEBUG_VM */
>>
>> and the 4462b32c9285b5 changelog is a struggle to follow.
>>
>> Why are we performing *any* checks when CONFIG_DEBUG_VM=n and when
>> debug_pagealloc_enabled is false?
>>
>> Anyway, these checks sound quite costly, so let's revisit their
>> desirability?
>>
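For reference, the code above boils down to this for order-0 pages:

                     checked on pcp refill       checked on pcp alloc
                     (buddy -> pcp list)         (pcp list -> caller)
DEBUG_VM=n           always                      only with debug_pagealloc
DEBUG_VM=y           only with debug_pagealloc   always

So every page is checked at least once on its way through the pcp lists,
and twice when debug_pagealloc is enabled; the two configs differ only in
which side of the pcp list the unconditional check sits on.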