Re: [PATCH] mm: reduce lock contention of pcp buffer refill
From: Vlastimil Babka
Date: Wed Feb 08 2023 - 10:11:35 EST
On 2/7/23 17:11, Alexander Halbuer wrote:
> On 2/3/23 00:25, Andrew Morton wrote:
>> On Wed, 1 Feb 2023 17:25:49 +0100 Alexander Halbuer <halbuer@xxxxxxxxxxxxxxxxxxx> wrote:
>>
>>> The `rmqueue_bulk` function batches the allocation of multiple elements to
>>> refill the per-CPU buffers into a single hold of the zone lock. Each
>>> element is allocated and checked using the `check_pcp_refill` function.
>>> The check touches every related struct page which is especially expensive
>>> for higher order allocations (huge pages). This patch reduces the time
>>> holding the lock by moving the check out of the critical section, similar
>>> to the `rmqueue_buddy` function, which allocates a single element.
>>> Measurements of parallel allocation-heavy workloads show a reduction of
>>> the average huge page allocation latency by 50 percent for two cores and
>>> nearly 90 percent for 24 cores.
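(For context, the change is roughly the following shape -- a simplified
sketch of rmqueue_bulk(), not the exact diff; locals and error paths are
elided:)

        spin_lock_irqsave(&zone->lock, flags);
        for (i = 0; i < count; ++i) {
                struct page *page = __rmqueue(zone, order, migratetype,
                                              alloc_flags);
                if (unlikely(!page))
                        break;
                /* no per-page state check under the lock anymore */
                list_add_tail(&page->pcp_list, list);
                allocated++;
        }
        spin_unlock_irqrestore(&zone->lock, flags);

        /* the check now runs outside the zone->lock critical section */
        list_for_each_entry_safe(page, tmp, list, pcp_list) {
                if (unlikely(check_pcp_refill(page, order))) {
                        list_del(&page->pcp_list);
                        allocated--;
                }
        }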
>> Sounds nice.
>>
>> Were you able to test how much benefit we get by simply removing the
>> check_new_pages() call from rmqueue_bulk()?
> I did some further investigations and measurements to quantify potential
> performance gains. Benchmarks ran on a machine with 24 physical cores
> fixed at 2.1 GHz. The results show significant performance gains with the
> patch applied in parallel scenarios. Eliminating the check reduces
> allocation latency further, especially for low core counts.
Are the benchmarks done via a kernel module calling bulk allocation, or from
userspace?
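(If it is a module, I would expect something like this hypothetical timing
loop -- my guess at the setup, not your actual benchmark code:)

        #include <linux/gfp.h>
        #include <linux/timekeeping.h>
        #include <linux/vmalloc.h>

        /* average per-allocation latency in ns for order-`order` pages */
        static u64 avg_alloc_ns(unsigned int order, unsigned long n)
        {
                struct page **pages = vmalloc(n * sizeof(*pages));
                unsigned long i;
                u64 start, elapsed;

                if (!pages)
                        return 0;

                /* time the allocations only; freeing happens untimed */
                start = ktime_get_ns();
                for (i = 0; i < n; i++)
                        pages[i] = alloc_pages(GFP_KERNEL, order);
                elapsed = ktime_get_ns() - start;

                for (i = 0; i < n; i++)
                        if (pages[i])
                                __free_pages(pages[i], order);
                vfree(pages);

                return elapsed / n;
        }

Running that loop in parallel (e.g. one kthread per CPU) would presumably
produce the contention the tables below measure.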
> The following tables show the average allocation latencies for huge pages
> and normal pages for three different configurations: the unpatched kernel,
> the patched kernel, and an additional version without the check.
>
> Huge pages
> +-------+---------+-------+----------+---------+----------+
> | Cores | Default | Patch | Patch | NoCheck | NoCheck |
> | | (ns) | (ns) | Diff | (ns) | Diff |
> +-------+---------+-------+----------+---------+----------+
> | 1 | 127 | 124 | (-2.4%) | 118 | (-7.1%) |
> | 2 | 140 | 140 | (-0.0%) | 134 | (-4.3%) |
> | 3 | 143 | 142 | (-0.7%) | 134 | (-6.3%) |
> | 4 | 178 | 159 | (-10.7%) | 156 | (-12.4%) |
> | 6 | 269 | 239 | (-11.2%) | 237 | (-11.9%) |
> | 8 | 363 | 321 | (-11.6%) | 319 | (-12.1%) |
> | 10 | 454 | 409 | (-9.9%) | 409 | (-9.9%) |
> | 12 | 545 | 494 | (-9.4%) | 488 | (-10.5%) |
> | 14 | 639 | 578 | (-9.5%) | 574 | (-10.2%) |
> | 16 | 735 | 660 | (-10.2%) | 653 | (-11.2%) |
> | 20 | 915 | 826 | (-9.7%) | 815 | (-10.9%) |
> | 24 | 1105 | 992 | (-10.2%) | 982 | (-11.1%) |
> +-------+---------+-------+----------+---------+----------+
>
> Normal pages
> +-------+---------+-------+----------+---------+----------+
> | Cores | Default | Patch | Patch | NoCheck | NoCheck |
> | | (ns) | (ns) | Diff | (ns) | Diff |
> +-------+---------+-------+----------+---------+----------+
> | 1 | 2790 | 2767 | (-0.8%) | 171 | (-93.9%) |
> | 2 | 6685 | 3484 | (-47.9%) | 519 | (-92.2%) |
> | 3 | 10501 | 3599 | (-65.7%) | 855 | (-91.9%) |
> | 4 | 14264 | 3635 | (-74.5%) | 1139 | (-92.0%) |
> | 6 | 21800 | 3551 | (-83.7%) | 1713 | (-92.1%) |
> | 8 | 29563 | 3570 | (-87.9%) | 2268 | (-92.3%) |
> | 10 | 37210 | 3845 | (-89.7%) | 2872 | (-92.3%) |
> | 12 | 44780 | 4452 | (-90.1%) | 3417 | (-92.4%) |
> | 14 | 52576 | 5100 | (-90.3%) | 4020 | (-92.4%) |
> | 16 | 60118 | 5785 | (-90.4%) | 4604 | (-92.3%) |
> | 20 | 75037 | 7270 | (-90.3%) | 6486 | (-91.4%) |
> | 24 | 90226 | 8712 | (-90.3%) | 7061 | (-92.2%) |
> +-------+---------+-------+----------+---------+----------+
Nice improvements. I assume the table headings are switched? Why would
Normal be more expensive than Huge?
>>
>> Vlastimil, I find this quite confusing:
>>
>> #ifdef CONFIG_DEBUG_VM
>> /*
>>  * With DEBUG_VM enabled, order-0 pages are checked for expected state when
>>  * being allocated from pcp lists. With debug_pagealloc also enabled, they are
>>  * also checked when pcp lists are refilled from the free lists.
>>  */
>> static inline bool check_pcp_refill(struct page *page, unsigned int order)
>> {
>>         if (debug_pagealloc_enabled_static())
>>                 return check_new_pages(page, order);
>>         else
>>                 return false;
>> }
>>
>> static inline bool check_new_pcp(struct page *page, unsigned int order)
>> {
>>         return check_new_pages(page, order);
>> }
>> #else
>> /*
>>  * With DEBUG_VM disabled, free order-0 pages are checked for expected state
>>  * when pcp lists are being refilled from the free lists. With debug_pagealloc
>>  * enabled, they are also checked when being allocated from the pcp lists.
>>  */
>> static inline bool check_pcp_refill(struct page *page, unsigned int order)
>> {
>>         return check_new_pages(page, order);
>> }
>>
>> static inline bool check_new_pcp(struct page *page, unsigned int order)
>> {
>>         if (debug_pagealloc_enabled_static())
>>                 return check_new_pages(page, order);
>>         else
>>                 return false;
>> }
>> #endif /* CONFIG_DEBUG_VM */
>>
>> and the 4462b32c9285b5 changelog is a struggle to follow.
>>
>> Why are we performing *any* checks when CONFIG_DEBUG_VM=n and when
>> debug_pagealloc_enabled is false?
>>
>> Anyway, these checks sound quite costly, so let's revisit their
>> desirability?
>>
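For reference, the code above boils down to this for order-0 pages:

                     checked on pcp refill       checked on pcp alloc
                     (buddy -> pcp list)         (pcp list -> caller)
DEBUG_VM=n           always                      only with debug_pagealloc
DEBUG_VM=y           only with debug_pagealloc   always

So every page is checked at least once on its way through the pcp lists,
and twice when debug_pagealloc is enabled; the two configs differ only in
which side of the pcp list the unconditional check sits on.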