Re: [PATCH v2 0/9] x86/clear_huge_page: multi-page clearing

From: Ankur Arora
Date: Tue Sep 05 2023 - 18:15:48 EST



Mateusz Guzik <mjguzik@xxxxxxxxx> writes:

> On Wed, Aug 30, 2023 at 11:49:49AM -0700, Ankur Arora wrote:
>> This series adds a multi-page clearing primitive, clear_pages(),
>> which enables more effective use of x86 string instructions by
>> advertising the real region-size to be cleared.
>>
>> Region-size can be used as a hint by uarchs to optimize the
>> clearing.
>>
>> Also add allow_resched(), which marks a code-section as allowing
>> rescheduling in the irqentry_exit path. This allows clear_pages()
>> to get by without having to call cond_resched() periodically.
>> (preempt_model_full() already handles this via
>> irqentry_exit_cond_resched(), so we handle this similarly for
>> preempt_model_none() and preempt_model_voluntary().)
>>
>> Performance
>> ==
>>
>> With this, demand-fault performance gets a decent increase:
>>
>> *Milan*      mm/clear_huge_page   x86/clear_huge_page   change
>>                   (GB/s)               (GB/s)
>>
>> pg-sz=2MB         14.55                19.29             +32.5%
>> pg-sz=1GB         19.34                49.60            +156.4%
>>
>> Milan (and some other AMD Zen uarchs tested) take advantage of the
>> hint to elide cacheline allocation for pg-sz=1GB. The cut-off for
>> this optimization seems to be at around region-size > LLC-size so
>> the pg-sz=2MB load still allocates cachelines.
>>
>
> Have you benchmarked clzero? It is an AMD-specific instruction issuing
> non-temporal stores. It is definitely something to try out for 1G pages.

Thanks for the suggestion. Been a little while, but see the numbers here:
https://lore.kernel.org/linux-mm/20220606203725.1313715-15-ankur.a.arora@xxxxxxxxxx/

> One would think rep stosq has to be at least not worse since the CPU is
> explicitly told what to do and is free to optimize it however it sees
> fit, but the rep prefix has a long history of underperforming.

I agree that historically REP variants have been all over the place.
But, if you look at the numbers, REP; STOS and CLZERO are pretty close,
at least for the current generation of AMD uarchs.

Now, current uarch performance is no guarantee for future uarchs, but
if the kernel uses REP; STOS in performance paths, then hopefully
they'll also show up in internal CPU regression benchmarks, which might
mean that the high performance persists.

That said, I think using CLZERO/MOVNT is a good idea -- though as a
fallback option, or where it is better to send an explicit non-temporal
hint while, say, clearing a 2MB region.


Thanks
Ankur

> I'm not saying it is going to be better, but that this should be tested,
> albeit one can easily argue this can be done at a later date.
>
>
> I would do it myself but my access to AMD CPUs is limited.
>
>>
>> *Icelakex*   mm/clear_huge_page   x86/clear_huge_page   change
>>                   (GB/s)               (GB/s)
>>
>> pg-sz=2MB          9.19                12.94             +40.8%
>> pg-sz=1GB          9.36                12.97             +38.5%
>>
>> Icelakex sees a decent improvement in performance, but continues to
>> allocate cachelines for both region-sizes.
>>
>>
>> Negative: there is a downside to clearing in larger chunks: the
>> current approach clears page-at-a-time, narrowing towards
>> the faulting subpage. This has better cache characteristics for
>> some sequential access workloads where subpages near the faulting
>> page have a greater likelihood of access.
>>
>> I'm not sure if there are real cases which care about this workload
>> but one example is the vm-scalability/case-anon-w-seq-hugetlb test.
>> This test starts a process for each online CPU, with each process
>> writing sequentially to its set of hugepages.
>>
>> The bottleneck here is the memory pipe and so the improvement in
>> stime is limited, and because the clearing is less cache-optimal
>> now, utime suffers from worse user cache misses.
>>
>> *Icelakex*              mm/clear_huge_page   x86/clear_huge_page   change
>> (tasks=128, mem=4GB/task)
>>
>> stime          286.8 +- 3.6%    243.9 +- 4.1%   -14.9%
>> utime          497.7 +- 4.1%    553.5 +- 2.0%   +11.2%
>> wall-clock       6.9 +- 2.8%      7.0 +- 1.4%    +1.4%
>>
>>
>> *Milan*                 mm/clear_huge_page   x86/clear_huge_page   change
>> (mem=1GB/task, tasks=512)
>>
>> stime          501.3 +- 1.4%    498.0 +- 0.9%    -0.5%
>> utime          298.7 +- 1.1%    335.0 +- 2.2%   +12.1%
>> wall-clock       3.5 +- 2.8%      3.8 +- 2.6%    +8.5%
>>
>> The same test performs better with a smaller number of processes,
>> since more backend bandwidth is available and thus the improved stime
>> compensates for the worse utime.
>>
>> This could be improved by using more circuitous chunking (somewhat
>> like this:
>> https://lore.kernel.org/lkml/20220606203725.1313715-1-ankur.a.arora@xxxxxxxxxx/).
>> But I'm not sure if it is worth doing. Opinions?
>>
>> Patches
>> ==
>>
>> Patches 1, 2, 3:
>> "mm/clear_huge_page: allow arch override for clear_huge_page()",
>> "mm/huge_page: separate clear_huge_page() and copy_huge_page()",
>> "mm/huge_page: cleanup clear_/copy_subpage()"
>> are minor. The first one allows clear_huge_page() to have an
>> arch specific version and the other two are mechanical cleanup
>> patches.
>>
>> Patches 4, 5, 6:
>> "x86/clear_page: extend clear_page*() for multi-page clearing",
>> "x86/clear_page: add clear_pages()",
>> "x86/clear_huge_page: multi-page clearing"
>> define the x86 specific clear_pages() and clear_huge_pages().
>>
>> Patches 7, 8:
>> "sched: define TIF_ALLOW_RESCHED"
>> "irqentry: define irqentry_exit_allow_resched()"
>> define allow_resched() to demarcate preemptible sections.
>>
>> This gets used in patch 9:
>> "x86/clear_huge_page: make clear_contig_region() preemptible".
>>
>> Changelog:
>>
>> v2:
>> - Addressed review comments from peterz, tglx.
>> - Removed clear_user_pages(), and CONFIG_X86_32:clear_pages()
>> - General code cleanup
>>
>> Also at:
>> github.com/terminus/linux clear-pages.v2
>>
>> Comments appreciated!
>>
>> Ankur Arora (9):
>> mm/clear_huge_page: allow arch override for clear_huge_page()
>> mm/huge_page: separate clear_huge_page() and copy_huge_page()
>> mm/huge_page: cleanup clear_/copy_subpage()
>> x86/clear_page: extend clear_page*() for multi-page clearing
>> x86/clear_page: add clear_pages()
>> x86/clear_huge_page: multi-page clearing
>> sched: define TIF_ALLOW_RESCHED
>> irqentry: define irqentry_exit_allow_resched()
>> x86/clear_huge_page: make clear_contig_region() preemptible
>>
>> arch/x86/include/asm/page_64.h | 27 +++--
>> arch/x86/include/asm/thread_info.h | 2 +
>> arch/x86/lib/clear_page_64.S | 52 ++++++---
>> arch/x86/mm/hugetlbpage.c | 59 ++++++++++
>> include/linux/entry-common.h | 13 +++
>> include/linux/sched.h | 30 +++++
>> kernel/entry/common.c | 13 ++-
>> kernel/sched/core.c | 32 ++---
>> mm/memory.c | 181 +++++++++++++++++------------
>> 9 files changed, 297 insertions(+), 112 deletions(-)
>>
>> --
>> 2.31.1
>>
>>


--
ankur