Re: [PATCH v3 4/4] x86/folio_zero_user: multi-page clearing
From: Ankur Arora
Date: Tue Apr 15 2025 - 17:47:09 EST
Mateusz Guzik <mjguzik@xxxxxxxxx> writes:
> On Sun, Apr 13, 2025 at 08:46:07PM -0700, Ankur Arora wrote:
>> clear_pages_rep() and clear_pages_erms() use string instructions to zero
>> memory. When operating on more than a single page, we can use them more
>> effectively by explicitly advertising the region size to the processor,
>> which can use that as a hint to optimize the clearing (e.g., by eliding
>> cacheline allocation).
>>
>> As a secondary benefit, string instructions are typically microcoded,
>> and working with larger regions helps amortize the cost of the decode.
>>
>> When zeroing a 2MB page, maximize spatial locality by clearing in three
>> sections: the region to the left of the fault, the region to the right,
>> and finally the faulting page with its immediate neighbourhood, cleared
>> last so that it stays cache-hot.
>>
>> Performance
>> ==
>>
>> Use mmap(MAP_HUGETLB) to demand-fault a 64GB region on the local
>> NUMA node.
>>
>> Milan (EPYC 7J13, boost=0, preempt=full|lazy):
>>
>>                  mm/folio_zero_user      x86/folio_zero_user      change
>>                   (GB/s +- stddev)         (GB/s +- stddev)
>>
>>   pg-sz=2MB        11.89 +- 0.78%           16.12 +- 0.12%        +  35.5%
>>   pg-sz=1GB        16.51 +- 0.54%           42.80 +- 3.48%        + 159.2%
>>
>> Milan uses an LLC-size threshold (~32MB) for eliding cacheline
>> allocation, so we see a drop-off in cacheline allocations for pg-sz=1GB.
>>
>> pg-sz=1GB:
>> - 9,250,034,512 cycles # 2.418 GHz ( +- 0.43% ) (46.16%)
>> - 544,878,976 instructions # 0.06 insn per cycle
>> - 2,331,332,516 L1-dcache-loads # 609.471 M/sec ( +- 0.03% ) (46.16%)
>> - 1,075,122,960 L1-dcache-load-misses # 46.12% of all L1-dcache accesses ( +- 0.01% ) (46.15%)
>>
>> + 3,688,681,006 cycles # 2.420 GHz ( +- 3.48% ) (46.01%)
>> + 10,979,121 instructions # 0.00 insn per cycle
>> + 31,829,258 L1-dcache-loads # 20.881 M/sec ( +- 4.92% ) (46.34%)
>> + 13,677,295 L1-dcache-load-misses # 42.97% of all L1-dcache accesses ( +- 6.15% ) (46.32%)
>>
>> That's not the case with pg-sz=2MB, where we also perform better but
>> the number of cacheline allocations remains the same.
>>
>> It's not entirely clear why the performance for pg-sz=2MB improves. We
>> decode fewer instructions and the hardware prefetcher can do a better
>> job, but the perf stats for both of those aren't convincing enough to
>> explain an improvement of ~30%.
>>
>> pg-sz=2MB:
>> - 13,110,306,584 cycles # 2.418 GHz ( +- 0.48% ) (46.13%)
>> - 607,589,360 instructions # 0.05 insn per cycle
>> - 2,416,130,434 L1-dcache-loads # 445.682 M/sec ( +- 0.08% ) (46.19%)
>> - 1,080,187,594 L1-dcache-load-misses # 44.71% of all L1-dcache accesses ( +- 0.01% ) (46.18%)
>>
>> + 9,624,624,178 cycles # 2.418 GHz ( +- 0.01% ) (46.13%)
>> + 277,336,691 instructions # 0.03 insn per cycle
>> + 2,251,220,599 L1-dcache-loads # 565.624 M/sec ( +- 0.01% ) (46.20%)
>> + 1,092,386,130 L1-dcache-load-misses # 48.52% of all L1-dcache accesses ( +- 0.02% ) (46.19%)
>>
>> Icelakex (Platinum 8358, no_turbo=1, preempt=full|lazy):
>>
>>                  mm/folio_zero_user      x86/folio_zero_user      change
>>                   (GB/s +- stddev)         (GB/s +- stddev)
>>
>>   pg-sz=2MB         7.95 +- 0.30%           10.90 +- 0.26%        + 37.10%
>>   pg-sz=1GB         8.01 +- 0.24%           11.26 +- 0.48%        + 40.57%
>>
>> For both page sizes, Icelakex behaves similarly to Milan at pg-sz=2MB:
>> we see a drop in cycles but no drop in cacheline allocation.
>>
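[ For reference, the measurement quoted above is a demand-fault sweep over
  an mmap(MAP_HUGETLB) region. A minimal userspace sketch of that kind of
  harness is below; the 64GB size matches the numbers above, but the exact
  flags, stride and NUMA pinning (e.g. via numactl) are assumed rather than
  taken from the actual scripts. ]

#define _GNU_SOURCE
#include <stdio.h>
#include <sys/mman.h>
#include <time.h>

#define REGION_SZ	(64UL << 30)	/* 64GB region, as above */
#define STRIDE		(2UL << 20)	/* touch each 2MB page once */

int main(void)
{
	struct timespec t0, t1;
	char *p;

	/* add MAP_HUGE_1GB for the pg-sz=1GB runs */
	p = mmap(NULL, REGION_SZ, PROT_READ | PROT_WRITE,
		 MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
	if (p == MAP_FAILED)
		return 1;

	clock_gettime(CLOCK_MONOTONIC, &t0);
	for (unsigned long off = 0; off < REGION_SZ; off += STRIDE)
		p[off] = 1;	/* demand fault -> folio_zero_user() */
	clock_gettime(CLOCK_MONOTONIC, &t1);

	printf("%.2f GB/s\n", (double)(REGION_SZ >> 30) /
	       ((t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9));
	return 0;
}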
>
> Back when I was young and handsome and 32-bit x86 was king, people
> assumed 4K pages needed to be cleared with non-temporal stores to avoid
> evicting stuff from caches. I had never seen measurements showing this
> has the intended effect. Some time after this became a thing I did see
> measurements showing that this in fact *increases* cache misses. I am
> not saying this was necessarily the case for all x86 uarchs, merely that
> the sensible-sounding assumption turned bogus at some point (if it was
> ever legit).
That was a long time ago though ;-). And your point makes sense for
small page sizes. But consider that zeroing a 1GB page can easily blow
away an L3 cache for absolutely nothing gained -- probabilistically,
whatever part of the page is left in the cache will never be accessed.
Now, you could argue that the situation is less clear for 2MB pages.
> This brings me to the multi-stage clearing employed here for locality.
> While it sounds great on paper, for all I know it does not provide any
> advantage. It may very well be harmful, by preventing the CPU from
> knowing what you are trying to do.
>
> I think doing this warrants obtaining stats from some real workloads,
> but given how time-consuming this can be I think it would be tolerable
> to skip it for now.
>
>> Performance for preempt=none|voluntary remains unchanged.
>>
>
> So I was under the impression the benefit would be realized for all
> kernels.
>
> I don't know how preemption support is implemented on Linux. Do you
> always get an IPI?
No. The need-resched bit is common. It's just that there's no preemption
via irqentry, only synchronous calls to cond_resched() (as you mention
below).

Zeroing via a subroutine-like instruction (rep; stos) is incompatible with
synchronous calls to cond_resched(), so this code is explicitly not called
for none/voluntary (see patch 3).

That said, I'll probably take Ingo's suggestion of chunking things up in,
say, 8/16MB portions for cooperative preemption models. Roughly along the
lines of the sketch below.
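[ Sketch only: the 8MB chunk size, the naming, and the exact loop
  structure are placeholders, not what an eventual patch would
  necessarily do. ]

#include <linux/mm.h>
#include <linux/minmax.h>
#include <linux/sched.h>
#include <linux/sizes.h>

/* Assumed chunk size; would presumably end up tunable or model-dependent. */
#define CLEAR_CHUNK_NR_PAGES	(SZ_8M / PAGE_SIZE)

static void folio_zero_chunked(struct folio *folio)
{
	unsigned long nr = folio_nr_pages(folio);
	unsigned long i, len;

	for (i = 0; i < nr; i += len) {
		len = min_t(unsigned long, nr - i, CLEAR_CHUNK_NR_PAGES);

		/* Bounded rep; stos stretch ... */
		clear_pages(page_address(folio_page(folio, i)), len);

		/* ... so we can poll need-resched in between. */
		cond_resched();
	}
}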
Ankur
> I was thinking something like this: a per-cpu var akin to the preemption
> count, but indicating that the particular code section is fully preemptible.
>
> Then:
>
> preemptible_enter();
> clear_pages();
> preemptible_exit();
>
> for simpler handling of the var it could prevent migration to other
> CPUs.
>
> Then the IPI handler for preemption would check if ->preemptible is set
> and the preemption-disable count is zero, in which case it would take you
> off the CPU.
>
> If this is a problem, then a better granularity would help (say 8 pages
> between cond_rescheds?)
>
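[ For concreteness, the marker scheme suggested above might look roughly
  like the sketch below. The per-cpu flag and the enter/exit helpers are
  hypothetical names; only the per-cpu and migrate-disable primitives are
  existing kernel APIs, and nothing here is part of this series. ]

#include <linux/percpu.h>
#include <linux/preempt.h>

/* Hypothetical per-cpu flag: "this section is fully preemptible". */
static DEFINE_PER_CPU(int, clearing_preemptible);

static inline void preemptible_enter(void)
{
	migrate_disable();		/* keep the flag on this CPU */
	this_cpu_write(clearing_preemptible, 1);
}

static inline void preemptible_exit(void)
{
	this_cpu_write(clearing_preemptible, 0);
	migrate_enable();
}

/*
 * A resched IPI could then check that clearing_preemptible is set and
 * that the interrupted context otherwise has preemption enabled, and if
 * so schedule the interrupted task out even on none/voluntary, making
 *
 *	preemptible_enter();
 *	clear_pages(...);
 *	preemptible_exit();
 *
 * safe without explicit cond_resched() calls.
 */

(A real version would also need to deal with the flag across an actual
context switch, which this glosses over.)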
>> Signed-off-by: Ankur Arora <ankur.a.arora@xxxxxxxxxx>
>> ---
>> arch/x86/mm/Makefile | 1 +
>> arch/x86/mm/memory.c | 60 ++++++++++++++++++++++++++++++++++++++++++++
>> include/linux/mm.h | 1 +
>> 3 files changed, 62 insertions(+)
>> create mode 100644 arch/x86/mm/memory.c
>>
>> diff --git a/arch/x86/mm/Makefile b/arch/x86/mm/Makefile
>> index 32035d5be5a0..e61b4d331cdf 100644
>> --- a/arch/x86/mm/Makefile
>> +++ b/arch/x86/mm/Makefile
>> @@ -55,6 +55,7 @@ obj-$(CONFIG_MMIOTRACE_TEST) += testmmiotrace.o
>> obj-$(CONFIG_NUMA) += numa.o numa_$(BITS).o
>> obj-$(CONFIG_AMD_NUMA) += amdtopology.o
>> obj-$(CONFIG_ACPI_NUMA) += srat.o
>> +obj-$(CONFIG_PREEMPTION) += memory.o
>>
>> obj-$(CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS) += pkeys.o
>> obj-$(CONFIG_RANDOMIZE_MEMORY) += kaslr.o
>> diff --git a/arch/x86/mm/memory.c b/arch/x86/mm/memory.c
>> new file mode 100644
>> index 000000000000..99851c246fcc
>> --- /dev/null
>> +++ b/arch/x86/mm/memory.c
>> @@ -0,0 +1,60 @@
>> +// SPDX-License-Identifier: GPL-2.0-or-later
>> +#include <linux/mm.h>
>> +#include <linux/range.h>
>> +#include <linux/minmax.h>
>> +
>> +#ifndef CONFIG_HIGHMEM
>> +/*
>> + * folio_zero_user_preemptible(): multi-page clearing variant of folio_zero_user().
>> + *
>> + * Taking inspiration from the common code variant, we split the zeroing in
>> + * three parts: left of the fault, right of the fault, and up to 5 pages
>> + * in the immediate neighbourhood of the target page.
>> + *
>> + * Cleared in that order to keep cache lines of the target region hot.
>> + *
>> + * For gigantic pages, there is no expectation of cache locality so just do a
>> + * straight zero.
>> + */
>> +void folio_zero_user_preemptible(struct folio *folio, unsigned long addr_hint)
>> +{
>> + unsigned long base_addr = ALIGN_DOWN(addr_hint, folio_size(folio));
>> + const long fault_idx = (addr_hint - base_addr) / PAGE_SIZE;
>> + const struct range pg = DEFINE_RANGE(0, folio_nr_pages(folio) - 1);
>> + int width = 2; /* pages cleared last on either side */
>> + struct range r[3];
>> + int i;
>> +
>> + if (folio_nr_pages(folio) > MAX_ORDER_NR_PAGES) {
>> + clear_pages(page_address(folio_page(folio, 0)), folio_nr_pages(folio));
>> + goto out;
>> + }
>> +
>> + /*
>> + * Faulting page and its immediate neighbourhood. Cleared at the end to
>> + * ensure it sticks around in the cache.
>> + */
>> + r[2] = DEFINE_RANGE(clamp_t(s64, fault_idx - width, pg.start, pg.end),
>> + clamp_t(s64, fault_idx + width, pg.start, pg.end));
>> +
>> + /* Region to the left of the fault */
>> + r[1] = DEFINE_RANGE(pg.start,
>> + clamp_t(s64, r[2].start-1, pg.start-1, r[2].start));
>> +
>> + /* Region to the right of the fault: always valid for the common fault_idx=0 case. */
>> + r[0] = DEFINE_RANGE(clamp_t(s64, r[2].end+1, r[2].end, pg.end+1),
>> + pg.end);
>> +
>> + for (i = 0; i <= 2; i++) {
>> + int len = range_len(&r[i]);
>> +
>> + if (len > 0)
>> + clear_pages(page_address(folio_page(folio, r[i].start)), len);
>> + }
>> +
>> +out:
>> + /* Explicitly invoke cond_resched() to handle any live patching necessary. */
>> + cond_resched();
>> +}
>> +
>> +#endif /* CONFIG_HIGHMEM */
>> diff --git a/include/linux/mm.h b/include/linux/mm.h
>> index b7f13f087954..b57512da8173 100644
>> --- a/include/linux/mm.h
>> +++ b/include/linux/mm.h
>> @@ -4114,6 +4114,7 @@ enum mf_action_page_type {
>> };
>>
>> #if defined(CONFIG_TRANSPARENT_HUGEPAGE) || defined(CONFIG_HUGETLBFS)
>> +void folio_zero_user_preemptible(struct folio *folio, unsigned long addr_hint);
>> void folio_zero_user(struct folio *folio, unsigned long addr_hint);
>> int copy_user_large_folio(struct folio *dst, struct folio *src,
>> unsigned long addr_hint,
>> --
>> 2.31.1
>>
>>
--
ankur