[PATCH v3 00/21] huge page clearing optimizations

From: Ankur Arora
Date: Mon Jun 06 2022 - 16:24:26 EST


This series introduces two optimizations in the huge page clearing path:

1. extend the clear_page() machinery to also handle extents larger
than a single page.
2. support non-cached page clearing for huge and gigantic pages.

The first optimization is useful for hugepage fault handling, the
second for prefaulting, or for gigantic pages.

The immediate motivation is to speedup creation of large VMs backed
by huge pages.

Performance
==

VM creation (192GB VM with prealloc'd 2MB backing pages) sees significant
run-time improvements:

Icelakex:
                        Time (s)             Delta (%)
  clear_page_erms()     22.37 ( +- 0.14s )             #  9.21 bytes/ns
  clear_pages_erms()    16.49 ( +- 0.06s )   -26.28%   # 12.50 bytes/ns
  clear_pages_movnt()    9.42 ( +- 0.20s )   -42.87%   # 21.88 bytes/ns

Milan:
                        Time (s)             Delta (%)
  clear_page_erms()     16.49 ( +- 0.06s )             # 12.50 bytes/ns
  clear_pages_erms()    11.82 ( +- 0.06s )   -28.32%   # 17.44 bytes/ns
  clear_pages_clzero()   4.91 ( +- 0.27s )   -58.49%   # 41.98 bytes/ns

As a side-effect, non-polluting clearing, by eliding the zero filling
of caches, also shows better LLC miss rates. For a kbuild plus
background page-clearing job, this shows up as a small improvement
(~2%) in runtime.

Discussion
==

With the motivation out of the way, the following note describes
v3's handling of past review comments (and other sticking points for
series of this nature -- especially the non-cached part -- over the
years):

1. Non-cached clearing is unnecessary on x86: x86 already uses 'REP;STOS'
which, unlike a MOVNT loop, exposes semantically richer information
that current (and/or future) processors can use to make the
same cache-elision optimization.

All true, except that a) current-gen uarchs often don't, and b) even
when they do, the kernel, by clearing at 4K granularity, doesn't
expose the extent information in a way that processors could easily
optimize for.

For a), I tested a bunch of REP-STOSB/MOVNTI/CLZERO loops with different
chunk sizes (in user-space over a VA extent of 4GB, page-size=4K.)

Intel Icelake (LLC=48MB, no_turbo=1):

  chunk-size    REP-STOSB    MOVNTI
                     MBps      MBps

          4K         9444     24510
         64K        11931     24508
          2M        12355     24524
          8M        12369     24525
         32M        12368     24523
        128M        12374     24522
         1GB        12372     24561

Which is pretty flat across chunk-sizes.


AMD Milan (LLC=32MB, boost=0):

  chunk-size    REP-STOSB    MOVNTI    CLZERO
                     MBps      MBps      MBps

          4K        13034     17815     45579
         64K        15196     18549     46038
          2M        14821     18581     39064
          8M        13964     18557     46045
         32M        22525     18560     45969
        128M        29311     18581     38924
         1GB        35807     18574     45981

The scaling on Milan starts right around chunk=LLC-size. It does
seem to asymptotically approach CLZERO performance, but the scaling
is linear, not a step function.

For b), as I mention above, the kernel, by zeroing at 4K granularity,
doesn't send the right signal to the uarch (though the largest
extent we can use for huge pages is 2MB (and lower for preemptible
kernels), which from these numbers is not large enough.)
Still, using clear_page_extent() with larger extents would send the
uarch a hint that it could capitalize on in the future.

This is addressed in patches 1-6:
"mm, huge-page: reorder arguments to process_huge_page()"
"mm, huge-page: refactor process_subpage()"
"clear_page: add generic clear_user_pages()"
"mm, clear_huge_page: support clear_user_pages()"
"mm/huge_page: generalize process_huge_page()"
"x86/clear_page: add clear_pages()"

with patch 5, "mm/huge_page: generalize process_huge_page()"
containing the core logic.

2. Non-caching stores (via MOVNTI, CLZERO on x86) are weakly ordered with
respect to the cache hierarchy and unless they are combined with an
appropriate fence, are unsafe to use.

This is true and is a problem. Patch 12, "sparse: add address_space
__incoherent" adds a new sparse address_space which is used in
the architectural interfaces to make sure that any user is cognizant
of its use:

void clear_user_pages_incoherent(__incoherent void *page, ...)
void clear_pages_incoherent(__incoherent void *page, ...)

One other place it is needed (and is missing) is in highmem:
void clear_user_highpages_incoherent(struct page *page, ...).

Given the natural highmem interface, I couldn't think of a good
way to add the annotation here.

3. Non-caching stores are generally slower than cached for extents
smaller than LLC-size, and faster for larger ones.

This means that if you choose the non-caching path for too small an
extent, you would see performance regressions. There is of course
benefit in not filling the cache with zeroes, but that is a somewhat
nebulous advantage and AFAICT there are no representative tests that
probe for it.
(Note that this slowness isn't a consequence of the extra fence --
that is expensive but stops being noticeable for chunk-sizes >=
~32K-128K depending on uarch.)

This is handled by adding an arch specific threshold (with a
default CLEAR_PAGE_NON_CACHING_THRESHOLD=8MB.) in patches 15 and 16,
"mm/clear_page: add clear_page_non_caching_threshold()",
"x86/clear_page: add arch_clear_page_non_caching_threshold()".

Further, a single call to clear_huge_pages() or get_/pin_user_pages()
might only see a small portion of an extent being cleared in each
iteration. To make sure we choose non-caching stores when working with
large extents, patch 18, "gup: add FOLL_HINT_BULK,
FAULT_FLAG_NON_CACHING", adds a new flag that gup users can use for
this purpose. This is used in patch 20, "vfio_iommu_type1: specify
FOLL_HINT_BULK to pin_user_pages()", to pin process memory when
attaching passthrough PCIe devices.

The get_user_pages() logic to handle these flags is in patch 19,
"gup: hint non-caching if clearing large regions".

4. Subpoint of 3) above (non-caching stores are faster for extents
larger than LLC-size) is generally true, with a side of Brownian
motion thrown in. For instance, MOVNTI (for > LLC-size) performs well
on Broadwell and Ice Lake, but on Skylake/Cascade Lake -- sandwiched
in between the two -- it does not.

To deal with this, use Ingo's suggestion of "trust but verify"
(https://lore.kernel.org/lkml/20201014153127.GB1424414@xxxxxxxxx/),
where we enable MOVNT by default and only disable it on slow
uarchs.
If the non-caching path ends up being a part of the kernel, uarchs
that regress would hopefully show up early enough in chip testing.

Patch 11, "x86/cpuid: add X86_FEATURE_MOVNT_SLOW" adds this logic
and patch 21, "x86/cpu/intel: set X86_FEATURE_MOVNT_SLOW for
Skylake" disables the non-caching path for Skylake.

Performance numbers are in patches 6 and 19, "x86/clear_page: add
clear_pages()", "gup: hint non-caching if clearing large regions".

Also at:
github.com/terminus/linux clear-page-non-caching.upstream-v3

Comments appreciated!

Changelog
==

v2: https://lore.kernel.org/lkml/20211020170305.376118-1-ankur.a.arora@xxxxxxxxxx/
- Add multi-page clearing: this addresses comments from Ingo
(from v1), and from an offlist discussion with Linus.
- Rename clear_pages_uncached() to make the lack of safety
more obvious: this addresses comments from Andy Lutomirski.
- Simplify the clear_huge_page() changes.
- Usual cleanups etc.
- Rebased to v5.18.


v1: https://lore.kernel.org/lkml/20201014083300.19077-1-ankur.a.arora@xxxxxxxxxx/
- Make the unsafe nature of clear_page_uncached() more obvious.
- Invert X86_FEATURE_NT_GOOD to X86_FEATURE_MOVNT_SLOW, so we don't
have to explicitly enable it for every new model: suggestion from
Ingo Molnar.
- Add GUP path (and appropriate threshold) to allow the uncached path
to be used for huge pages.
- Make the code more generic so it's tied to fewer x86 specific assumptions.

Thanks
Ankur

Ankur Arora (21):
mm, huge-page: reorder arguments to process_huge_page()
mm, huge-page: refactor process_subpage()
clear_page: add generic clear_user_pages()
mm, clear_huge_page: support clear_user_pages()
mm/huge_page: generalize process_huge_page()
x86/clear_page: add clear_pages()
x86/asm: add memset_movnti()
perf bench: add memset_movnti()
x86/asm: add clear_pages_movnt()
x86/asm: add clear_pages_clzero()
x86/cpuid: add X86_FEATURE_MOVNT_SLOW
sparse: add address_space __incoherent
clear_page: add generic clear_user_pages_incoherent()
x86/clear_page: add clear_pages_incoherent()
mm/clear_page: add clear_page_non_caching_threshold()
x86/clear_page: add arch_clear_page_non_caching_threshold()
clear_huge_page: use non-cached clearing
gup: add FOLL_HINT_BULK, FAULT_FLAG_NON_CACHING
gup: hint non-caching if clearing large regions
vfio_iommu_type1: specify FOLL_HINT_BULK to pin_user_pages()
x86/cpu/intel: set X86_FEATURE_MOVNT_SLOW for Skylake

arch/alpha/include/asm/page.h | 1 +
arch/arc/include/asm/page.h | 1 +
arch/arm/include/asm/page.h | 1 +
arch/arm64/include/asm/page.h | 1 +
arch/csky/include/asm/page.h | 1 +
arch/hexagon/include/asm/page.h | 1 +
arch/ia64/include/asm/page.h | 1 +
arch/m68k/include/asm/page.h | 1 +
arch/microblaze/include/asm/page.h | 1 +
arch/mips/include/asm/page.h | 1 +
arch/nios2/include/asm/page.h | 2 +
arch/openrisc/include/asm/page.h | 1 +
arch/parisc/include/asm/page.h | 1 +
arch/powerpc/include/asm/page.h | 1 +
arch/riscv/include/asm/page.h | 1 +
arch/s390/include/asm/page.h | 1 +
arch/sh/include/asm/page.h | 1 +
arch/sparc/include/asm/page_32.h | 1 +
arch/sparc/include/asm/page_64.h | 1 +
arch/um/include/asm/page.h | 1 +
arch/x86/include/asm/cacheinfo.h | 1 +
arch/x86/include/asm/cpufeatures.h | 1 +
arch/x86/include/asm/page.h | 26 ++
arch/x86/include/asm/page_64.h | 64 ++++-
arch/x86/kernel/cpu/amd.c | 2 +
arch/x86/kernel/cpu/bugs.c | 30 +++
arch/x86/kernel/cpu/cacheinfo.c | 13 +
arch/x86/kernel/cpu/cpu.h | 2 +
arch/x86/kernel/cpu/intel.c | 2 +
arch/x86/kernel/setup.c | 6 +
arch/x86/lib/clear_page_64.S | 78 ++++--
arch/x86/lib/memset_64.S | 68 ++---
arch/xtensa/include/asm/page.h | 1 +
drivers/vfio/vfio_iommu_type1.c | 3 +
fs/hugetlbfs/inode.c | 7 +-
include/asm-generic/clear_page.h | 69 +++++
include/asm-generic/page.h | 1 +
include/linux/compiler_types.h | 2 +
include/linux/highmem.h | 46 ++++
include/linux/mm.h | 10 +-
include/linux/mm_types.h | 2 +
mm/gup.c | 18 ++
mm/huge_memory.c | 3 +-
mm/hugetlb.c | 10 +-
mm/memory.c | 264 +++++++++++++++----
tools/arch/x86/lib/memset_64.S | 68 ++---
tools/perf/bench/mem-memset-x86-64-asm-def.h | 6 +-
47 files changed, 680 insertions(+), 144 deletions(-)
create mode 100644 include/asm-generic/clear_page.h

--
2.31.1