[PATCH 0/8] Use uncached writes while clearing gigantic pages

From: Ankur Arora
Date: Wed Oct 14 2020 - 04:32:50 EST


This series adds clear_page_nt(), a non-temporal MOV (MOVNTI) based
clear_page().

The immediate use case is to speed up creation of large (~2TB) guest
VMs. Memory for these guests is allocated via huge/gigantic pages which
are faulted in early.

The intent behind using non-temporal writes is to minimize allocation of
unnecessary cachelines. This helps in minimizing cache pollution, and
potentially also speeds up zeroing of large extents.
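
For readers who want to experiment outside the kernel, the idea can be
approximated in user space. This is only a sketch and not the code in
this series: clear_extent_nt() is a hypothetical name, it uses the
_mm_stream_si64() compiler intrinsic (which emits MOVNTI on x86-64), and
it falls back to memset() on other targets.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#if defined(__x86_64__) && defined(__SSE2__)
#include <emmintrin.h>	/* _mm_stream_si64() (MOVNTI), _mm_sfence() */
#define HAVE_MOVNTI 1
#else
#define HAVE_MOVNTI 0
#endif

/*
 * Hypothetical user-space analogue of a non-temporal page clear: zero
 * 'len' bytes at 'dst'. 'dst' must be 8-byte aligned and 'len' must be
 * a multiple of 8.
 */
static void clear_extent_nt(void *dst, size_t len)
{
	assert(((uintptr_t)dst & 7) == 0 && (len & 7) == 0);
#if HAVE_MOVNTI
	long long *p = dst;
	size_t i;

	/* Non-temporal stores: avoid allocating cachelines for the zeroes. */
	for (i = 0; i < len / 8; i++)
		_mm_stream_si64(&p[i], 0);

	/* Order the weakly-ordered NT stores before any later stores. */
	_mm_sfence();
#else
	/* Assumption: no MOVNTI available here, use ordinary cached stores. */
	memset(dst, 0, len);
#endif
}
```

The trailing SFENCE matters because non-temporal stores are weakly
ordered with respect to other stores.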

That said, uncached writes are not always a win, as can be seen in
these 'perf bench mem memset' numbers comparing clear_page_erms() and
clear_page_nt():

Intel Broadwellx:
              x86-64-stosb (5 runs)     x86-64-movnt (5 runs)     speedup
             -----------------------   -----------------------    -------
     size        BW      (  pstdev )       BW      (  pstdev )
     16MB     17.35 GB/s ( +- 9.27%)    11.83 GB/s ( +- 0.19%)    -31.81%
    128MB      5.31 GB/s ( +- 0.13%)    11.72 GB/s ( +- 0.44%)   +121.84%

AMD Rome:
              x86-64-stosq (5 runs)     x86-64-movnt (5 runs)     speedup
             -----------------------   -----------------------    -------
     size        BW      (  pstdev )       BW      (  pstdev )
     16MB     15.39 GB/s ( +- 9.14%)    14.56 GB/s ( +-19.43%)     -5.39%
    128MB     11.04 GB/s ( +- 4.87%)    14.49 GB/s ( +-13.22%)    +31.25%

Intel Skylakex:
              x86-64-stosb (5 runs)     x86-64-movnt (5 runs)     speedup
             -----------------------   -----------------------    -------
     size        BW      (  pstdev )       BW      (  pstdev )
     16MB     20.38 GB/s ( +- 2.58%)     6.25 GB/s ( +- 0.41%)    -69.28%
    128MB      6.52 GB/s ( +- 0.14%)     6.31 GB/s ( +- 0.47%)     -3.22%

(All of the machines in these tests had a minimum of 25MB L3 cache per
socket.)

There are two performance issues:
- uncached writes typically perform better only for region sizes
  around or larger than ~LLC-size.
- MOVNTI does not always perform well on all microarchitectures.

We handle the first issue by only using clear_page_nt() for GB pages.

That leaves out page zeroing for 2MB pages, which is a size that's large
enough that uncached writes might have meaningful cache benefits but at
the same time is small enough that uncached writes would end up being
slower.

We can handle a subset of the 2MB case -- mmaps with MAP_POPULATE -- by
means of an uncached-or-cached hint decided based on a threshold size.
This would apply to maps backed by any page-size.

This case is not handled in this series -- I wanted to sanity check the
high level approach before attempting that.
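
To make the proposed hint concrete, here is a minimal sketch of what a
threshold-based decision might look like. Nothing below is part of this
series: want_uncached_clear() and its parameters are hypothetical, and
using the LLC size as the threshold is an assumption based on the
numbers above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical hint: prefer uncached clearing only when the extent being
 * populated is at least as large as the threshold (assumed here to be
 * roughly the last-level cache size).
 */
static bool want_uncached_clear(size_t extent_bytes, size_t threshold_bytes)
{
	assert(threshold_bytes > 0);
	return extent_bytes >= threshold_bytes;
}
```

With a 25MB threshold, a 1GB populate would take the uncached path,
while a single 2MB page would keep the cached one.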

We handle the second issue by adding a synthetic cpu-feature,
X86_FEATURE_NT_GOOD, which is only enabled for microarchitectures where
MOVNTI performs well.
(Relatedly, I thought I had independently decided to use ALTERNATIVES
to deal with this, but more likely I had just internalized it from this
discussion:
https://lore.kernel.org/linux-mm/20200316101856.GH11482@xxxxxxxxxxxxxx/#t)

Accordingly this series enables X86_FEATURE_NT_GOOD for Intel Broadwellx
and AMD Rome. (In my testing, the performance was also good for some
pre-production models but this series leaves them out.)
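
As a rough user-space model of the feature gating (the series itself
relies on ALTERNATIVES patching, not on a function pointer), the
selection could be pictured as below. struct cpu_model, the nt_good
field, select_clear_uncached() and the call counter are all
hypothetical stand-ins for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

typedef void (*clear_fn)(void *dst, size_t len);

/* Counts calls to the uncached stand-in; for illustration only. */
static int uncached_calls;

/* Stand-in for the regular cached clear (e.g. REP STOSB based). */
static void clear_cached(void *dst, size_t len)
{
	memset(dst, 0, len);
}

/* Stand-in for the MOVNTI-based clear; a real one would use NT stores. */
static void clear_uncached(void *dst, size_t len)
{
	uncached_calls++;
	memset(dst, 0, len);
}

/* Hypothetical model of a CPU with or without the NT_GOOD bit set. */
struct cpu_model {
	bool nt_good;
};

/* Use the uncached variant only where MOVNTI is known to perform well. */
static clear_fn select_clear_uncached(const struct cpu_model *cpu)
{
	assert(cpu != NULL);
	return cpu->nt_good ? clear_uncached : clear_cached;
}
```

On a model without the bit set, callers of the uncached interface
transparently get the cached clear.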

Please review.

Thanks
Ankur

Ankur Arora (8):
  x86/cpuid: add X86_FEATURE_NT_GOOD
  x86/asm: add memset_movnti()
  perf bench: add memset_movnti()
  x86/asm: add clear_page_nt()
  x86/clear_page: add clear_page_uncached()
  mm, clear_huge_page: use clear_page_uncached() for gigantic pages
  x86/cpu/intel: enable X86_FEATURE_NT_GOOD on Intel Broadwellx
  x86/cpu/amd: enable X86_FEATURE_NT_GOOD on AMD Zen

 arch/x86/include/asm/cpufeatures.h           |  1 +
 arch/x86/include/asm/page.h                  |  6 +++
 arch/x86/include/asm/page_32.h               |  9 ++++
 arch/x86/include/asm/page_64.h               | 15 ++++++
 arch/x86/kernel/cpu/amd.c                    |  3 ++
 arch/x86/kernel/cpu/intel.c                  |  2 +
 arch/x86/lib/clear_page_64.S                 | 26 +++++++++++
 arch/x86/lib/memset_64.S                     | 68 ++++++++++++++++------------
 include/asm-generic/page.h                   |  3 ++
 include/linux/highmem.h                      | 10 ++++
 mm/memory.c                                  |  3 +-
 tools/arch/x86/lib/memset_64.S               | 68 ++++++++++++++++------------
 tools/perf/bench/mem-memset-x86-64-asm-def.h |  6 ++-
 13 files changed, 158 insertions(+), 62 deletions(-)

--
2.9.3