[PATCH 0/9] x86/clear_huge_page: multi-page clearing

From: Ankur Arora
Date: Mon Apr 03 2023 - 01:23:58 EST


This series introduces multi-page clearing for hugepages.

This is a follow-up to some of the ideas discussed at:
https://lore.kernel.org/lkml/CAHk-=wj9En-BC4t7J9xFZOws5ShwaR9yor7FxHZr8CTVyEP_+Q@xxxxxxxxxxxxxx/

On x86, page clearing is typically done via string instructions. These,
unlike a MOV loop, allow us to explicitly advertise the region size to
the processor, which could serve as a hint to current (and/or
future) uarchs to elide cacheline allocation.

Among current-generation processors, Milan (and presumably other Zen
variants) uses the hint to elide cacheline allocation when the
region size exceeds the LLC size.

An additional reason for doing this is that string instructions are
typically microcoded, so clearing in bigger chunks than the current
page-at-a-time logic amortizes some of that cost.
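
As a rough user-space illustration of the difference (not the code in
this series), the sketch below clears a multi-page region with a single
REP STOSQ, so the full length is visible to the processor up front, and
contrasts it with a page-at-a-time loop. The names clear_region_string()
and clear_region_paged() are made up for this example, the 4KB
MODEL_PAGE_SIZE is an assumption, and non-x86-64 builds fall back to
memset():

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MODEL_PAGE_SIZE 4096UL	/* assumption: 4KB base pages */

/*
 * Clear len bytes (a multiple of 8) at addr with one string instruction.
 * The whole region size is passed in RCX, so the CPU sees the extent up
 * front. Hypothetical helper, not the kernel's clear_pages().
 */
static void clear_region_string(void *addr, size_t len)
{
#if defined(__x86_64__)
	size_t qwords = len / 8;

	__asm__ volatile("rep stosq"
			 : "+D" (addr), "+c" (qwords)
			 : "a" (0UL)
			 : "memory");
#else
	memset(addr, 0, len);	/* portable fallback for other architectures */
#endif
}

/* Page-at-a-time variant: each call only advertises one page of length. */
static void clear_region_paged(void *addr, size_t npages)
{
	unsigned char *p = addr;
	size_t i;

	for (i = 0; i < npages; i++)
		clear_region_string(p + i * MODEL_PAGE_SIZE, MODEL_PAGE_SIZE);
}
```

Both variants leave the region zeroed; the only difference is how much
of the region each string instruction gets to see.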

All uarchs tested (Milan, Icelakex, Skylakex) showed improved performance.

There are, however, some problems:

1. Extended zeroing periods mean increased latency due to the
now-missing preemption points.

That's handled in patches 7, 8, 9:
"sched: define TIF_ALLOW_RESCHED"
"irqentry: define irqentry_exit_allow_resched()"
"x86/clear_huge_page: make clear_contig_region() preemptible"
by the context marking itself reschedulable, and rescheduling in
irqexit context if needed (for PREEMPTION_NONE/_VOLUNTARY.)
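
The control flow described above can be modelled in a few lines of
user-space C. This is only a toy model: the flag names echo the patch
titles, but the struct, the helper names and the decision function are
assumptions made for illustration, not the code in patches 7-9.

```c
#include <stdbool.h>

/* Toy per-task flags; bit positions are arbitrary for this model. */
#define MODEL_TIF_NEED_RESCHED	(1u << 0)
#define MODEL_TIF_ALLOW_RESCHED	(1u << 1)

/* Hypothetical stand-in for the per-task state. */
struct model_task {
	unsigned int flags;
};

/* The long-running clearing context brackets itself with these two. */
static void model_allow_resched(struct model_task *t)
{
	t->flags |= MODEL_TIF_ALLOW_RESCHED;
}

static void model_disallow_resched(struct model_task *t)
{
	t->flags &= ~MODEL_TIF_ALLOW_RESCHED;
}

/*
 * Decision taken when an interrupt returns to kernel context on a
 * kernel without full preemption: reschedule only if a reschedule is
 * pending and the interrupted context marked itself reschedulable.
 */
static bool model_irqexit_should_resched(const struct model_task *t)
{
	return (t->flags & MODEL_TIF_NEED_RESCHED) &&
	       (t->flags & MODEL_TIF_ALLOW_RESCHED);
}
```

In this model a pending reschedule is ignored at interrupt exit unless
the interrupted context has called model_allow_resched() first.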

2. The current page-at-a-time clearing logic does left-right narrowing
towards the faulting page, which maintains cache locality for
workloads with a sequential access pattern. Clearing in large chunks
loses that.

Some (but not all) of that could be ameliorated by something like
this patch:
https://lore.kernel.org/lkml/20220606203725.1313715-1-ankur.a.arora@xxxxxxxxxx/

But before doing that, I'd like some comments on whether it is
worth doing for this specific use case.
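
For reference, the left-right narrowing order mentioned in problem 2
can be modelled as below. This is a simplified user-space sketch that
is intended to mirror the shape of the current page-at-a-time ordering;
the function name narrowing_order() and its index-based interface are
hypothetical, and it should be read as an approximation rather than the
kernel code.

```c
#include <stddef.h>

/*
 * Fill order[] with the sequence in which the subpages of a huge page
 * are processed, so that the faulting subpage (target) comes last and
 * its neighbours come just before it. Returns the number of entries.
 * Requires target < nr_pages. Hypothetical user-space model.
 */
static size_t narrowing_order(size_t nr_pages, size_t target, size_t *order)
{
	size_t base, l, i, n = 0;

	if (2 * target <= nr_pages) {
		/* Target in the first half: far tail first, back to front. */
		base = 0;
		l = target;
		for (i = nr_pages; i > 2 * target; i--)
			order[n++] = i - 1;
	} else {
		/* Target in the second half: far head first. */
		base = nr_pages - 2 * (nr_pages - target);
		l = nr_pages - target;
		for (i = 0; i < base; i++)
			order[n++] = i;
	}

	/* Alternate left and right, converging on the target. */
	for (i = 0; i < l; i++) {
		order[n++] = base + i;
		order[n++] = base + 2 * l - 1 - i;
	}
	return n;
}
```

With 8 subpages and the fault in subpage 2, this model yields the order
7, 6, 5, 4, 0, 3, 1, 2: the lines around the faulting address are the
most recently touched when the fault returns. Clearing the whole region
in one chunk has no equivalent of this ordering.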

Rest of the series:
Patches 1, 2, 3:
"huge_pages: get rid of process_huge_page()"
"huge_page: get rid of {clear,copy}_subpage()"
"huge_page: allow arch override for clear/copy_huge_page()"
are mechanical and they simplify some of the current clear_huge_page()
logic.

Patches 4, 5:
"x86/clear_page: parameterize clear_page*() to specify length"
"x86/clear_pages: add clear_pages()"

add clear_pages() and helpers.

Patch 6: "mm/clear_huge_page: use multi-page clearing" adds the
chunked x86 clear_huge_page() implementation.
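
The general shape of chunked clearing can be sketched in user-space C
as below. This is a sketch under assumptions, not the implementation in
patch 6: clear_huge_region_chunked(), the chunk_pages parameter, the
4KB MODEL_PAGE_SIZE and the use of memset() as a stand-in for the arch
clearing primitive are all made up for illustration.

```c
#include <stddef.h>
#include <string.h>

#define MODEL_PAGE_SIZE 4096UL	/* assumption: 4KB base pages */

/*
 * Clear npages base pages starting at addr, at most chunk_pages at a
 * time. Returns how many clearing calls were issued, which makes the
 * amortization visible: fewer, larger calls than a page-at-a-time loop.
 * Hypothetical helper for illustration only.
 */
static size_t clear_huge_region_chunked(void *addr, size_t npages,
					size_t chunk_pages)
{
	unsigned char *p = addr;
	size_t calls = 0;

	if (chunk_pages == 0)
		chunk_pages = 1;	/* degrade to page-at-a-time */

	while (npages) {
		size_t n = npages < chunk_pages ? npages : chunk_pages;

		/* Stand-in for a multi-page clearing primitive. */
		memset(p, 0, n * MODEL_PAGE_SIZE);
		p += n * MODEL_PAGE_SIZE;
		npages -= n;
		calls++;
	}
	return calls;
}
```

With chunk_pages equal to the number of base pages in the hugepage the
whole region is cleared in one call; with chunk_pages set to 1 the
sketch behaves like the page-at-a-time loop.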


Performance
==

Demand fault performance gets a decent boost:

*Icelakex*      mm/clear_huge_page   x86/clear_huge_page   change
                      (GB/s)               (GB/s)

pg-sz=2MB               8.76                11.82           +34.93%
pg-sz=1GB               8.99                12.18           +35.48%


*Milan*         mm/clear_huge_page   x86/clear_huge_page   change
                      (GB/s)               (GB/s)

pg-sz=2MB              12.24                17.54           +43.30%
pg-sz=1GB              17.98                37.24          +107.11%


vm-scalability/case-anon-w-seq-hugetlb gains in stime, but performs
worse (higher utime) when user space touches those pages:

*Icelakex*      mm/clear_huge_page   x86/clear_huge_page   change
(mem=4GB/task, tasks=128)

stime           293.02 +- .49%       239.39 +- .83%        -18.30%
utime           440.11 +- .28%       508.74 +- .60%        +15.59%
wall-clock        5.96 +- .33%         6.27 +-2.23%        + 5.20%


*Milan*         mm/clear_huge_page   x86/clear_huge_page   change
(mem=1GB/task, tasks=512)

stime           490.95 +- 3.55%      466.90 +- 4.79%       - 4.89%
utime           276.43 +- 2.85%      311.97 +- 5.15%       +12.85%
wall-clock        3.74 +- 6.41%        3.58 +- 7.82%       - 4.27%

Also at:
github.com/terminus/linux clear-pages.v1

Comments appreciated!

Ankur Arora (9):
  huge_pages: get rid of process_huge_page()
  huge_page: get rid of {clear,copy}_subpage()
  huge_page: allow arch override for clear/copy_huge_page()
  x86/clear_page: parameterize clear_page*() to specify length
  x86/clear_pages: add clear_pages()
  mm/clear_huge_page: use multi-page clearing
  sched: define TIF_ALLOW_RESCHED
  irqentry: define irqentry_exit_allow_resched()
  x86/clear_huge_page: make clear_contig_region() preemptible

 arch/x86/include/asm/page.h        |   6 +
 arch/x86/include/asm/page_32.h     |   6 +
 arch/x86/include/asm/page_64.h     |  25 +++--
 arch/x86/include/asm/thread_info.h |   2 +
 arch/x86/lib/clear_page_64.S       |  45 ++++++--
 arch/x86/mm/hugetlbpage.c          |  59 ++++++++++
 include/linux/sched.h              |  29 +++++
 kernel/entry/common.c              |   8 ++
 kernel/sched/core.c                |  36 +++---
 mm/memory.c                        | 174 +++++++++++++++--------------
 10 files changed, 270 insertions(+), 120 deletions(-)

--
2.31.1