[RFC PATCH V0 0/10] mm: slowtier page promotion based on PTE A bit
From: Raghavendra K T
Date: Sun Dec 01 2024 - 10:39:01 EST
Introduction:
=============
This patchset is an outcome of an ongoing collaboration between AMD and Meta.
Meta wanted to explore an alternative page promotion technique as they
observe high latency spikes in their workloads that access CXL memory.
In the current hot page promotion, all the activities, including
process address space scanning, NUMA hint fault handling and page
migration, are performed in process context, i.e., the scanning overhead
is borne by applications.
This is an early RFC patch series to do (slow tier) CXL page promotion.
The approach in this patchset assists/addresses the issue by adding PTE
Accessed bit scanning.
Scanning is done by a global kernel thread which routinely scans all
the processes' address spaces and checks for accesses by reading the
PTE A bit. It then migrates/promotes the pages to the toptier node
(node 0 in the current approach).
Thus, the approach moves the overhead of scanning, NUMA hint faults and
migrations out of process context.
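The scan-and-promote flow described above can be sketched as a toy userspace model. This is only an illustration, not the kmmscand code: `Page`, `AddressSpace`, `scan_once` and `promote` are hypothetical names, the boolean `accessed` field stands in for the PTE A bit, and clearing the bit on every scanned page is an assumption about how the scan re-arms itself.

```python
from dataclasses import dataclass, field

TOPTIER_NODE = 0  # the cover letter promotes to node 0 in the current approach


@dataclass
class Page:
    node: int               # NUMA node currently backing the page
    accessed: bool = False  # stand-in for the PTE Accessed (A) bit


@dataclass
class AddressSpace:
    pages: list = field(default_factory=list)


def scan_once(address_spaces):
    """One daemon pass: collect accessed slow-tier pages, then clear the A bit."""
    migration_list = []
    for mm in address_spaces:
        for page in mm.pages:
            if page.accessed and page.node != TOPTIER_NODE:
                migration_list.append(page)
            # Assumption: clear the bit so the next pass only sees new accesses.
            page.accessed = False
    return migration_list


def promote(migration_list):
    """Move every page on the list to the toptier node; return the count."""
    for page in migration_list:
        page.node = TOPTIER_NODE
    return len(migration_list)


# Two pages on the slow tier (node 1), one already on node 0.
mm = AddressSpace([Page(node=1), Page(node=1), Page(node=0)])
mm.pages[0].accessed = True  # the application touches the first page
promoted = promote(scan_once([mm]))
```

The point of the model is that the application only sets the bit by touching memory; all scanning and migration work happens in the daemon's loop.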
Initial results show promising numbers on a microbenchmark.
Experiment:
============
Abench microbenchmark:
- Allocates 8GB/32GB of memory on the CXL node.
- Creates 64 threads, each of which randomly accesses pages at 4K
granularity.
- Runs 512 iterations with a delay of 1 us between two successive iterations.
SUT: 512 CPU, 2 node 256GB, AMD EPYC.
3 runs, command: abench -m 2 -d 1 -i 512 -s <size>
The benchmark reports the time taken to complete the task; lower is better.
The expectation is that CXL node memory is migrated as fast as
possible.
Base case: 6.11-rc6 w/ numab mode = 2 (hot page promotion is enabled).
Patched case: 6.11-rc6 w/ numab mode = 0 (NUMA balancing is disabled);
we expect the daemon to do page promotion.
Result [*]:
========
        base                    patched
        time in sec (%stdev)    time in sec (%stdev)    %gain
8GB     133.66 ( 0.38 )         113.77 ( 1.83 )         14.88
32GB    584.77 ( 0.19 )         542.79 ( 0.11 )          7.17
[*] Please note current patchset applies on 6.13-rc, but these results
are old because latest kernel has issues in populating CXL node memory.
Emailing findings/fix on that soon.
Overhead:
The times below were measured using patch 10. The actual overhead for the
patched case may be even lower.
(scan + migration) time in sec
Total memory    base kernel    patched kernel    %gain
8GB             65.743         13.93             78.8114324
32GB            153.95         132.12            14.17992855

Breakup for 8GB          base      patched
numa_task_work_oh        0.883     0
numa_hf_migration_oh     64.86     0
kmmscand_scan_oh         0         2.74
kmmscand_migration_oh    0         11.19

Breakup for 32GB         base      patched
numa_task_work_oh        4.79      0
numa_hf_migration_oh     149.16    0
kmmscand_scan_oh         0         23.4
kmmscand_migration_oh    0         108.72
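As a sanity check on the overhead figures, the short sketch below sums the per-component numbers and recomputes %gain. The formula (base - patched) / base * 100 is an assumption, since the cover letter does not state it, but it reproduces the reported totals and gains. The names `pct_gain`, `overhead`, `totals` and `gains` are introduced here for illustration only.

```python
def pct_gain(base, patched):
    """Relative reduction of patched versus base, in percent (assumed formula)."""
    return (base - patched) / base * 100.0


# Per-component overhead in seconds, copied from the breakup tables.
# base    = (numa_task_work_oh, numa_hf_migration_oh)
# patched = (kmmscand_scan_oh, kmmscand_migration_oh)
overhead = {
    "8GB": {"base": (0.883, 64.86), "patched": (2.74, 11.19)},
    "32GB": {"base": (4.79, 149.16), "patched": (23.4, 108.72)},
}

# Total (scan + migration) time per kernel and memory size.
totals = {
    size: {kernel: sum(parts) for kernel, parts in kernels.items()}
    for size, kernels in overhead.items()
}

# Percentage gain of the patched kernel over the base kernel.
gains = {size: pct_gain(t["base"], t["patched"]) for size, t in totals.items()}

for size in overhead:
    print(size, round(totals[size]["base"], 3),
          round(totals[size]["patched"], 3), round(gains[size], 2))
```

In both configurations migration time dominates the scan time, which is consistent with migration throttling being listed among the follow-up items.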
Limitations:
===========
The PTE A bit scanning approach lacks information about the exact
destination node to migrate to.
Notes/Observations on design/Implementations/Alternatives/TODOs...
================================
1. Fine-tuning scan throttling
2. Use migrate_balanced_pgdat() to balance toptier node before migration
OR Use migrate_misplaced_folio_prepare() directly.
But it may need some optimizations (e.g., invoke it only occasionally so
that the overhead is not incurred for every migration).
3. Explore whether a separate PAGE_EXT flag is needed instead of reusing
the PAGE_IDLE flag (cons: complicates PTE A bit handling in the system).
In practice this does not look like a good idea.
4. Use timestamp information-based migration (similar to numab mode=2)
instead of migrating immediately when the PTE A bit is set.
(cons:
- It will not be accurate since it is done outside of process
context.
- Performance benefit may be lost.)
5. Explore whether we need to use PFN information + a hash list instead of
a simple migration list. Here scanning is done directly on the PFNs
belonging to the CXL node.
6. Holding PTE lock before migration.
7. Solve: how to find target toptier node for migration.
8. Using DAMON APIs OR Reusing part of DAMON which already tracks range of
physical addresses accessed.
9. Gregory has nicely mentioned some details/ideas on different approaches in
[1] : development notes, in the context of promoting unmapped page cache folios.
10. SJ had pointed out concerns about kernel-thread based approaches such as
kstaled [2]. The current patchset therefore tries to address the issue with
simple algorithms to reduce CPU overhead. Migration throttling, running the
daemon at nice priority, and parallelizing migration with scanning could help
further.
11. Toptier pages scanned can be used to assist current NUMAB by providing information
on hot VMAs.
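To make item 4 concrete, here is one possible reading of timestamp-based promotion, sketched as a toy model. Nothing in it comes from the patches: `TrackedPage`, `scan_page`, `cleared_at` and `hot_threshold` are hypothetical names, and comparing the time since the A bit was last cleared against a threshold is an assumption loosely modeled on how a hint-fault latency is compared against a threshold.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TrackedPage:
    accessed: bool = False              # stand-in for the PTE A bit
    cleared_at: Optional[float] = None  # when the scanner last cleared the bit


def scan_page(page, now, hot_threshold):
    """Return True if the page should be promoted on this scan.

    The page counts as hot only if it was re-accessed within hot_threshold
    seconds of the previous clear. The scanner cannot see when the access
    happened, only that it happened before `now`, so the measured latency
    is an upper bound limited by the scan interval.
    """
    promote = (
        page.accessed
        and page.cleared_at is not None
        and now - page.cleared_at <= hot_threshold
    )
    page.accessed = False
    page.cleared_at = now
    return promote


page = TrackedPage()
first = scan_page(page, now=0.0, hot_threshold=1.0)   # not accessed yet

page.accessed = True
second = scan_page(page, now=0.5, hot_threshold=1.0)  # re-accessed within 0.5 s

page.accessed = True
third = scan_page(page, now=5.0, hot_threshold=1.0)   # 4.5 s since last clear
```

The third call shows the accuracy concern listed under the cons: a page touched immediately after the clear still looks cold if the next scan arrives late.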
Credits
=======
Thanks to Bharata, Johannes, Gregory, SJ, Chris for their valuable comments and
support.
The kernel thread skeleton and some parts of the code are heavily inspired by
the khugepaged implementation and parts of the IBS patches from Bharata [3].
Looking forward to your comments on whether the current approach in this
*early* RFC looks promising, or whether there are alternative ideas.
Links:
[1] https://lore.kernel.org/lkml/20241127082201.1276-1-gourry@xxxxxxxxxx/
[2] kstaled: https://lore.kernel.org/lkml/1317170947-17074-3-git-send-email-walken@xxxxxxxxxx/#r
[3] https://lore.kernel.org/lkml/Y+Pj+9bbBbHpf6xM@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx/
I might have unintentionally CCed more or fewer people than needed.
Raghavendra K T (10):
mm: Add kmmscand kernel daemon
mm: Maintain mm_struct list in the system
mm: Scan the mm and create a migration list
mm/migration: Migrate accessed folios to toptier node
mm: Add throttling of mm scanning using scan_period
mm: Add throttling of mm scanning using scan_size
sysfs: Add sysfs support to tune scanning
vmstat: Add vmstat counters
trace/kmmscand: Add tracing of scanning and migration
kmmscand: Add scanning
fs/exec.c | 4 +
include/linux/kmmscand.h | 30 +
include/linux/mm.h | 14 +
include/linux/mm_types.h | 4 +
include/linux/vm_event_item.h | 14 +
include/trace/events/kmem.h | 99 +++
kernel/fork.c | 4 +
kernel/sched/fair.c | 13 +-
mm/Kconfig | 7 +
mm/Makefile | 1 +
mm/huge_memory.c | 1 +
mm/kmmscand.c | 1144 +++++++++++++++++++++++++++++++++
mm/memory.c | 12 +-
mm/vmstat.c | 14 +
14 files changed, 1352 insertions(+), 9 deletions(-)
create mode 100644 include/linux/kmmscand.h
create mode 100644 mm/kmmscand.c
base-commit: bcc8eda6d34934d80b96adb8dc4ff5dfc632a53a
--
2.39.3