Re: [RFC PATCH V1 00/13] mm: slowtier page promotion based on PTE A bit

From: Davidlohr Bueso
Date: Wed Mar 19 2025 - 19:11:05 EST

Next message: Rob Herring (Arm): "Re: [PATCH v5 0/2] Add support for Xiaomi Mi TV Stick"
Previous message: Nico Pache: "[PATCH] kunit: cs_dsp: Depend on FW_CS_DSP rather then enabling it"
In reply to: Jonathan Cameron: "Re: [RFC PATCH V1 09/13] mm: Add heuristic to calculate target node"
Next in thread: Raghavendra K T: "Re: [RFC PATCH V1 00/13] mm: slowtier page promotion based on PTE A bit"

On Wed, 19 Mar 2025, Raghavendra K T wrote:

Introduction:
=============
In the current hot page promotion, all activities, including process
address space scanning, NUMA hint fault handling and page migration,
are performed in process context, i.e., the scanning overhead is
borne by applications.

This is the RFC V1 patch series for (slow tier) CXL page promotion.
The approach in this patchset addresses the issue by adding PTE
Accessed bit scanning.

Scanning is done by a global kernel thread which routinely scans all
the processes' address spaces and checks for accesses by reading the
PTE A bit.

A separate migration thread migrates/promotes the pages to the toptier
node based on a simple heuristic that uses toptier scan/access information
of the mm.
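
The split between the scanner and the migration thread described above can be pictured with a small userspace toy model. This is only a sketch, not the kmmscand code: the Page class, the scan() and migrate() helpers, and the node numbers are all made up for illustration, and the real series walks page tables inside the kernel.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Page:
    """Toy stand-in for a page mapped by a PTE (not a kernel structure)."""
    pfn: int
    node: int               # NUMA node currently backing the page
    accessed: bool = False  # models the PTE Accessed (A) bit


def scan(pages, slow_node, migration_list):
    """Scanner pass: test-and-clear the A bit, queue accessed slow-tier pages."""
    for page in pages:
        if page.accessed:
            page.accessed = False  # clear so the next pass only sees new accesses
            if page.node == slow_node:
                migration_list.append(page)


def migrate(migration_list, target_node):
    """Migrator pass: drain the queue and 'promote' each page to target_node."""
    promoted = 0
    while migration_list:
        page = migration_list.popleft()
        page.node = target_node
        promoted += 1
    return promoted


# Four pages on the slow node (node 1); pages 0 and 2 were touched.
pages = [Page(pfn=i, node=1) for i in range(4)]
pages[0].accessed = True
pages[2].accessed = True

queue = deque()
scan(pages, slow_node=1, migration_list=queue)
promoted = migrate(queue, target_node=0)
print(promoted, [p.node for p in pages])
```

Keeping the queue between the two passes is what lets scanning and migration run at different rates in this model; it says nothing about how the real lists are locked or bounded.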

Additionally, based on the feedback for RFC V0 [4], a prctl knob with
a scalar value is provided to control per-task scanning.

Initial results show promising numbers on a microbenchmark. Numbers
with real benchmarks, along with findings (tunings), will follow soon.

Experiment:
============
Abench microbenchmark,
- Allocates 8GB/16GB/32GB/64GB of memory on CXL node
- 64 threads created, and each thread randomly accesses pages in 4K
granularity.
- 512 iterations with a delay of 1 us between two successive iterations.

SUT: 512 CPU, 2 node 256GB, AMD EPYC.

3 runs, command: abench -m 2 -d 1 -i 512 -s <size>

Measures the time taken to complete the task; lower is better.
The expectation is that CXL node memory is migrated as fast as
possible.

Base case: 6.14-rc6 w/ numab mode = 2 (hot page promotion is enabled).
Patched case: 6.14-rc6 w/ numab mode = 1 (numa balancing is enabled);
we expect the daemon to do page promotion.

Result:
========
          base NUMAB2              patched NUMAB1
          time in sec (%stdev)     time in sec (%stdev)     %gain
8GB        134.33 ( 0.19 )          120.52 ( 0.21 )         10.28
16GB       292.24 ( 0.60 )          275.97 ( 0.18 )          5.56
32GB       585.06 ( 0.24 )          546.49 ( 0.35 )          6.59
64GB      1278.98 ( 0.27 )         1205.20 ( 2.29 )          5.76

Base case: 6.14-rc6 w/ numab mode = 1 (numa balancing is enabled).
Patched case: 6.14-rc6 w/ numab mode = 1 (numa balancing is enabled).

          base NUMAB1              patched NUMAB1
          time in sec (%stdev)     time in sec (%stdev)     %gain
8GB        186.71 ( 0.99 )          120.52 ( 0.21 )         35.45
16GB       376.09 ( 0.46 )          275.97 ( 0.18 )         26.62
32GB       744.37 ( 0.71 )          546.49 ( 0.35 )         26.58
64GB      1534.49 ( 0.09 )         1205.20 ( 2.29 )         21.45
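
For reference, the %gain column appears to be the relative reduction in run time, (base - patched) / base * 100. That formula is an assumption inferred from the posted numbers, not something stated in the cover letter. A short sketch that recomputes it from the two tables (the function and dictionary names are made up here):

```python
def gain_percent(base_sec, patched_sec):
    """Relative run-time reduction of the patched kernel, in percent."""
    return (base_sec - patched_sec) / base_sec * 100.0


# (base, patched) run times in seconds, copied from the tables above.
vs_numab2 = {
    "8GB": (134.33, 120.52),
    "16GB": (292.24, 275.97),
    "32GB": (585.06, 546.49),
    "64GB": (1278.98, 1205.20),
}
vs_numab1 = {
    "8GB": (186.71, 120.52),
    "16GB": (376.09, 275.97),
    "32GB": (744.37, 546.49),
    "64GB": (1534.49, 1205.20),
}

for label, table in (("vs NUMAB2", vs_numab2), ("vs NUMAB1", vs_numab1)):
    for size, (base, patched) in table.items():
        print(f"{label} {size}: {gain_percent(base, patched):.2f}")
```

The recomputed values agree with the posted column to within about 0.01, which suggests the posted figures were truncated rather than rounded.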

Very promising, but a few things. A fairer comparison would be
vs kpromoted using the PROT_NONE of NUMAB2. That is, disregard
the asynchronous migration and effectively measure synchronous
vs asynchronous scanning overhead and the implied semantics. This
would save the extra kthread and leave only a per-NUMA node migrator,
which is the common denominator for all these sources of hotness.

Similarly, while I don't see any users disabling NUMAB1 _and_ enabling
this sort of thing, it would be useful to have data on no numa balancing
at all. If nothing else, that would measure the effects of the dest
node heuristics.

Also, data/workload involving demotion would be good to have for
a more complete picture.

Major Changes since V0:
======================
- A separate migration thread is used for migration, thus alleviating the need for
multi-threaded scanning (at least as per tracing).

- A simple heuristic for target node calculation is added.

- A prctl interface (David R) with a scalar value is added to control per-task scanning.

- Steve's comment on tracing incorporated.

- Fix for the bug reported by Davidlohr.

- Initial scan delay similar to NUMAB1 mode added.

- Got rid of migration lock during mm_walk.

PS: Occasionally I do see that if scanning is too fast compared to migration,
scanning can stall waiting for the lock. This should be fixed in the next
version by using memslot for migration.

Disclaimer, Takeaways and discussion points and future TODOs
==============================================================
1) Source code and patch segregation are still to be improved; the current
patchset only provides a skeleton.

2) Unification of sources of hotness is not easy (as mentioned perhaps by Jonathan)
but perhaps all the consumers/producers can work cooperatively.

Scanning:
3) Major positive: Current patchset is able to cover all the process address
space scanning effectively with simple algorithms to tune scan_size and scan_period.

4) Effective tracking of folios or address space using (or borrowing ideas from) DAMON
is yet to be explored fully.

5) Use timestamp information-based migration (similar to numab mode=2)
instead of migrating immediately when the PTE A bit is set.
(cons:
- It will not be accurate since it is done outside of process
context.
- Performance benefit may be lost.)
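
To make the trade-off in point 5 concrete, here is a minimal sketch of a timestamp gate. It is purely hypothetical: should_promote(), its parameters and the 1000 ms threshold are invented for illustration and are not taken from the patchset or from NUMAB2.

```python
def should_promote(cleared_at_ms, seen_set_at_ms, threshold_ms):
    """Gate promotion on how soon the A bit was seen set after being cleared.

    The scanner only samples the bit when it walks the PTE, so the
    measured interval is an upper bound on the real re-access time.
    """
    return (seen_set_at_ms - cleared_at_ms) <= threshold_ms


# Hypothetical 1000 ms threshold: a page seen set 400 ms after the clear
# qualifies, one seen set 3000 ms after the clear does not.
hot = should_promote(cleared_at_ms=0, seen_set_at_ms=400, threshold_ms=1000)
cold = should_promote(cleared_at_ms=0, seen_set_at_ms=3000, threshold_ms=1000)
print(hot, cold)
```

Because the interval is bounded below by the scan period, a gate like this cannot resolve accesses finer than one scan pass, which is one way to read the accuracy concern listed under the cons.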

Migration:

6) Currently a fast scanner can bombard the migration list; the migration list
needs to be maintained in a more organized way (e.g. using memslot, so that it
also helps in maintaining recency and frequency information, similar to
kpromoted posted by Bharata).

7) NUMAB2 throttling is very effective; we would need a common interface to
control migration and also to exploit batch migration.

Does NUMAB2 continue to exist? Are there any benefits in having two sources?

Thanks,
Davidlohr

Thanks to Bharata, Joannes, Gregory, SJ, Chris, David Rientjes, Jonathan, John Hubbard,
Davidlohr, Ying, Willy, Hyeonggon Yoo and many of you for your valuable comments and support.

Links:
[1] https://lore.kernel.org/lkml/20241127082201.1276-1-gourry@xxxxxxxxxx/
[2] kstaled: https://lore.kernel.org/lkml/1317170947-17074-3-git-send-email-walken@xxxxxxxxxx/#r
[3] https://lore.kernel.org/lkml/Y+Pj+9bbBbHpf6xM@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx/
[4] RFC V0: https://lore.kernel.org/all/20241201153818.2633616-1-raghavendra.kt@xxxxxxx/
[5] Recap: https://lore.kernel.org/linux-mm/20241226012833.rmmbkws4wdhzdht6@xxxxxxxx/T/
[6] LSFMM: https://lore.kernel.org/linux-mm/20250123105721.424117-1-raghavendra.kt@xxxxxxx/#r
[7] LSFMM: https://lore.kernel.org/linux-mm/20250131130901.00000dd1@xxxxxxxxxx/

I might have unintentionally CCed more or fewer people than needed.

Patch organization:
patch 1-4: initial skeleton for scanning and migration
patch 5: migration
patch 6-8: scanning optimizations
patch 9: target_node heuristic
patch 10-12: sysfs, vmstat and tracing
patch 13: A basic prctl implementation.

Raghavendra K T (13):
  mm: Add kmmscand kernel daemon
  mm: Maintain mm_struct list in the system
  mm: Scan the mm and create a migration list
  mm: Create a separate kernel thread for migration
  mm/migration: Migrate accessed folios to toptier node
  mm: Add throttling of mm scanning using scan_period
  mm: Add throttling of mm scanning using scan_size
  mm: Add initial scan delay
  mm: Add heuristic to calculate target node
  sysfs: Add sysfs support to tune scanning
  vmstat: Add vmstat counters
  trace/kmmscand: Add tracing of scanning and migration
  prctl: Introduce new prctl to control scanning

 Documentation/filesystems/proc.rst |    2 +
 fs/exec.c                          |    4 +
 fs/proc/task_mmu.c                 |    4 +
 include/linux/kmmscand.h           |   31 +
 include/linux/migrate.h            |    2 +
 include/linux/mm.h                 |   11 +
 include/linux/mm_types.h           |    7 +
 include/linux/vm_event_item.h      |   10 +
 include/trace/events/kmem.h        |   90 ++
 include/uapi/linux/prctl.h         |    7 +
 kernel/fork.c                      |    8 +
 kernel/sys.c                       |   25 +
 mm/Kconfig                         |    8 +
 mm/Makefile                        |    1 +
 mm/kmmscand.c                      | 1515 ++++++++++++++++++++++++++++
 mm/migrate.c                       |    2 +-
 mm/vmstat.c                        |   10 +
 17 files changed, 1736 insertions(+), 1 deletion(-)
 create mode 100644 include/linux/kmmscand.h
 create mode 100644 mm/kmmscand.c

base-commit: b7f94fcf55469ad3ef8a74c35b488dbfa314d1bb
--
2.34.1
