[PATCH RFC V2 0/9] Enhancements to Page Migration with Multi-threading and Batch Offloading to DMA

From: Shivank Garg
Date: Wed Mar 19 2025 - 15:24:57 EST


This patchset enhances page migration by batching folio-copy operations and
performing the copies either with multiple CPU threads or by offloading them
to DMA hardware.

It builds upon Zi's work on accelerating page migration via multi-threading[1]
and my previous work on enhancing page migration with batch offloading via DMA[2].

MOTIVATION:
-----------
Page migration costs have become increasingly critical in modern systems with
memory-tiers and NUMA nodes:

1. Batching folio copies increases throughput, especially for base-page migrations,
where kernel activities (moving folio metadata, updating page table entries) add
overhead between individual copies. This is particularly important for smaller
page sizes (e.g. 4KB on x86_64 and ARM64, or 64KB on ARM64).

2. The current simple serial copy pattern underutilizes modern hardware,
leaving migration bandwidth capped by a single-threaded, CPU-bound copy loop.

These improvements are particularly valuable in:
- Large-scale tiered-memory systems with CXL nodes and HBM
- CPU-GPU coherent systems with GPU memory exposed as NUMA nodes
- Systems where frequent page promotion/demotion occurs

Following the trend of batching operations in the memory migration core path (batch
migration, batch TLB flush), batch copying folio content is the logical next step.
Modern systems equipped with DMA engines, GPUs, and high CPU core counts offer
untapped potential for accelerating these copies.

DESIGN:
-------
The patchset implements three key enhancements:

1. Batching:
- Current approach: process each folio individually
        for_each_folio() {
                Copy folio metadata like flags and mappings
                Copy the folio content from src to dst
                Update page tables with dst folio
        }

- New approach: process in batches
        for_each_folio() {
                Copy folio metadata like flags and mappings
        }
        Batch copy all src folios to dst
        for_each_folio() {
                Update page tables with dst folios
        }

2. Multi-Threading:
- Distribute folio batch-copy operations across multiple CPU threads (a minimal
workqueue-based sketch follows this list).

3. DMA Offload:
- Leverage DMA engines designed for high copy throughput.
- Distribute the folio batch-copy across multiple DMA channels (a dmaengine-based
sketch follows this list).

PERFORMANCE RESULTS:
--------------------
System Info:
Testing environment: AMD Zen 3 EPYC server (2 sockets, 32 cores, SMT enabled),
1 NUMA node per socket, Linux kernel 6.14.0-rc7+, DVFS set to performance,
PTDMA hardware.

Measurement: Throughput (GB/s)


1. Varying folio size with different numbers of parallel threads/channels:

Migrate folios of different sizes (mTHP: 4KB, 16KB, ..., 2MB) such that the total
transfer size is constant (1GB), using different numbers of parallel threads/channels.

a. Multi-Threaded CPU

Folio Size-->
Thread Cnt | 4K | 16K | 32K | 64K | 128K | 256K | 512K | 1M | 2M |
===============================================================================================================
1 | 1.72±0.05| 3.55±0.14| 4.44±0.07| 5.19±0.37| 5.57±0.47| 6.27±0.02 | 6.43±0.09 | 6.59±0.05 | 10.73±0.07|
2 | 1.93±0.06| 3.91±0.24| 5.22±0.03| 5.76±0.62| 7.42±0.16| 7.30±0.93 | 8.08±0.85 | 8.67±0.09 | 17.21±0.28|
4 | 2.00±0.03| 4.30±0.22| 6.02±0.10| 7.61±0.26| 8.60±0.92| 9.54±1.11 | 10.03±1.12| 10.98±0.14| 29.61±0.43|
8 | 2.07±0.08| 4.60±0.32| 6.06±0.85| 7.52±0.96| 7.98±1.83| 8.66±1.94 | 10.99±1.40| 11.22±1.49| 37.42±0.70|
16 | 2.04±0.04| 4.74±0.31| 6.20±0.39| 7.51±0.86| 8.26±1.47| 10.99±0.11| 9.72±1.51 | 12.07±0.02| 37.08±0.53|

b. DMA Offload

Folio Size-->
Channel Cnt| 4K | 16K | 32K | 64K | 128K | 256K | 512K | 1M | 2M |
============================================================================================================
1 | 0.46±0.01| 1.35±0.02| 1.99±0.02| 2.76±0.02| 3.44±0.17| 3.87±0.20| 3.98±0.29| 4.36±0.01| 11.79±0.05|
2 | 0.66±0.02| 1.84±0.07| 2.89±0.10| 4.02±0.30| 4.27±0.53| 5.98±0.05| 6.15±0.50| 5.83±0.64| 13.39±0.08|
4 | 0.91±0.01| 2.62±0.13| 3.98±0.17| 5.57±0.41| 6.55±0.70| 8.32±0.04| 8.91±0.05| 8.82±0.96| 24.52±0.22|
8 | 1.14±0.00| 3.21±0.07| 4.21±1.09| 6.07±0.81| 8.80±0.08| 8.91±1.38|11.03±0.02|10.68±1.38| 39.17±0.58|
16 | 1.19±0.11| 3.33±0.20| 4.98±0.33| 7.65±0.10| 7.85±1.50| 8.38±1.35| 8.94±3.23|12.85±0.06| 55.45±1.20|

Inference:
- Throughput increases with folio size; larger folios benefit more from DMA offload.
- Multi-threading and DMA offloading both provide significant gains.


2. Varying folio count (total transfer size)
2MB folio size, using a single thread/channel

a. Multi-Threaded CPU
Folio Count| GB/s
======================
1 | 7.56±3.23
8 | 9.54±1.34
64 | 9.57±0.39
256 | 10.09±0.17
512 | 10.61±0.17
1024 | 10.77±0.07
2048 | 10.81±0.08
8192 | 10.84±0.05

b. DMA offload
Folio Count| GB/s
======================
1 | 8.21±3.68
8 | 9.92±2.12
64 | 9.90±0.31
256 | 11.51±0.32
512 | 11.67±0.11
1024 | 11.89±0.06
2048 | 11.92±0.08
8192 | 12.03±0.05

Inference:
- Throughput increases with folio count but plateaus after a threshold.
(The migrate_pages() function uses a folio batch size of 512.)

3. CPU thread scheduling
Analyze the effect of CPU topology on multi-threaded copy throughput.

a. Spread threads across different CCDs
Threads | GB/s
========================
1 | 10.60±0.06
2 | 17.21±0.12
4 | 29.94±0.16
8 | 37.07±1.62
16 | 36.19±0.97

b. Fill one CCD completely before moving to the next
Threads | GB/s
========================
1 | 10.44±0.47
2 | 10.93±0.11
4 | 10.99±0.04
8 | 11.08±0.03
16 | 17.91±0.12

Inference:
- Hardware topology matters. On AMD systems, distributing copy threads across
CCDs utilizes the available bandwidth better.

TODOs:
Further experiments could:
- Characterize system behavior and develop heuristics
- Analyze remote/local CPU scheduling impacts
- Measure DMA setup overheads
- Evaluate costs to userspace
- Study cache hotness/pollution effects
- Measure DMA cost under different system I/O loads

[1] https://lore.kernel.org/linux-mm/20250103172419.4148674-1-ziy@xxxxxxxxxx
[2] https://lore.kernel.org/linux-mm/20240614221525.19170-1-shivankg@xxxxxxx
[3] LSFMM Proposal: https://lore.kernel.org/all/cf6fc05d-c0b0-4de3-985e-5403977aa3aa@xxxxxxx

Mike Day (1):
mm: add support for copy offload for folio Migration

Shivank Garg (4):
mm: batch folio copying during migration
mm/migrate: add migrate_folios_batch_move to batch the folio move
operations
dcbm: add dma core batch migrator for batch page offloading
mtcopy: spread threads across die for testing

Zi Yan (4):
mm/migrate: factor out code in move_to_new_folio() and
migrate_folio_move()
mm/migrate: revive MIGRATE_NO_COPY in migrate_mode.
mm/migrate: introduce multi-threaded page copy routine
adjust NR_MAX_BATCHED_MIGRATION for testing

drivers/Kconfig | 2 +
drivers/Makefile | 3 +
drivers/migoffcopy/Kconfig | 17 ++
drivers/migoffcopy/Makefile | 2 +
drivers/migoffcopy/dcbm/Makefile | 1 +
drivers/migoffcopy/dcbm/dcbm.c | 393 ++++++++++++++++++++++++
drivers/migoffcopy/mtcopy/Makefile | 1 +
drivers/migoffcopy/mtcopy/copy_pages.c | 408 +++++++++++++++++++++++++
include/linux/migrate_mode.h | 2 +
include/linux/migrate_offc.h | 36 +++
include/linux/mm.h | 4 +
mm/Kconfig | 8 +
mm/Makefile | 1 +
mm/migrate.c | 351 ++++++++++++++++++---
mm/migrate_offc.c | 51 ++++
mm/util.c | 41 +++
16 files changed, 1275 insertions(+), 46 deletions(-)
create mode 100644 drivers/migoffcopy/Kconfig
create mode 100644 drivers/migoffcopy/Makefile
create mode 100644 drivers/migoffcopy/dcbm/Makefile
create mode 100644 drivers/migoffcopy/dcbm/dcbm.c
create mode 100644 drivers/migoffcopy/mtcopy/Makefile
create mode 100644 drivers/migoffcopy/mtcopy/copy_pages.c
create mode 100644 include/linux/migrate_offc.h
create mode 100644 mm/migrate_offc.c

--
2.34.1