[RFC PATCH 0/5] Enhancements to Page Migration with Batch Offloading via DMA
From: Shivank Garg
Date: Fri Jun 14 2024 - 18:16:18 EST
This series introduces enhancements to the page migration code that
optimize the "folio move" step by batching those operations and enabling
offload to DMA hardware accelerators.
Page migration involves three key steps:
1. Unmap: Allocate dst folios and replace the src folio PTEs with
migration PTEs.
2. TLB Flush: Flush the TLB for all unmapped folios.
3. Move: Copy the page mappings, flags and contents from src to dst;
update metadata, lists and refcounts, and restore working PTEs.
While the first two steps (setting TLB-flush pending for unmapped folios
and batched TLB flushing) have already been optimized with batching, this
series focuses on optimizing the folio move step.
In the current design, the folio move operation is performed sequentially
for each folio:

for_each_folio() {
        Copy folio metadata like flags and mappings
        Copy the folio content from src to dst
        Update PTEs with new mappings
}
In the proposed design, we batch the folio copy operations to leverage DMA
offloading. The updated design is as follows:

for_each_folio() {
        Copy folio metadata like flags and mappings
}
Batch copy the page content from src to dst by offloading to DMA engine
for_each_folio() {
        Update PTEs with new mappings
}
Motivation:
Data copying across NUMA nodes during page migration incurs significant
overhead. For instance, the folio copy can take up to 26.6% of the total
migration cost when migrating 256MB of data.
Modern systems are equipped with powerful DMA engines for bulk data
copying. Utilizing these hardware accelerators will become essential for
large-scale tiered-memory systems with CXL nodes, where frequent page
promotion and demotion occur.
Following the trend of batching operations in the memory migration core
path (like batch migration and batch TLB flush), batch copying folio data
is a logical progression in this direction.
We conducted experiments to measure folio copy overheads for page
migration from a remote node to a local NUMA node, modeling page
promotions for different workload sizes (4KB, 2MB, 256MB and 1GB).
Setup Information: AMD Zen 3 EPYC server (2-sockets, 32 cores, SMT
Enabled), 1 NUMA node connected to each socket.
Linux Kernel 6.8.0, DVFS set to Performance, and cpuinfo_cur_freq: 2 GHz.
THP, compaction, and numa_balancing are disabled to reduce interference.
migrate_pages() {        <- t1
        ..
                         <- t2
        folio_copy()
                         <- t3
        ..
}                        <- t4

Overhead fraction, F = (t3 - t2) / (t4 - t1)
Measurement: mean ± SD, reported in CPU cycles per page.
Results with generic (unpatched) kernel:
4KB:: migrate_pages:17799.00±4278.25 folio_copy:794±232.87 F:0.0478±0.0199
2MB:: migrate_pages:3478.42±94.93 folio_copy:493.84±28.21 F:0.1418±0.0050
256MB:: migrate_pages:3668.56±158.47 folio_copy:815.40±171.76 F:0.2206±0.0371
1GB:: migrate_pages:3769.98±55.79 folio_copy:804.68±60.07 F:0.2132±0.0134
Results with patched kernel:
1. Offload disabled - folios batch-move using CPU
4KB:: migrate_pages:14941.60±2556.53 folio_copy:799.60±211.66 F:0.0554±0.0190
2MB:: migrate_pages:3448.44±83.74 folio_copy:533.34±37.81 F:0.1545±0.0085
256MB:: migrate_pages:3723.56±132.93 folio_copy:907.64±132.63 F:0.2427±0.0270
1GB:: migrate_pages:3788.20±46.65 folio_copy:888.46±49.50 F:0.2344±0.0107
2. Offload enabled - folios batch-move using DMAengine
4KB:: migrate_pages:46739.80±4827.15 folio_copy:32222.40±3543.42 F:0.6904±0.0423
2MB:: migrate_pages:13798.10±205.33 folio_copy:10971.60±202.50 F:0.7951±0.0033
256MB:: migrate_pages:13217.20±163.99 folio_copy:10431.20±167.25 F:0.7891±0.0029
1GB:: migrate_pages:13309.70±113.93 folio_copy:10410.00±117.77 F:0.7821±0.0023
Discussion:
The DMA engine achieved a net throughput of 768MB/s. Additional
optimizations are needed to make DMA offloading beneficial compared to
CPU-based migration; these may include parallelism, specialized DMA
hardware, and asynchronous or speculative data migration.
Status:
The current patchset is functional except for non-LRU folios.
Dependencies:
1. This series is based on Linux-v6.8.
2. Patches 1-3 contain preparatory work and the implementation for
batching the folio move. Patch 4 adds support for DMA offload.
3. DMA hardware and driver support are required to enable DMA offload.
Without suitable support, the CPU is used for batch migration.
Requirements are described in Patch 4.
4. Patch 5 adds a DMA driver using DMAengine APIs for end-to-end
testing and validation.
Testing:
The patch series has been tested with migrate_pages(2) and move_pages(2)
using anonymous memory and memory-mapped files.
Byungchul Park (1):
mm: separate move/undo doing on folio list from migrate_pages_batch()
Mike Day (1):
mm: add support for DMA folio Migration
Shivank Garg (3):
mm: add folios_copy() for copying pages in batch during migration
mm: add migrate_folios_batch_move to batch the folio move operations
dcbm: add dma core batch migrator for batch page offloading
drivers/dma/Kconfig | 2 +
drivers/dma/Makefile | 1 +
drivers/dma/dcbm/Kconfig | 7 +
drivers/dma/dcbm/Makefile | 1 +
drivers/dma/dcbm/dcbm.c | 229 +++++++++++++++++++++
include/linux/migrate_dma.h | 36 ++++
include/linux/mm.h | 1 +
mm/Kconfig | 8 +
mm/Makefile | 1 +
mm/migrate.c | 385 +++++++++++++++++++++++++++++++-----
mm/migrate_dma.c | 51 +++++
mm/util.c | 22 +++
12 files changed, 692 insertions(+), 52 deletions(-)
create mode 100644 drivers/dma/dcbm/Kconfig
create mode 100644 drivers/dma/dcbm/Makefile
create mode 100644 drivers/dma/dcbm/dcbm.c
create mode 100644 include/linux/migrate_dma.h
create mode 100644 mm/migrate_dma.c
--
2.34.1