[PATCH RFC v6 0/5] Accelerate page migration with batch copying and hardware offload

From: Shivank Garg

Date: Tue Jun 30 2026 - 03:31:50 EST

This is the sixth RFC of the patchset to enhance page migration by
batching folio-copy operations and enabling acceleration via DMA offload.
I had intended to split the batch-copy path into a non-RFC series this
round. However, given some major changes, it does not hurt to take
one more pass with the RFC tag.

Single-threaded, folio-by-folio copying bottlenecks page migration in
modern systems with deep memory hierarchies, especially for large folios
where copy overhead dominates, leaving significant hardware potential
untapped.

By batching the copy phase, we create an opportunity for hardware
acceleration. This series builds the framework and provides a DMA
offload driver (dcbm) as a reference implementation, targeting bulk
migration workloads where offloading the copy improves throughput
and latency while freeing CPU cycles.

See the RFC V3 cover letter [3] for motivation.

Changelog since V5:
-------------------
1. A few cleanups and preparatory patches have been split out into a
separate series [12].
2. Rename FOLIO_ALREADY_COPIED to FOLIO_CONTENT_COPIED. (David Hildenbrand)
3. Mark the copy-done state per folio (FOLIO_CONTENT_COPIED) right after
each successful copy, instead of threading an already_copied bool
through migrate_folios_move() -> migrate_folio_move() ->
move_to_new_folio(). This is cleaner and lets a partially failed batch
fall back per folio rather than re-copying the whole batch. (David)
4. Move folios_mc_copy() into mm/migrate.c as migrate_folios_mc_copy()
so that it can set the per-folio copy marker (David), and add
cond_resched() (Sashiko).
5. FOLIO_CONTENT_COPIED packs into the unused low bits of the anon_vma
pointer in migrate_info, but bit 2 is not guaranteed to be free on
32-bit. The flag is therefore BIT(2) on CONFIG_64BIT and 0 otherwise,
and MIGRATION_COPY_OFFLOAD now depends on 64BIT. (Sashiko, Zi Yan)
6. Move copy-offload interface out of migrate.c into
migrate_copy_offload.c. Restructure patches (migrate_offload_do_batch) (David)
7. Add MAINTAINERS entry for drivers/migrate_offload (David)
8. Avoid DMA-pinned folios on the batch path to avoid a stale-data race
(Sashiko). As a result, folio_can_batch_copy() could not be renamed to a
page-specific name, as it is no longer specific to movable_ops.
9. Add configurable migration reason callback. (David Rientjes)
10. Renamed the dispatch static_call to migrate_offload_batch_copy_fn and
the hot-path predicate to migrate_should_offload().
11. Fix the DCBM crash when built with =y. Expose DCBM runtime
knobs as module parameters under /sys/module/dcbm/parameters instead
of the earlier /sys/kernel.
12. Rebased on v7.2-rc1.
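
For readers new to the low-bit packing mentioned in items 3 and 5, here
is a minimal userspace sketch of the general technique. It is not the
code from this series: every sketch_* / SKETCH_* name is hypothetical,
the assumption that bits 0 and 1 already carry other state is mine, and
the only detail taken from the changelog is that the copied flag is
BIT(2) on 64-bit and 0 otherwise.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_BIT(n)	((uintptr_t)1 << (n))

/* From the changelog: BIT(2) on 64-bit, 0 (flag compiled away) otherwise. */
#if UINTPTR_MAX > 0xffffffffu
#define SKETCH_CONTENT_COPIED	SKETCH_BIT(2)
#else
#define SKETCH_CONTENT_COPIED	((uintptr_t)0)
#endif

/* Assumption: bits 0 and 1 are already used for other per-folio state. */
#define SKETCH_FLAG_MASK \
	(SKETCH_BIT(0) | SKETCH_BIT(1) | SKETCH_CONTENT_COPIED)

/* Stand-in for an 8-byte-aligned object such as an anon_vma. */
struct sketch_obj {
	_Alignas(8) long payload;
};

/* Pack flags into the low bits that alignment leaves clear. */
static inline uintptr_t sketch_pack(const struct sketch_obj *p,
				    uintptr_t flags)
{
	assert(((uintptr_t)p & SKETCH_FLAG_MASK) == 0);
	assert((flags & ~SKETCH_FLAG_MASK) == 0);
	return (uintptr_t)p | flags;
}

/* Recover the pointer by masking the flag bits back out. */
static inline struct sketch_obj *sketch_ptr(uintptr_t v)
{
	return (struct sketch_obj *)(v & ~SKETCH_FLAG_MASK);
}

/* Mark the copy as done right after a successful copy. */
static inline uintptr_t sketch_mark_copied(uintptr_t v)
{
	return v | SKETCH_CONTENT_COPIED;
}

/* Always false when the flag is defined as 0 (32-bit). */
static inline int sketch_content_copied(uintptr_t v)
{
	return (v & SKETCH_CONTENT_COPIED) != 0;
}
```

Because the flag is 0 on 32-bit, sketch_mark_copied() becomes a no-op
there and the move phase would always take the per-folio copy, which is
consistent with making the offload option depend on 64BIT.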

DESIGN:
-------

New migration flow:

  [ migrate_pages_batch() ]
      |
      |--> offload = migrate_should_offload(reason)  // core filters by migration reason
      |
      |--> for each folio:
      |      migrate_folio_unmap()                   // unmap the folio
      |      +--> (success):
      |           if offload && folio_can_batch_copy():
      |               -> unmap_batch / dst_batch     // batch list for copy offloading
      |           else:
      |               -> unmap_single / dst_single   // single list for per-folio CPU copy
      |
      |--> try_to_unmap_flush()                      // single batched TLB flush
      |
      |--> Batch copy (if unmap_batch not empty):
      |      - Migrator is configurable at runtime via module parameters.
      |        static_call(migrate_offload_batch_copy_fn)  // Pluggable Migrators
      |             /            |            \
      |            v             v             v
      |      [ default ]  [ DMA (dcbm) ]    [ ... ]
      |      Each copied dst is marked FOLIO_CONTENT_COPIED; folios left
      |      unmarked (driver error / copy failure) fall back to per-folio
      |      CPU copy in the move phase.
      |
      +--> migrate_folios_move()                     // metadata, PTEs, finalize
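
The flow above, and in particular the per-folio fallback, can be modelled
in a few lines of userspace C. This is only an illustration under stated
assumptions: all sk_* names are hypothetical, a folio is reduced to a
small byte array, SK_MAX_BATCH is an arbitrary cap, and the "flaky"
copier stands in for a driver that fails on some folios.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define SK_FOLIO_BYTES	64	/* tiny stand-in for a folio's contents */
#define SK_MAX_BATCH	16	/* hypothetical cap for this sketch only */

struct sk_folio {
	unsigned char src[SK_FOLIO_BYTES];
	unsigned char dst[SK_FOLIO_BYTES];
	bool can_batch;		/* models folio_can_batch_copy() */
	bool content_copied;	/* models FOLIO_CONTENT_COPIED on dst */
};

typedef void (*sk_batch_copy_fn)(struct sk_folio **batch, size_t n);

/* Hypothetical migrator that fails on every odd batch index. */
static inline void sk_flaky_batch_copy(struct sk_folio **batch, size_t n)
{
	for (size_t i = 0; i < n; i++) {
		if (i % 2)
			continue;	/* simulated error: leave unmarked */
		memcpy(batch[i]->dst, batch[i]->src, SK_FOLIO_BYTES);
		batch[i]->content_copied = true;
	}
}

/* Returns how many folios needed the per-folio CPU copy. */
static inline size_t sk_migrate_batch(struct sk_folio *folios, size_t n,
				      bool offload, sk_batch_copy_fn copy)
{
	struct sk_folio *batch[SK_MAX_BATCH];
	size_t nbatch = 0, cpu_copies = 0;

	/* "Unmap" phase: sort folios onto the batch list or leave single. */
	for (size_t i = 0; i < n; i++) {
		folios[i].content_copied = false;
		if (offload && folios[i].can_batch && nbatch < SK_MAX_BATCH)
			batch[nbatch++] = &folios[i];
	}

	/* The single batched TLB flush would sit here. */

	/* Batch copy: the migrator marks each folio it copied. */
	if (nbatch) {
		assert(copy != NULL);
		copy(batch, nbatch);
	}

	/* Move phase: anything still unmarked gets the CPU copy. */
	for (size_t i = 0; i < n; i++) {
		if (folios[i].content_copied)
			continue;
		memcpy(folios[i].dst, folios[i].src, SK_FOLIO_BYTES);
		cpu_copies++;
	}
	return cpu_copies;
}
```

The point of the per-folio marker is visible in the last loop: a partial
failure costs one CPU copy per failed folio, not a re-copy of the batch.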

Offload registration:

A driver fills struct migrator { .name, .offload_copy, .owner } and
calls migrate_offload_register(). This:
- pins the module
- patches the migrate_offload_batch_copy_fn static_call target
- enables the migrate_offload_enabled static branch.

migrate_offload_unregister() disables the branch, reverts the
static_call, then synchronize_srcu() waits for in-flight migrations
before module_put().
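
As a rough illustration of the ordering described above, here is a
single-threaded userspace sketch. It deliberately swaps the kernel
mechanisms for plain variables: a function pointer instead of the
static_call, a bool instead of the static branch, a counter instead of
the module reference, and only a comment where synchronize_srcu() would
wait. The sk_* names, the offload_copy signature, and the
one-migrator-at-a-time -EBUSY rule are assumptions of the sketch, not
taken from the patches.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

typedef int (*sk_copy_fn)(void *dst, const void *src, size_t len);

struct sk_migrator {
	const char *name;
	sk_copy_fn offload_copy;
	int owner_refs;		/* stand-in for pinning .owner */
};

static inline int sk_default_copy(void *dst, const void *src, size_t len)
{
	memcpy(dst, src, len);
	return 0;
}

static sk_copy_fn sk_batch_copy_fn = sk_default_copy;	/* "static_call" */
static bool sk_offload_enabled;				/* "static branch" */
static struct sk_migrator *sk_current;

static inline int sk_register(struct sk_migrator *m)
{
	if (!m || !m->offload_copy)
		return -EINVAL;
	if (sk_current)
		return -EBUSY;		/* assumption: one migrator at a time */
	m->owner_refs++;		/* pin the module */
	sk_batch_copy_fn = m->offload_copy;	/* patch the call target */
	sk_offload_enabled = true;	/* enable the branch last */
	sk_current = m;
	return 0;
}

static inline int sk_unregister(struct sk_migrator *m)
{
	if (!m || sk_current != m)
		return -ENOENT;
	sk_offload_enabled = false;	/* disable the branch first */
	sk_batch_copy_fn = sk_default_copy;	/* revert the call target */
	/* The kernel would wait here for in-flight migrations (SRCU). */
	sk_current = NULL;
	assert(m->owner_refs > 0);
	m->owner_refs--;		/* only now drop the module pin */
	return 0;
}

/* Hot path: use the migrator only while the branch is enabled. */
static inline int sk_copy(void *dst, const void *src, size_t len)
{
	if (sk_offload_enabled)
		return sk_batch_copy_fn(dst, src, len);
	return sk_default_copy(dst, src, len);
}

/* Hypothetical driver callback that counts how often it is used. */
static int sk_demo_calls;

static inline int sk_demo_copy(void *dst, const void *src, size_t len)
{
	sk_demo_calls++;
	memcpy(dst, src, len);
	return 0;
}
```

Dropping the pin only after the wait is what keeps a migration that is
still inside the driver's callback from racing with module unload.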

PERFORMANCE RESULTS:
--------------------

AMD EPYC 7713 (Zen 3), 2 sockets, 32 cores, SMT on,
1 NUMA node per socket, 256 GB/node, v7.2-rc1, DVFS=Performance, PTDMA
(16 DMA channels).

Benchmark: move_pages() syscall to move pages between two NUMA nodes.

1). Moving folios of different sizes with the total transfer size held
constant (1 GB), across different numbers of DMA channels. Throughput in GB/s.

a. Baseline (vanilla kernel, single-threaded, serial folio_copy):
================================================================================
    4K     |    16K    |    64K    |   256K    |    1M     |     2M     |
================================================================================
 3.28±0.14 | 4.98±0.18 | 6.19±0.08 | 6.77±0.08 | 7.02±0.11 | 10.80±0.13 |

b. DMA offload (Patched Kernel, dcbm driver, N DMA channels):
============================================================================================
N channel|    4K     |    16K    |    64K     |    256K    |     1M     |     2M     |
============================================================================================
1        | 2.38±0.17 | 2.77±0.03 |  3.21±0.03 |  5.00±0.02 |  5.09±0.64 | 12.62±0.07 |
2        | 2.87±0.11 | 4.06±0.05 |  5.09±0.04 |  6.97±0.08 |  8.43±0.06 | 14.32±0.10 |
4        | 3.32±0.07 | 5.30±0.06 |  7.21±0.09 |  9.69±0.15 | 11.36±0.13 | 26.98±0.19 |
8        | 3.68±0.09 | 6.28±0.10 |  9.16±0.13 | 12.05±0.16 | 15.33±2.80 | 46.06±0.55 |
12       | 3.83±0.05 | 6.65±0.17 | 10.00±0.16 | 12.98±0.18 | 15.87±0.19 | 61.31±1.28 |
16       | 3.94±0.09 | 6.78±0.10 | 10.48±0.13 | 13.48±0.20 | 16.90±0.24 | 65.06±2.46 |
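
To make the two tables easier to compare, the small C sketch below
computes the speedup over baseline and the smallest channel count at
which DMA offload meets or beats the baseline for each folio size. The
numbers are the means from tables a and b; the error bars are ignored,
so treat the break-even points as indicative only. The sk_* names are
hypothetical helpers for this illustration.

```c
#include <assert.h>
#include <stddef.h>

#define SK_NSIZES	6	/* 4K, 16K, 64K, 256K, 1M, 2M */
#define SK_NROWS	6	/* 1, 2, 4, 8, 12, 16 channels */

/* Table a: baseline mean throughput in GB/s. */
static const double sk_baseline[SK_NSIZES] = {
	3.28, 4.98, 6.19, 6.77, 7.02, 10.80
};

static const int sk_channels[SK_NROWS] = { 1, 2, 4, 8, 12, 16 };

/* Table b: DMA offload mean throughput in GB/s, one row per channel count. */
static const double sk_dma[SK_NROWS][SK_NSIZES] = {
	{ 2.38, 2.77,  3.21,  5.00,  5.09, 12.62 },
	{ 2.87, 4.06,  5.09,  6.97,  8.43, 14.32 },
	{ 3.32, 5.30,  7.21,  9.69, 11.36, 26.98 },
	{ 3.68, 6.28,  9.16, 12.05, 15.33, 46.06 },
	{ 3.83, 6.65, 10.00, 12.98, 15.87, 61.31 },
	{ 3.94, 6.78, 10.48, 13.48, 16.90, 65.06 },
};

/* Ratio of offload throughput to baseline for one table cell. */
static inline double sk_speedup(size_t row, size_t size_idx)
{
	assert(row < SK_NROWS && size_idx < SK_NSIZES);
	return sk_dma[row][size_idx] / sk_baseline[size_idx];
}

/* Smallest channel count whose mean is >= baseline, or 0 if none. */
static inline int sk_break_even_channels(size_t size_idx)
{
	assert(size_idx < SK_NSIZES);
	for (size_t row = 0; row < SK_NROWS; row++)
		if (sk_dma[row][size_idx] >= sk_baseline[size_idx])
			return sk_channels[row];
	return 0;
}
```

On these means, break-even is at 4 channels for 4K through 64K, at
2 channels for 256K and 1M, and at 1 channel for 2M; with 16 channels
the 2M case is roughly 6x the baseline.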

2). First-folio latency: custom tracepoints (in migrate_pages_batch enter/exit,
migrate_folio_done) measure latency per migrate_pages_batch() call.

Throughput (GB/s) and first-folio latency (us), median of 10 runs.

a. Vanilla Kernel:

NR_MAX_BATCHED_MIGRATION upstream default value is 512.
--- Order 0 (4K folios) ---         --- Order 9 (2M folios) ---
    n        vanilla/cpu                n        vanilla/cpu
(folios)   GB/s | first(us)         (folios)   GB/s | first(us)
---------------------------         ---------------------------
       1   0.03 |        24                1   6.86 |       204
       4   0.13 |        30                4   8.68 |       191
       8   0.27 |        27                8   7.92 |       207
      16   0.43 |        34               16   6.77 |       234
      64   1.12 |        51               64  10.44 |       179
     256   1.67 |       166              256  10.43 |       181
     512   1.98 |       255              512  10.55 |       179
    2048   2.38 |       233
    4096   2.42 |       168
   16384   2.72 |       167
   65536   3.00 |       156
  262144   3.10 |       151

b. Patched kernel:
N = NR_MAX_BATCHED_MIGRATION (in pages); total migrated data is fixed at
1 GB. N is changed via a knob (for testing only) to measure the impact of
different maximum batch sizes.

--- ORDER 0 (4K folios) ---

     N       offload/dma1          offload/dma4          offload/dma16
            GB/s | first(us)      GB/s | first(us)      GB/s | first(us)
------------------------------------------------------------------------
   512      2.21 |       628      3.29 |       275      3.25 |       245
  1024      2.06 |      1271      3.21 |       601      3.36 |       518
  2048      2.02 |      2646      3.00 |      1388      3.20 |      1110
  4096      2.08 |      4832      3.17 |      2514      3.41 |      2175
  8192      2.16 |      9253      3.14 |      4839      3.62 |      3592
 16384      2.24 |     17543      3.23 |      9680      3.58 |      7144
 32768      2.22 |     36408      3.26 |     19301      3.67 |     14524
 65536      2.12 |     82572      3.24 |     38091      3.62 |     29835
131072      2.08 |    153669      3.17 |     79744      3.48 |     62157
262144      2.05 |    332297      2.97 |    175315      3.33 |    134774

--- ORDER 9 (2M folios) ---

     N       offload/dma1          offload/dma4          offload/dma16
            GB/s | first(us)      GB/s | first(us)      GB/s | first(us)
------------------------------------------------------------------------
   512     11.74 |       160     11.71 |       160     11.75 |       159
  1024     12.18 |       310     13.82 |       274     13.76 |       275
  2048     12.39 |       612     25.55 |       290     25.69 |       289
  4096     12.54 |      1211     26.25 |       564     42.36 |       334
  8192     12.54 |      2421     26.82 |      1111     51.85 |       485
 16384     12.61 |      4824     26.91 |      2209     54.26 |       925
 32768     12.62 |      9652     27.04 |      4404     54.72 |      1942
 65536     12.64 |     19287     26.95 |      8835     57.30 |      3535
131072     12.64 |     38824     26.95 |     17900     58.58 |      7747
262144     12.66 |     77610     26.95 |     35743     66.31 |     13801

OPEN QUESTION:
--------------

The best batch size depends on the hardware, and bigger isn't always better.
NR_MAX_BATCHED_MIGRATION decides how many pages we move at once.
A larger batch size helps amortize the migrator's setup cost but
increases the first-folio latency (the folio is inaccessible for this
window).

Goals could be workload dependent, e.g. higher throughput versus
same throughput under a bounded latency.

Should this be tunable to accommodate different hardware and goals?
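
To make the trade-off concrete, here is one possible selection policy,
sketched in userspace C: given a first-folio latency bound, pick the
batch size with the best measured throughput that stays within it. This
is not something the series implements. The sk_* names are hypothetical,
and the table is the Order 9, offload/dma16 column from the results
above, so the answer is specific to that hardware and folio size.

```c
#include <assert.h>
#include <stddef.h>

struct sk_sample {
	unsigned int n;		/* NR_MAX_BATCHED_MIGRATION, in pages */
	double gbps;		/* measured throughput */
	unsigned int first_us;	/* measured first-folio latency */
};

/* Order 9 (2M folios), offload/dma16, copied from the table above. */
static const struct sk_sample sk_order9_dma16[] = {
	{    512, 11.75,   159 },
	{   1024, 13.76,   275 },
	{   2048, 25.69,   289 },
	{   4096, 42.36,   334 },
	{   8192, 51.85,   485 },
	{  16384, 54.26,   925 },
	{  32768, 54.72,  1942 },
	{  65536, 57.30,  3535 },
	{ 131072, 58.58,  7747 },
	{ 262144, 66.31, 13801 },
};

#define SK_NSAMPLES (sizeof(sk_order9_dma16) / sizeof(sk_order9_dma16[0]))

/*
 * Return the batch size with the highest throughput whose first-folio
 * latency is within bound_us, or 0 if no sample meets the bound.
 */
static inline unsigned int sk_pick_batch(const struct sk_sample *tbl,
					 size_t count, unsigned int bound_us)
{
	unsigned int best_n = 0;
	double best_gbps = 0.0;

	assert(tbl != NULL);
	for (size_t i = 0; i < count; i++) {
		if (tbl[i].first_us > bound_us)
			continue;
		if (tbl[i].gbps > best_gbps) {
			best_gbps = tbl[i].gbps;
			best_n = tbl[i].n;
		}
	}
	return best_n;
}
```

With a 500 us bound this picks N = 8192 (51.85 GB/s), and with a
1000 us bound N = 16384 (54.26 GB/s); a different migrator or folio
order would give a different table and a different answer, which is the
argument for making the limit tunable.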

FOLLOW-UPS:
-----------
1. dmaengine_prep_dma_memcpy_sg() in DCBM (Vinod Koul); needs the ptdma/sdxi
SG hook - device_prep_dma_memcpy_sg [10]. Will post separately. This will
address two concerns:
- IOMMU SG merging in DCBM: dma_map_sgtable() may merge PFNs
unevenly so src.nents != dst.nents. (Gregory)
- Descriptor chaining: out-of-order completion of DMA descriptors on some
DMA engines, e.g. Intel DSA, can yield incorrect results. (Karim Manaouil)
2. SDXI as a second migrator [11]. SDXI is a generic memcpy engine
without DMA_PRIVATE, so channel acquisition uses dma_find_channel()
rather than dma_request_chan_by_mask(); I have a local DCBM variant
working and will post once SDXI settles.
3. Revisit the multi-threaded CPU copy migrator once the infrastructure is
settled, or pursue it separately, as it raises other questions such as
whom to charge [13].
4. Batching the migration rmap walks: try_to_migrate_one() and
remove_migration_pte() overheads dominate for PTE-mapped large
folios. I have this working locally, and will post it separately.

EARLIER POSTINGS:
-----------------
[1] RFC V5: https://lore.kernel.org/all/20260428155043.39251-2-shivankg@xxxxxxx
[2] RFC V4: https://lore.kernel.org/all/20260309120725.308854-3-shivankg@xxxxxxx
[3] RFC V3: https://lore.kernel.org/all/20250923174752.35701-1-shivankg@xxxxxxx
[4] RFC V2: https://lore.kernel.org/all/20250319192211.10092-1-shivankg@xxxxxxx
[5] RFC V1: https://lore.kernel.org/all/20240614221525.19170-1-shivankg@xxxxxxx
[6] RFC from Zi Yan: https://lore.kernel.org/all/20250103172419.4148674-1-ziy@xxxxxxxxxx

RELATED DISCUSSIONS:
--------------------
[7] MM-alignment Session [Nov 12, 2025]:
https://lore.kernel.org/linux-mm/bd6a3c75-b9f0-cbcf-f7c4-1ef5dff06d24@xxxxxxxxxx
[8] Linux Memory Hotness and Promotion call [Nov 6, 2025]:
https://lore.kernel.org/linux-mm/8ff2fd10-c9ac-4912-cf56-7ecd4afd2770@xxxxxxxxxx
[9] LSFMM 2025:
https://lore.kernel.org/all/cf6fc05d-c0b0-4de3-985e-5403977aa3aa@xxxxxxx
[10] DMA_MEMCPY_SG comparison:
https://lore.kernel.org/linux-mm/3e73addb-ac01-4a05-bc75-c6c1c56072df@xxxxxxx
[11] SDXI V3:
https://lore.kernel.org/all/20260605-sdxi-base-v3-0-4d38ca2bdffe@xxxxxxx
[12] migrate cleanups/prep:
https://lore.kernel.org/all/20260626-migrate-cleanups-prep-v1-0-a95933af7619@xxxxxxx
[13] Charge calling thread for multi-threaded copy:
https://lore.kernel.org/all/633F4EFC-13A9-40DF-A27D-DBBDD0AF44F3@xxxxxxxxxx/
[14] OSS India:
https://ossindia2025.sched.com/event/23Jk1

Thanks to everyone who reviewed, tested, or participated in discussions
around this series.

Signed-off-by: Shivank Garg <shivankg@xxxxxxx>
---
Shivank Garg (4):
mm/migrate: skip data copy for already-copied folios
mm/migrate: add batch-copy path in migrate_pages_batch
mm/migrate: add copy offload registration infrastructure
drivers/migrate_offload: add DMA batch copy driver (dcbm)

Zi Yan (1):
mm/migrate: adjust NR_MAX_BATCHED_MIGRATION for testing

MAINTAINERS | 2 +
drivers/Kconfig | 2 +
drivers/Makefile | 2 +
drivers/migrate_offload/Kconfig | 9 +
drivers/migrate_offload/Makefile | 1 +
drivers/migrate_offload/dcbm/Makefile | 1 +
drivers/migrate_offload/dcbm/dcbm.c | 481 ++++++++++++++++++++++++++++++++++
include/linux/migrate.h | 28 ++
include/linux/migrate_copy_offload.h | 69 +++++
mm/Kconfig | 6 +
mm/Makefile | 1 +
mm/migrate.c | 133 +++++++---
mm/migrate_copy_offload.c | 249 ++++++++++++++++++
13 files changed, 954 insertions(+), 30 deletions(-)
---
base-commit: 8b84e29dc92dbcada913c9bab976aa6f761b04e6
change-id: 20260616-shivank-batch-migrate-offload-deac05cbeaa5
prerequisite-change-id: 20260626-migrate-cleanups-prep-77a0ad340cc6:v1
prerequisite-patch-id: 816a6f6957cc9c0f3903db7b7f462a4ade2a7519
prerequisite-patch-id: d8521dbc801fb6e7cbaa5be8cecad491f7e2f809
prerequisite-patch-id: 3fcedadd87cd30b3fe8833e0ed33890f15c0e950

Best regards,
--
Shivank Garg <shivankg@xxxxxxx>