Re: [PATCH 0/7] Accelerate page migration with batch copying and hardware offload

From: Garg, Shivank

Date: Fri May 08 2026 - 07:07:26 EST




On 4/30/2026 2:17 PM, Huang, Ying wrote:
> Shivank Garg <shivankg@xxxxxxx> writes:

>> PERFORMANCE RESULTS:
>> --------------------
>>
>> Re-ran the V4 workload on v7.1-rc1 with this series; relative
>> speedups match V4 (~6x for 2MB folios at 16 DMA channels). No design
>> change in V5 alters this picture; please refer to the V4 cover letter
>> for the throughput tables [1].
>
> IMHO, it's better to copy performance data here.
>
> In addition to the performance benefit, I want to know the downside as
> well. For example, the migration latency of the first folio may be
> longer. If so, by how much? Can you measure the batch number vs. total
> migration time (benefit) and first folio migration time (downside)?
> That can be used to determine the optimal batch number.
>

System info: AMD Zen 3 EPYC server (2 sockets, 32 cores, SMT enabled),
1 NUMA node per socket, v7.1-rc1, DVFS set to Performance, PTDMA hardware.

Benchmark: move_pages() syscall to move pages between two NUMA nodes.

1). Migrating folios of different sizes with the total transfer size held
constant (1 GB), across different numbers of DMA channels. Throughput in GB/s.

a. Baseline (vanilla kernel, single-threaded, serial folio_copy):

==========================================================================
    4K     |    16K    |    64K    |   256K    |    1M     |     2M     |
==========================================================================
 3.31±0.18 | 5.61±0.07 | 6.66±0.03 | 7.01±0.03 | 7.13±0.08 | 11.02±0.17 |


b. DMA offload (Patched Kernel, dcbm driver, N DMA channels):

========================================================================================
N channels |    4K     |    16K    |    64K    |    256K    |     1M     |     2M     |
========================================================================================
     1     | 2.16±0.14 | 2.58±0.02 | 3.00±0.04 |  4.56±0.28 |  4.62±0.02 | 12.65±0.08 |
     2     | 2.68±0.09 | 3.69±0.15 | 4.52±0.04 |  6.75±0.06 |  7.19±0.19 | 14.38±0.06 |
     4     | 3.07±0.13 | 4.62±0.09 | 6.47±0.56 |  9.22±0.15 | 10.24±0.47 | 27.01±0.11 |
     8     | 3.43±0.09 | 5.40±0.16 | 7.67±0.08 | 11.25±0.17 | 12.60±0.60 | 45.62±0.52 |
    12     | 3.50±0.11 | 5.66±0.16 | 8.12±0.10 | 11.97±0.19 | 13.43±0.08 | 61.02±0.92 |
    16     | 3.54±0.12 | 5.79±0.14 | 8.50±0.13 | 12.59±0.15 | 17.21±6.40 | 65.23±1.70 |
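
To make the comparison between tables (a) and (b) easier to read, here is a
small Python sketch (not part of the series; the helper names are made up for
illustration) that divides the DMA means by the baseline means and reports the
smallest channel count at which offload beats the CPU copy. It uses only the
means from the tables above and ignores the ± terms, so cells that are close
to the baseline should be read with that in mind.

```python
# Throughput means (GB/s) copied from tables (a) and (b) above; ± terms omitted.
SIZES = ["4K", "16K", "64K", "256K", "1M", "2M"]
BASELINE = [3.31, 5.61, 6.66, 7.01, 7.13, 11.02]
DMA = {
    1:  [2.16, 2.58, 3.00, 4.56, 4.62, 12.65],
    2:  [2.68, 3.69, 4.52, 6.75, 7.19, 14.38],
    4:  [3.07, 4.62, 6.47, 9.22, 10.24, 27.01],
    8:  [3.43, 5.40, 7.67, 11.25, 12.60, 45.62],
    12: [3.50, 5.66, 8.12, 11.97, 13.43, 61.02],
    16: [3.54, 5.79, 8.50, 12.59, 17.21, 65.23],
}


def speedups(channels):
    """Return {folio size: DMA throughput / baseline throughput}."""
    return {s: d / b for s, d, b in zip(SIZES, DMA[channels], BASELINE)}


def break_even_channels(size):
    """Smallest channel count whose mean throughput beats the baseline, or None."""
    i = SIZES.index(size)
    for n in sorted(DMA):
        if DMA[n][i] > BASELINE[i]:
            return n
    return None


if __name__ == "__main__":
    for n in sorted(DMA):
        row = "  ".join(f"{s}:{v:.2f}x" for s, v in speedups(n).items())
        print(f"{n:>2} ch  {row}")
    for s in SIZES:
        print(f"{s}: break-even at {break_even_channels(s)} channel(s)")
```

On these means, 2M folios at 16 channels come out at roughly 5.9x the
baseline, in line with the ~6x quoted from V4, while 4K folios need 8 channels
just to match the CPU copy.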


2). First-folio latency: the kernel is instrumented with custom tracepoints to
measure latency per migrate_pages_batch() call.
Results: throughput (GB/s) and first-folio latency (in microseconds), median of 10 runs.

A). Vanilla Kernel:

Here, n is the number of folios passed to a single move_pages() call.
NR_MAX_BATCHED_MIGRATION is left at the upstream default of 512.

--- Order 0 (4K folios) ---
       n    vanilla/cpu
(folios)     GB/s | first(us)
-----------------------------
       1     0.04 |        24
       4     0.16 |        25
       8     0.29 |        31
      16     0.54 |        27
      64     1.15 |        68
     256     1.86 |       162
     512     2.21 |       264
    2048     2.62 |       208
    4096     2.74 |       182
   16384     2.73 |       173
   65536     3.28 |       166
  262144     3.20 |       167

--- Order 9 (2M folios) ---
       n    vanilla/cpu
(folios)     GB/s | first(us)
-----------------------------
       1     7.05 |       194
       4     8.78 |       186
       8     8.47 |       188
      16     7.20 |       193
      64     8.23 |       191
     256    10.51 |       180
     512    10.88 |       173

Takeaway:
In each migrate_pages_batch() call, all folios are first unmapped, then
try_to_unmap_flush() runs, and only then do folios enter move_to_new_folio().
So first-folio latency is bounded by the per-batch unmap+flush cost, and it
plateaus once the workload is large enough.


B). Patched kernel:

Here, N = NR_MAX_BATCHED_MIGRATION (in pages). Total migrated data is fixed at 1 GB.
N is varied through a knob to measure the impact of the maximum batch size.

--- ORDER 0 (4K folios) ---
     N    offload/dma1         offload/dma4         offload/dma16
           GB/s | first(us)     GB/s | first(us)     GB/s | first(us)
---------------------------------------------------------------------
   512     2.13 |       639     3.23 |       290     3.27 |       253
  1024     2.17 |      1261     3.44 |       582     3.58 |       536
  2048     2.01 |      2769     3.09 |      1360     3.45 |      1083
  4096     2.10 |      5059     3.13 |      2737     3.58 |      2115
  8192     2.21 |      9320     3.17 |      5015     3.75 |      3617
 16384     2.15 |     18689     3.31 |      9623     3.87 |      6937
 32768     2.12 |     42692     3.38 |     18893     3.83 |     14255
 65536     2.09 |     81956     3.38 |     38556     3.64 |     29003
131072     2.02 |    169563     3.22 |     81082     3.63 |     62236
262144     2.21 |    318424     3.12 |    170174     3.50 |    129413

--- ORDER 9 (2M folios) ---
     N    offload/dma1         offload/dma4         offload/dma16
           GB/s | first(us)     GB/s | first(us)     GB/s | first(us)
---------------------------------------------------------------------
   512    11.66 |       160    11.68 |       160    11.65 |       160
  1024    12.16 |       310    13.67 |       275    13.64 |       276
  2048    12.30 |       613    25.47 |       290    25.48 |       291
  4096    12.48 |      1215    26.19 |       566    42.59 |       335
  8192    12.56 |      2424    26.57 |      1118    58.72 |       470 *
 16384    12.61 |      4839    26.77 |      2218    61.94 |       896
 32768    12.60 |      9667    26.98 |      4422    63.75 |      1748
 65536    12.63 |     19318    26.99 |      8838    60.66 |      3543
131072    12.64 |     38935    27.02 |     17935    61.06 |      7178
262144    12.66 |     77694    26.85 |     35871    65.06 |     14129

In the batch-copy offload approach, the DMA copy phase is inserted between
unmap/flush and move, so a larger N increases first-folio wall-clock latency.
Throughput improves, but with diminishing returns.
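
As a rough check of how first-folio latency scales with N, the Python sketch
below (illustrative only; the helper name is hypothetical) divides the
first-folio latency by the batch size for the offload/dma1 column of the
ORDER 0 table above. A roughly constant per-page cost would indicate
near-linear growth in N.

```python
# (N, first-folio latency in us) for 4K folios with one DMA channel,
# copied from the offload/dma1 column of the ORDER 0 table above.
DMA1_ORDER0 = [
    (512, 639), (1024, 1261), (2048, 2769), (4096, 5059), (8192, 9320),
    (16384, 18689), (32768, 42692), (65536, 81956), (131072, 169563),
    (262144, 318424),
]


def us_per_page(rows):
    """First-folio latency divided by batch size, for each (N, latency) row."""
    return [(n, lat / n) for n, lat in rows]


if __name__ == "__main__":
    for n, cost in us_per_page(DMA1_ORDER0):
        print(f"N={n:>6}  {cost:.2f} us/page")
```

For this column the ratio stays between about 1.1 and 1.4 us per batched page
across the whole range of N, so doubling N roughly doubles the first-folio
latency while throughput stays near 2.1 GB/s.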

For the DCBM+PTDMA setup, the optimal batch size for 2M folios sits around
N=8192-16384, because a larger batch lets the driver distribute more folios
across the available DMA channels. This range gives most of the throughput
while keeping first-folio latency in check.
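
One way to turn the table into a batch-size choice is to pick the smallest N
that reaches a given fraction of the best observed throughput. The Python
sketch below does that for the offload/dma16 column of the ORDER 9 table
above. The 90% threshold and the function name are assumptions made for
illustration, not something the series defines, and the 512 pages-per-folio
constant assumes 4K base pages (consistent with order 9 being 2M).

```python
PAGES_PER_2M_FOLIO = 512  # assumes 4 KiB base pages: 2 MiB / 4 KiB

# (N, GB/s, first-folio latency in us), copied from the offload/dma16
# column of the ORDER 9 table above.
DMA16_ORDER9 = [
    (512, 11.65, 160), (1024, 13.64, 276), (2048, 25.48, 291),
    (4096, 42.59, 335), (8192, 58.72, 470), (16384, 61.94, 896),
    (32768, 63.75, 1748), (65536, 60.66, 3543), (131072, 61.06, 7178),
    (262144, 65.06, 14129),
]


def smallest_batch_near_peak(rows, fraction=0.90):
    """Smallest N whose throughput is at least `fraction` of the best observed.

    The 0.90 default is an arbitrary illustrative threshold.
    """
    peak = max(gbps for _, gbps, _ in rows)
    for n, gbps, latency_us in sorted(rows):
        if gbps >= fraction * peak:
            return n, gbps, latency_us
    raise ValueError("no row reaches the requested fraction of peak")


if __name__ == "__main__":
    n, gbps, latency_us = smallest_batch_near_peak(DMA16_ORDER9)
    folios = n // PAGES_PER_2M_FOLIO
    print(f"N={n}: {gbps} GB/s, first folio {latency_us} us, "
          f"{folios} x 2M folios per batch")
```

With a 90% threshold this selects N=8192, which is 16 2M folios per batch, the
same as the channel count in the dma16 configuration. Raising the threshold to
95% moves the choice to N=16384 and roughly doubles the first-folio latency.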

This optimal batch value is hardware-specific. Other engines (e.g. SDXI) and
memory tiers (e.g. CXL) will likely have different curves.

Does this approach and experiment look good to you?

Thanks,
Shivank