Re: [PATCH 0/7] Accelerate page migration with batch copying and hardware offload

From: Huang, Ying

Date: Fri May 08 2026 - 07:33:17 EST

Hi, Shivank,

"Garg, Shivank" <shivankg@xxxxxxx> writes:

> On 4/30/2026 2:17 PM, Huang, Ying wrote:
>> Shivank Garg <shivankg@xxxxxxx> writes:
>
>>> PERFORMANCE RESULTS:
>>> --------------------
>>>
>>> Re-ran the V4 workload on v7.1-rc1 with this series; relative
>>> speedups match V4 (~6x for 2MB folios at 16 DMA channels). No design
>>> change in V5 alters this picture; please refer to the V4 cover letter
>>> for the throughput tables [1].
>>
>> IMHO, it's better to copy performance data here.
>>
>> In addition to the performance benefit, I want to know the downside as
>> well. For example, the migration latency of the first folio may be
>> longer. If so, by how much? Can you measure the batch number vs. total
>> migration time (benefit) and first folio migration time (downside)?
>> That can be used to determine the optimal batch number.
>>
>
> System Info: AMD Zen 3 EPYC server (2-sockets, 32 cores, SMT Enabled),
> 1 NUMA node per socket, v7.1-rc1, DVFS set to Performance, PTDMA hardware.
>
> Benchmark: move_pages() syscall to move pages between two NUMA nodes.
>
> 1). Moving different sized folios such that total transfer size is constant
> (1GB), with different number of DMA channels. Throughput in GB/s.
>
> a. Baseline (vanilla kernel, single-threaded, serial folio_copy):
>
> ================================================================================
>     4K     |    16K    |    64K    |   256K    |    1M     |     2M     |
> ================================================================================
>  3.31±0.18 | 5.61±0.07 | 6.66±0.03 | 7.01±0.03 | 7.13±0.08 | 11.02±0.17 |
>
>
> b. DMA offload (Patched Kernel, dcbm driver, N DMA channels):
>
> ============================================================================================
> N channel |    4K     |    16K    |    64K    |    256K    |     1M     |     2M     |
> ============================================================================================
>     1     | 2.16±0.14 | 2.58±0.02 | 3.00±0.04 |  4.56±0.28 |  4.62±0.02 | 12.65±0.08 |
>     2     | 2.68±0.09 | 3.69±0.15 | 4.52±0.04 |  6.75±0.06 |  7.19±0.19 | 14.38±0.06 |
>     4     | 3.07±0.13 | 4.62±0.09 | 6.47±0.56 |  9.22±0.15 | 10.24±0.47 | 27.01±0.11 |
>     8     | 3.43±0.09 | 5.40±0.16 | 7.67±0.08 | 11.25±0.17 | 12.60±0.60 | 45.62±0.52 |
>    12     | 3.50±0.11 | 5.66±0.16 | 8.12±0.10 | 11.97±0.19 | 13.43±0.08 | 61.02±0.92 |
>    16     | 3.54±0.12 | 5.79±0.14 | 8.50±0.13 | 12.59±0.15 | 17.21±6.40 | 65.23±1.70 |
>
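
To read the two tables side by side, here is a small Python sketch that
divides the DMA-offload throughput by the vanilla baseline for each folio
size. The numbers are the means copied from tables a. and b. above (the ±
values are dropped); the names SIZES, BASELINE, DMA and speedup() are made
up for illustration only.

```python
# Mean throughput in GB/s, copied from tables a. and b. (± values dropped).
SIZES = ["4K", "16K", "64K", "256K", "1M", "2M"]
BASELINE = [3.31, 5.61, 6.66, 7.01, 7.13, 11.02]
DMA = {  # key: number of DMA channels
    1:  [2.16, 2.58, 3.00, 4.56, 4.62, 12.65],
    2:  [2.68, 3.69, 4.52, 6.75, 7.19, 14.38],
    4:  [3.07, 4.62, 6.47, 9.22, 10.24, 27.01],
    8:  [3.43, 5.40, 7.67, 11.25, 12.60, 45.62],
    12: [3.50, 5.66, 8.12, 11.97, 13.43, 61.02],
    16: [3.54, 5.79, 8.50, 12.59, 17.21, 65.23],
}


def speedup(channels):
    """Return {folio size: offload throughput / baseline throughput}."""
    return {s: d / b for s, d, b in zip(SIZES, DMA[channels], BASELINE)}


if __name__ == "__main__":
    for ch in sorted(DMA):
        row = speedup(ch)
        print(f"{ch:>2} ch:", "  ".join(f"{s}={row[s]:.2f}x" for s in SIZES))
```

For 2M folios at 16 channels the ratio is about 5.9x, in line with the ~6x
quoted from the cover letter, while 4K folios with a single channel come
out below 1x, i.e. slower than the CPU copy.
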
>
> 2). First-folio latency: Instrumented with custom tracepoints to measure latency per migrate_pages_batch() call.
> Result: throughput (GB/s) and first-folio latency (in microseconds), median of 10 runs.

Thanks for the detailed data. Per my understanding, the run time of
migrate_pages_batch() may not be good enough for measuring first-folio
latency. IIUC, the migration procedure is something like,

    for each folio
        unmap
    flush
    for each folio
        copy
        remap        ===> first folio migrated

A dedicated tracepoint would be better for measuring it.
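
To illustrate the gap between the two measurements, here is a toy Python
model of the loop structure above. The per-step costs are placeholders, not
measured values, and batch_times() is a hypothetical helper; it assumes
every folio has the same unmap, copy and remap cost.

```python
def batch_times(n, unmap_us, flush_us, copy_us, remap_us):
    """Return (first_folio_us, whole_batch_us) for a batch of n folios.

    Mirrors the two loops: unmap every folio, flush once, then copy and
    remap folio by folio. All costs are assumed constant per folio.
    """
    unmap_all = n * unmap_us
    # The first folio is accessible again after its own copy + remap.
    first = unmap_all + flush_us + copy_us + remap_us
    # The whole call also pays copy + remap for the remaining folios.
    whole = unmap_all + flush_us + n * (copy_us + remap_us)
    return first, whole


if __name__ == "__main__":
    # Placeholder costs in microseconds, chosen only to show the shape.
    for n in (1, 8, 64, 512):
        first, whole = batch_times(n, unmap_us=1, flush_us=2,
                                   copy_us=3, remap_us=1)
        print(f"n={n:<4} first={first:>5} us  whole batch={whole:>5} us")
```

In this model the two numbers are equal only for n = 1; for larger batches
the run time of the whole call overstates the first-folio latency by the
copy and remap time of the other folios.
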

> A). Vanilla Kernel:
>
> Here, n = workload size passed to move_pages() in folios. Move n number of folios with move_pages().
> NR_MAX_BATCHED_MIGRATION is upstream default value 512.
>
> --- Order 0 (4K folios) ---
>        n     vanilla/cpu
> (folios)   GB/s | first(us)
> --------------------------
>        1   0.04 |       24
>        4   0.16 |       25
>        8   0.29 |       31
>       16   0.54 |       27
>       64   1.15 |       68
>      256   1.86 |      162
>      512   2.21 |      264
>     2048   2.62 |      208
>     4096   2.74 |      182
>    16384   2.73 |      173
>    65536   3.28 |      166
>   262144   3.20 |      167
>
> --- Order 9 (2M folios) ---
>        n     vanilla/cpu
> (folios)    GB/s | first(us)
> --------------------------
>        1    7.05 |      194
>        4    8.78 |      186
>        8    8.47 |      188
>       16    7.20 |      193
>       64    8.23 |      191
>      256   10.51 |      180
>      512   10.88 |      173
>
> Takeaway:
> In each migrate_pages_batch() call, folios are first unmapped, then try_to_unmap_flush(),
> and only then folios enter move_to_new_folio(). So first-folio latency is bounded by the
> per-batch unmap+flush cost, and then plateaus once workload is large enough.
>
>
> B). Patched kernel:
>
> Here, N = NR_MAX_BATCHED_MIGRATION (in pages). Total migrated data is fixed at 1 GB.

Emm, so NR_MAX_BATCHED_MIGRATION could be very large? I think that it
needs to be bounded. If it is too large, too many pages may be in an
inaccessible state for a longer time. That will hurt the workload
performance, although it is optimal for migration performance.
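
As a rough way to see how fast that window grows, the Python sketch below
fits a straight line to the order-0 offload/dma1 first-folio latencies you
report below, using first-folio latency as a proxy for how long the batch
stays unmapped. fit_line() and max_batch_for() are hypothetical helpers,
and the 10 ms budget is an arbitrary example, not a proposed limit.

```python
# (N in pages, first-folio latency in us): order 0, offload/dma1 column.
POINTS = [(512, 639), (1024, 1261), (2048, 2769), (4096, 5059),
          (8192, 9320), (16384, 18689), (32768, 42692), (65536, 81956),
          (131072, 169563), (262144, 318424)]


def fit_line(points):
    """Ordinary least-squares fit; returns (slope, intercept)."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def max_batch_for(budget_us, slope, intercept):
    """Largest N whose fitted latency stays within budget_us (0 if none)."""
    return max(0, int((budget_us - intercept) / slope))


if __name__ == "__main__":
    slope, intercept = fit_line(POINTS)
    print(f"~{slope:.2f} us per 4K page, intercept {intercept:.0f} us")
    # Example budget only: 10 ms of tolerated first-folio latency.
    print("N for a 10 ms budget:", max_batch_for(10_000, slope, intercept))
```

With these numbers the fitted slope is roughly 1.2 us per 4K page, so the
latency grows about linearly with N and any bound on it translates directly
into a bound on NR_MAX_BATCHED_MIGRATION.
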

> Change N with a knob to measure impact of different max batched size.
>
> --- ORDER 0 (4K folios) ---
>      N      offload/dma1         offload/dma4         offload/dma16
>           GB/s | first(us)     GB/s | first(us)     GB/s | first(us)
> ------------------------------------------------------------------------
>    512    2.13 |       639     3.23 |       290     3.27 |       253
>   1024    2.17 |      1261     3.44 |       582     3.58 |       536
>   2048    2.01 |      2769     3.09 |      1360     3.45 |      1083
>   4096    2.10 |      5059     3.13 |      2737     3.58 |      2115
>   8192    2.21 |      9320     3.17 |      5015     3.75 |      3617
>  16384    2.15 |     18689     3.31 |      9623     3.87 |      6937
>  32768    2.12 |     42692     3.38 |     18893     3.83 |     14255
>  65536    2.09 |     81956     3.38 |     38556     3.64 |     29003
> 131072    2.02 |    169563     3.22 |     81082     3.63 |     62236
> 262144    2.21 |    318424     3.12 |    170174     3.50 |    129413
>
> --- ORDER 9 (2M folios) ---
>      N      offload/dma1         offload/dma4         offload/dma16
>           GB/s | first(us)     GB/s | first(us)     GB/s | first(us)
> -------------------------------------------------------------------------
>    512   11.66 |       160    11.68 |       160    11.65 |       160
>   1024   12.16 |       310    13.67 |       275    13.64 |       276
>   2048   12.30 |       613    25.47 |       290    25.48 |       291
>   4096   12.48 |      1215    26.19 |       566    42.59 |       335
>   8192   12.56 |      2424    26.57 |      1118    58.72 |       470 *
>  16384   12.61 |      4839    26.77 |      2218    61.94 |       896
>  32768   12.60 |      9667    26.98 |      4422    63.75 |      1748
>  65536   12.63 |     19318    26.99 |      8838    60.66 |      3543
> 131072   12.64 |     38935    27.02 |     17935    61.06 |      7178
> 262144   12.66 |     77694    26.85 |     35871    65.06 |     14129
>
> In the batch-copy offload approach, the DMA copy phase is inserted between unmap/flush and move,
> so a larger N increases first-folio wall-clock latency. Throughput improves but with diminishing
> returns.
>
> For DCBM+PTDMA setup, the optimal batch for 2M folios sits around N=8192-16384,
> because a larger batch allows the driver to distribute more folios across available DMA channels.
> This is where we get most throughput while keeping the first folio latency in check.
>
> This optimal batch value is hardware-specific. Other engines (e.g. SDXI) and memory tiers (e.g. CXL)
> will likely have different curves.
>
> Does this approach and experiment look good to you?
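
One way to make the trade-off explicit is to fix a first-folio latency
budget and take the batch size with the best throughput inside it. Below is
a minimal Python sketch using the order-9 offload/dma16 column above; the
budgets are arbitrary examples and best_batch() is a hypothetical helper,
not something from the series.

```python
# (N, GB/s, first-folio latency in us): order 9 (2M), offload/dma16 column.
ROWS = [(512, 11.65, 160), (1024, 13.64, 276), (2048, 25.48, 291),
        (4096, 42.59, 335), (8192, 58.72, 470), (16384, 61.94, 896),
        (32768, 63.75, 1748), (65536, 60.66, 3543), (131072, 61.06, 7178),
        (262144, 65.06, 14129)]


def best_batch(rows, budget_us):
    """Highest-throughput row whose first-folio latency fits the budget."""
    within = [r for r in rows if r[2] <= budget_us]
    return max(within, key=lambda r: r[1]) if within else None


if __name__ == "__main__":
    # Example budgets only; a real limit would depend on the workload.
    for budget in (100, 500, 1000, 20000):
        print(f"budget {budget:>5} us ->", best_batch(ROWS, budget))
```

With a 500 us budget this picks N=8192 and with 1000 us it picks N=16384,
which matches the range you suggested. Raising the budget to 20000 us only
moves throughput from 61.94 to 65.06 GB/s in this data.
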

---
Best Regards,
Huang, Ying