Re: [RFC V3 0/9] Accelerate page migration with batch copying and hardware offload

From: Huang, Ying

Date: Tue Sep 23 2025 - 21:49:52 EST

Hi, Shivank,

Thanks for working on this!

Shivank Garg <shivankg@xxxxxxx> writes:

> This is the third RFC of the patchset to enhance page migration by batching
> folio-copy operations and enabling acceleration via multi-threaded CPU or
> DMA offload.
>
> Single-threaded, folio-by-folio copying bottlenecks page migration
> in modern systems with deep memory hierarchies, especially for large
> folios where copy overhead dominates, leaving significant hardware
> potential untapped.
>
> By batching the copy phase, we create an opportunity for significant
> hardware acceleration. This series builds a framework for this acceleration
> and provides two initial offload driver implementations: one using multiple
> CPU threads (mtcopy) and another leveraging the DMAEngine subsystem (dcbm).
>
> This version incorporates significant feedback to improve correctness,
> robustness, and the efficiency of the DMA offload path.
>
> Changelog since V2:
>
> 1. DMA Engine Rewrite:
> - Switched from per-folio dma_map_page() to batch dma_map_sgtable()
> - Single completion interrupt per batch (reduced overhead)
> - Order of magnitude improvement in setup time for large batches
> 2. Code cleanups and refactoring
> 3. Rebased on latest mainline (6.17-rc6+)
>
> MOTIVATION:
> -----------
>
> Current Migration Flow:
>   [ move_pages(), Compaction, Tiering, etc. ]
>       |
>       v
>   [ migrate_pages() ]          // Common entry point
>       |
>       v
>   [ migrate_pages_batch() ]    // NR_MAX_BATCHED_MIGRATION (512) folios at a time
>       |
>       |--> [ migrate_folio_unmap() ]
>       |
>       |--> [ try_to_unmap_flush() ]    // Perform a single, batched TLB flush
>       |
>       |--> [ migrate_folios_move() ]   // Bottleneck: Interleaved copy
>              - For each folio:
>                  - Metadata prep: Copy flags, mappings, etc.
>                  - folio_copy() <-- Single-threaded, serial data copy.
>                  - Update PTEs & finalize for that single folio.
>
> Understanding overheads in page migration (move_pages() syscall):
>
> Total move_pages() overheads = folio_copy() + Other overheads
> 1. folio_copy() is the core copy operation that interests us.
> 2. The remaining operations are user/kernel transitions, page table walks,
> locking, folio unmap, dst folio alloc, TLB flush, copying flags, updating
> mappings and PTEs etc. that contribute to the remaining overheads.
>
> Percentage of folio_copy() overheads in move_pages(N pages) syscall time:
> Number of pages being migrated and folio size:
>              4KB     2MB
> 1 page       <1%     ~66%
> 512 pages    ~35%    ~97%
>
> Based on Amdahl's Law, optimizing folio_copy() for large pages offers a
> substantial performance opportunity.
>
> move_pages() syscall speedup = 1 / ((1 - F) + (F / S))
> Where F is the fraction of time spent in folio_copy() and S is the speedup of
> folio_copy().
>
> For 4KB folios, folio copy overhead is too small in single-page migrations
> to affect the overall speedup. Even for 512 pages, the maximum theoretical
> speedup is limited to ~1.54x with infinite folio_copy() speedup.
>
> For 2MB THPs, folio copy overheads are significant even in single page
> migrations, with a theoretical speedup of ~3x with infinite folio_copy()
> speedup and up to ~33x for 512 pages.
>
> A realistic value of S (speedup of folio_copy()) is 7.5x for DMA offload,
> based on my measurements for copying 512 2MB pages.
> This gives move_pages() a practical speedup of 6.3x for 512 2MB pages (also
> observed in the experiments below).
>
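
The arithmetic quoted above can be checked with a few lines of userspace C.
This is only a sketch of the Amdahl's Law formula from the cover letter; the
function names are hypothetical and not part of the patchset.

```c
#include <assert.h>

/*
 * Amdahl's Law: overall speedup when a fraction f of the runtime
 * (here: time spent in folio_copy()) is accelerated by a factor s.
 */
double amdahl_speedup(double f, double s)
{
	assert(f >= 0.0 && f <= 1.0);
	assert(s > 0.0);
	return 1.0 / ((1.0 - f) + f / s);
}

/*
 * Upper bound as s grows without limit: the copy cost vanishes and
 * only the non-copy overhead (1 - f) remains.
 */
double amdahl_limit(double f)
{
	assert(f >= 0.0 && f < 1.0);
	return 1.0 / (1.0 - f);
}
```

With the fractions from the table, amdahl_limit(0.35) is about 1.54,
amdahl_limit(0.66) about 2.9, amdahl_limit(0.97) about 33, and
amdahl_speedup(0.97, 7.5) about 6.3, which matches the numbers quoted.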
> DESIGN: A Pluggable Migrator Framework
> ---------------------------------------
>
> Introduce migrate_folios_batch_move():
>
>   [ migrate_pages_batch() ]
>       |
>       |--> migrate_folio_unmap()
>       |
>       |--> try_to_unmap_flush()
>       |
>       +--> [ migrate_folios_batch_move() ]    // new batched design
>              |
>              |--> Metadata migration
>              |      - Metadata prep: Copy flags, mappings, etc.
>              |      - Use MIGRATE_NO_COPY to skip the actual data copy.
>              |
>              |--> Batch copy folio data
>              |      - Migrator is configurable at runtime via sysfs.
>              |
>              |      static_call(_folios_copy)    // Pluggable migrators
>              |         /        |        \
>              |        v         v         v
>              |  [ Default ] [ MT CPU copy ] [ DMA Offload ]
>              |
>              +--> Update PTEs to point to dst folios and complete migration.
>

I am just joining the discussion, so this may have been discussed already.
Sorry if so. Why not

migrate_folios_unmap()
try_to_unmap_flush()
copy folios in parallel if possible
migrate_folios_move(): with MIGRATE_NO_COPY?
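
To make the question concrete, here is a rough userspace model of that
ordering. It is only a sketch: struct fake_folio, struct batch_stats and every
function name below are hypothetical stand-ins, not kernel code, and the copy
step is a placeholder where a parallel or DMA copy could be plugged in.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-folio state bits; NOT the kernel's data structures. */
enum { UNMAPPED = 1, COPIED = 2, MOVED = 4 };

struct fake_folio {
	unsigned int state;		/* bitmask of the flags above */
};

struct batch_stats {
	unsigned int tlb_flushes;	/* batched flushes issued */
	unsigned int copy_calls;	/* bulk copy invocations */
};

typedef void (*copy_fn)(struct fake_folio *folios, size_t n);

/* Placeholder copy backend: one call handles the whole batch. */
void copy_serial(struct fake_folio *folios, size_t n)
{
	for (size_t i = 0; i < n; i++)
		folios[i].state |= COPIED;
}

void migrate_batch_sketch(struct fake_folio *folios, size_t n,
			  copy_fn copy, struct batch_stats *st)
{
	/* 1. Unmap every folio in the batch. */
	for (size_t i = 0; i < n; i++)
		folios[i].state |= UNMAPPED;

	/* 2. One TLB flush for the whole batch. */
	st->tlb_flushes++;

	/* 3. Copy all folio data in one step (parallel if possible). */
	copy(folios, n);
	st->copy_calls++;

	/* 4. Per-folio move with the data copy skipped (the NO_COPY idea). */
	for (size_t i = 0; i < n; i++) {
		assert(folios[i].state & COPIED);
		folios[i].state |= MOVED;
	}
}
```

The point of the model is only that the existing unmap and move steps stay
where they are and the bulk copy sits between them as a separate phase.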

> User Control of Migrator:
>
> # echo 1 > /sys/kernel/dcbm/offloading
>     |
>     +--> Driver's sysfs handler
>            |
>            +--> calls start_offloading(&cpu_migrator)
>                   |
>                   +--> calls offc_update_migrator()
>                          |
>                          +--> static_call_update(_folios_copy, mig->migrate_offc)
>
> Later, during migration ...
> migrate_folios_batch_move()
>     |
>     +--> static_call(_folios_copy) // Now dispatches to the selected migrator
>            |
>            +-> [ mtcopy | dcbm | kernel_default ]
>
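
For readers less familiar with static calls, the dispatch described above
behaves roughly like a retargetable function pointer. The sketch below is a
userspace approximation under that assumption: the kernel's static_call()
patches the call site instead of doing an indirect call, and all names here
(sketch_update_migrator(), sketch_batch_copy(), the COPY_* IDs) are
hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical IDs so a caller can see which backend ran. */
enum { COPY_DEFAULT = 0, COPY_MT = 1, COPY_DMA = 2 };

typedef int (*folios_copy_fn)(size_t nr_folios);

int default_copy(size_t nr_folios) { (void)nr_folios; return COPY_DEFAULT; }
int mt_copy(size_t nr_folios)      { (void)nr_folios; return COPY_MT; }
int dma_copy(size_t nr_folios)     { (void)nr_folios; return COPY_DMA; }

/* Stand-in for the static call target; starts at the default backend. */
static folios_copy_fn folios_copy = default_copy;

/* Roughly what static_call_update() achieves: retarget future calls. */
void sketch_update_migrator(folios_copy_fn fn)
{
	assert(fn != NULL);
	folios_copy = fn;
}

/* The call site never names a backend; it uses the current target. */
int sketch_batch_copy(size_t nr_folios)
{
	return folios_copy(nr_folios);
}
```

Calling sketch_update_migrator(dma_copy) plays the role of the sysfs write:
later sketch_batch_copy() calls reach the DMA backend without the caller
changing.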

[snip]

---
Best Regards,
Huang, Ying