Re: [LSF/MM/BPF TOPIC] Enhancements to Page Migration with Multi-threading and Batch Offloading to DMA

From: Shivank Garg
Date: Tue Mar 25 2025 - 01:21:02 EST

Next message: Yuvaraj Ranganathan: "Re: [PATCH 1/2] arm64: dts: qcom: sa8775p: add QCrypto node"
Previous message: Sumit Garg: "Re: [PATCH 2/2] tpm/tpm_ftpm_tee: use send_recv() op"
In reply to: Shivank Garg: "Re: [LSF/MM/BPF TOPIC] Enhancements to Page Migration with Multi-threading and Batch Offloading to DMA"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

On 3/24/2025 11:31 AM, Shivank Garg wrote:
>
>
> On 1/23/2025 11:25 AM, Shivank Garg wrote:
>> Hi all,
>>
>> Zi Yan and I would like to propose the topic: Enhancements to Page
>> Migration with Multi-threading and Batch Offloading to DMA.
>>
>> Page migration is a critical operation in NUMA systems that can incur
>> significant overheads, affecting memory management performance across
>> various workloads. For example, copying folios between DRAM NUMA nodes
>> can take ~25% of the total migration cost for migrating 256MB of data.
>>
>> Modern systems are equipped with powerful DMA engines for bulk data
>> copying, GPUs, and high CPU core counts. Leveraging these hardware
>> capabilities becomes essential for systems where frequent page promotion
>> and demotion occur - from large-scale tiered-memory systems with CXL nodes
>> to CPU-GPU coherent systems with GPU memory exposed as NUMA nodes.
>>
>> Existing page migration performs sequential page copying, underutilizing
>> modern CPU architectures and high-bandwidth memory subsystems.
>>
>> We have proposed and posted RFCs to enhance page migration through three
>> key techniques:
>> 1. Batching migration operations for bulk data copying [1]
>> 2. Multi-threaded folio copying [2]
>> 3. DMA offloading to hardware accelerators [1]
>>
>> By employing batching and multi-threaded folio copying, we are able to
>> achieve significant improvements in page migration throughput for large
>> pages.
>>
>> Discussion points:
>> 1. Performance:
>> a. Policy decision for DMA and CPU selection
>> b. Platform-specific scheduling of folio-copy worker threads for better
>> bandwidth utilization
>> c. Using non-temporal instructions for CPU-based memcpy
>> d. Upscaling/downscaling worker threads based on migration size, CPU
>> availability (system load), bandwidth saturation, etc.
>> 2. Interface requirements with DMA hardware:
>> a. Standardizing APIs for DMA drivers and support for different DMA
>> drivers
>> b. Enhancing DMA drivers for bulk copying (e.g., SDXI Engine)
>> 3. Resource accounting:
>> a. CPU cgroups accounting and fairness [3]
>> b. Migration cost attribution: who bears the migration cost?
>>
>
> Hi all,
>
> For reference, here is the link to the latest RFC v2:
>
> https://lore.kernel.org/linux-mm/20250319192211.10092-1-shivankg@xxxxxxx
>
> This version combines the ideas discussed in [1] and [2] and includes details
> on performance improvements and experimental findings to provide more context
> for discussion.

Sharing the slides from today’s presentation:

Main Slide Deck: https://docs.google.com/presentation/d/1mjl5-jiz-TMVRK9bQcQ_IsSXrIP82CqWS8Q6em3mJi0/edit?usp=sharing
Multi-threading Slide Deck: https://docs.google.com/presentation/d/10czypcUbRMOUn6knp340Cwv4bf83Ha2gUX8TwNXUwCs/edit#slide=id.p6

Thanks,
Shivank

>
>> References:
>> [1] https://lore.kernel.org/all/20240614221525.19170-1-shivankg@xxxxxxx
>> [2] https://lore.kernel.org/all/20250103172419.4148674-1-ziy@xxxxxxxxxx
>> [3] https://lore.kernel.org/all/CAHbLzkpoKP0fVZP5b10wdzAMDLWysDy7oH0qaUssiUXj80R6bw@xxxxxxxxxxxxxx
>
> Looking forward to your feedback!
>
> Thanks,
> Shivank
>
