Re: [PATCH 5/7] mm/migrate: add copy offload registration infrastructure
From: Zi Yan
Date: Thu Jun 11 2026 - 14:44:45 EST
On 11 Jun 2026, at 5:55, Karim Manaouil wrote:
> On Sun, May 24, 2026 at 07:16:55PM -0700, David Rientjes wrote:
>> On Wed, 20 May 2026, Garg, Shivank wrote:
>>
>>>>> +static bool migrate_offload_do_batch(int reason)
>>>>> +{
>>>>> + if (!static_branch_unlikely(&migrate_offload_enabled))
>>>>> + return false;
>>>>> +
>>>>> + switch (reason) {
>>>>> + case MR_COMPACTION:
>>>>> + case MR_SYSCALL:
>>>>> + case MR_DEMOTION:
>>>>> + case MR_NUMA_MISPLACED:
>>>>> + return true;
>>>>> + default:
>>>>> + return false;
>>>>
>>>>
>>>> What's the exact reason we don't do this for hotunplug etc? IOW, why do we make
>>>> this depend on a reason?
>>>
>>> Reason-based filtering could be a requirement for some users who want only specific
>>> use cases to go through DMA offload.
>>>
>>
>> +1, I think this makes a lot of sense; not all DMA offloads are created
>> equally and we may prefer to unburden them from being contended by
>> migrations that they are not intending to accelerate.
>>
>>> For the RFC, I introduced a placeholder to enable further discussion on which use cases
>>> should allow migration offload and whether offload users actually need this control?
>>>
>>> Your other point also makes sense: "If someone migrates a handful of folios, latency is
>>> likely more important (and batching less beneficial)."
>>> Based on this, we could either fully rely on batch size. I'll think more about this.
>>>
>>
>> There are, or will be, some offloads that must be used for certain types
>> of page migrations, like Confidential Computing. That's for functional
>> reasons, not a heuristic.
>>
>> We want to use certain hardware assists solely for promotion and demotion
>> of memory for tiering. We certainly wouldn't want those hardware assists
>> to be inundated by users doing tons of move_pages(2) on their own or by
>> best effort memory compaction in the kernel.
>
> One of the biggest arguments against using memory-offload accelerators for
> generic page migration is that user-mapped memory migration is dominated
> by rmap (unmapping before migration) and remove_migration_ptes() after
> memory copy. In my experiments with Intel DSA on Sapphire Rapids,
> copying 4KiB pages at a deep queue depth (64/128 pages):
>
> 1. A single instance can achieve up to 30GB/s when counting
> only descriptor submission and polling for completion.
>
> 2. With DMA mapping overhead, which you usually have to do when you
> are behind an IOMMU, it's about 23GB/s in IOMMU passthrough mode,
> and 17GB/s in translated mode. That's still more than double the
> throughput achieved by a single CPU core at around 9GB/s.
>
> 3. The throughput drops significantly when measuring the
> completion of the entire migrate_pages(), to barely 1GB/s.
>
> Unmapping is about 50% of the total time and remove_migration_ptes() is
> about 40/45% in my experiments, which leaves only 5/10% on the table for

Do you have a breakdown of unmapping and remove_migration_ptes()?
Are these 4KB pages shared or private? I can see that a shared page can
cost more during rmap. BTW, I assume your measurements were done
with migrate_pages_batch(), which performs the TLB shootdown once for all
these pages; otherwise the TLB shootdown cost would dominate.

> the actual page data copy. You're not bottlenecked by memory bandwidth
> to justify using a memory-offload accelerator, and the CPU cycles you
> save from offloading are too small to matter.
>
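
To put rough numbers on the argument above, here is a minimal sketch using
Amdahl's law. It assumes the 5-10% copy share was measured with a CPU copy
and that the rest of migrate_pages() is unaffected by offload. The function
names are made up for illustration and are not part of the patch.

```c
#include <assert.h>

/*
 * Amdahl's law: overall speedup of a migration when the page copy, which
 * takes `copy_share` of the total time, is made `copy_speedup` times faster.
 * Everything else (unmap, remove_migration_ptes()) is assumed unchanged.
 */
double migrate_speedup(double copy_share, double copy_speedup)
{
	assert(copy_share >= 0.0 && copy_share <= 1.0);
	assert(copy_speedup > 0.0);
	return 1.0 / ((1.0 - copy_share) + copy_share / copy_speedup);
}

/* Upper bound on the overall speedup: the copy costs nothing at all. */
double migrate_speedup_limit(double copy_share)
{
	assert(copy_share >= 0.0 && copy_share < 1.0);
	return 1.0 / (1.0 - copy_share);
}
```

With a 10% copy share and 17GB/s vs 9GB/s (translated mode vs one CPU core),
migrate_speedup(0.10, 17.0 / 9.0) is about 1.05x, and
migrate_speedup_limit(0.10) is about 1.11x even if the copy were free.
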
> So any discussion on accelerating promotion/demotion using these
> accelerators is useless to me, unless I am missing something.
>
> What kind of use cases do you have in mind? What kind of workloads
> and pages are you thinking of? Especially via migrate_pages(), which is
> mostly used in scenarios involving user-mapped memory (anonymous and
> file pages) and thus heavy rmap involvement.

A paper[1] shows that using DSA can improve memory compaction speed, and
memory compaction should migrate <2MB folios.

[1] https://ieeexplore.ieee.org/document/10841986

>
>> I think the use cases should be configurable by the user if at all
>> possible so we can control what has access to being offloaded. These are
>> often shared system resources and can be contended like any other
>> resource, so configuring which migrations can use them vs not use them
>> seems important.
>
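
One way the user-configurable policy described above could look is a
per-reason bitmask in place of the hard-coded switch. This is only a
userspace sketch: the enum is a mock of a subset of the kernel's
enum migrate_reason, and offload_reason_mask and the helper names are
hypothetical, not something the series provides.

```c
#include <assert.h>
#include <stdbool.h>

/* Mock of a subset of the kernel's enum migrate_reason (illustration only). */
enum mock_migrate_reason {
	MOCK_MR_COMPACTION,
	MOCK_MR_MEMORY_HOTPLUG,
	MOCK_MR_SYSCALL,
	MOCK_MR_NUMA_MISPLACED,
	MOCK_MR_DEMOTION,
	MOCK_MR_TYPES,
};

/* Hypothetical knob: bit N set means migrations with reason N may be offloaded. */
unsigned long offload_reason_mask;

/* Allow offload for one migration reason. */
void offload_allow_reason(enum mock_migrate_reason reason)
{
	assert(reason >= 0 && reason < MOCK_MR_TYPES);
	offload_reason_mask |= 1UL << reason;
}

/* Stop offloading migrations with the given reason. */
void offload_deny_reason(enum mock_migrate_reason reason)
{
	assert(reason >= 0 && reason < MOCK_MR_TYPES);
	offload_reason_mask &= ~(1UL << reason);
}

/* Policy check that would replace the switch on the migration reason. */
bool offload_reason_allowed(enum mock_migrate_reason reason)
{
	assert(reason >= 0 && reason < MOCK_MR_TYPES);
	return offload_reason_mask & (1UL << reason);
}
```

A tiering-only setup would then allow just MOCK_MR_DEMOTION and
MOCK_MR_NUMA_MISPLACED and leave compaction and move_pages(2) on the CPU.
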
> --
> ~karim
Best Regards,
Yan, Zi