Re: [PATCH 5/7] mm/migrate: add copy offload registration infrastructure

From: Karim Manaouil

Date: Thu Jun 11 2026 - 06:00:10 EST

On Sun, May 24, 2026 at 07:16:55PM -0700, David Rientjes wrote:
> On Wed, 20 May 2026, Garg, Shivank wrote:
>
> > >> +static bool migrate_offload_do_batch(int reason)
> > >> +{
> > >> + if (!static_branch_unlikely(&migrate_offload_enabled))
> > >> + return false;
> > >> +
> > >> + switch (reason) {
> > >> + case MR_COMPACTION:
> > >> + case MR_SYSCALL:
> > >> + case MR_DEMOTION:
> > >> + case MR_NUMA_MISPLACED:
> > >> + return true;
> > >> + default:
> > >> + return false;
> > >
> > >
> > > What's the exact reason we don't do this for hotunplug etc? IOW, why do we make
> > > this depend on a reason?
> >
> > Reason-based filtering could be a requirement for some users who want only specific
> > use cases to go through DMA offload.
> >
>
> +1, I think this makes a lot of sense; not all DMA offloads are created
> equally and we may prefer to unburden them from being contended by
> migrations that they are not intending to accelerate.
>
> > For the RFC, I introduced a placeholder to enable further discussion on which use cases
> > should allow migration offload and whether offload users actually need this control?
> >
> > Your other point also makes sense: "If someone migrates a handful of folios, latency is
> > likely more important (and batching less beneficial)."
> > Based on this, we could either fully rely on batch size. I'll think more about this.
> >
>
> There are, or will be, some offloads that must be used for certain types
> of page migrations, like Confidential Computing. That's for functional
> reasons, not a heuristic.
>
> We want to use certain hardware assists solely for promotion and demotion
> of memory for tiering. We certainly wouldn't want those hardware assists
> to be inundated by users doing tons of move_pages(2) on their own or by
> best effort memory compaction in the kernel.

One of the biggest arguments against using memory-offload accelerators for
generic page migration is that user-mapped memory migration is dominated
by rmap (unmapping before migration) and remove_migration_ptes() after
the memory copy. In my experiments with Intel DSA on Sapphire Rapids,
copying 4KiB pages at a deep queue depth (64/128 pages), I see the
following:

1. A single instance can achieve up to 30GB/s when counting
only descriptor submission and polling for completion.

2. With DMA mapping overhead, which you usually have to do when you
are behind an IOMMU, it's about 23GB/s in IOMMU passthrough mode,
and 17GB/s in translated mode. That's still more than double the
throughput achieved by a single CPU core at around 9GB/s.

3. Throughput drops significantly, to barely 1GB/s, when measuring the
completion of the entire migrate_pages() call.

Unmapping is about 50% of the total time and remove_migration_ptes() is
about 40-45% in my experiments, which leaves only 5-10% on the table for
the actual page data copy. You're not bottlenecked by memory bandwidth
enough to justify using a memory-offload accelerator, and the CPU cycles
you save by offloading are too few to matter.

So any discussion on accelerating promotion/demotion using these
accelerators is useless to me, unless I am missing something.

Like what kind of use cases do you have in mind? What kind of workloads
and pages are you thinking of? Especially via migrate_pages(), which is
largely involved in scenarios involving user-mapped memory (anonymous and
file pages) and thus heavy rmap involvement.

> I think the use cases should be configurable by the user if at all
> possible so we can control what has access to being offloaded. These are
> often shared system resources and can be contended like any other
> resource, so configuring which migrations can use them vs not use them
> seems important.

--
~karim