Re: [PATCH 0/7] Accelerate page migration with batch copying and hardware offload

From: Garg, Shivank

Date: Wed May 20 2026 - 11:41:55 EST

On 5/11/2026 9:23 PM, David Hildenbrand (Arm) wrote:
> On 4/28/26 17:50, Shivank Garg wrote:
>> This is the fifth RFC of the patchset to enhance page migration by
>
> Ah, this is an RFC ...
>
> ... I suggest b4 for patch series management :P
>
> That also explains why patch #7 is still in there.
>

Yes, I have started using it :)

Patch 7 is for testing only. I still need to work out the optimal batch size
for offload, which depends on the HW, or add a callback as per Huang Ying's
suggestion.

>> batching folio-copy operations and enabling acceleration via DMA offload.
>>
>> Single-threaded, folio-by-folio copying bottlenecks page migration in
>> modern systems with deep memory hierarchies, especially for large folios
>> where copy overhead dominates, leaving significant hardware potential
>> untapped.
>>
>> By batching the copy phase, we create an opportunity for hardware
>> acceleration. This series builds the framework and provides a DMA
>> offload driver (dcbm) as a reference implementation, targeting bulk
>> migration workloads where offloading the copy improves throughput
>> and latency while freeing the CPU cycles.
>>
>> See the RFC V3 cover letter [2] for motivation.
>>
>> Changelog since V4:
>> -------------------
>>
>> 1. Renamed PAGE_* migration state flags to FOLIO_*. (David)
>> 2. Use the new folio->migrate_info field instead of folio->private
>> for migration state. (David)
>> 3. Fold folios_mc_copy patch in batch-copy implementation patch. (David)
>> 3. Renamed migrate_offload_start()/stop() to register()/unregister().
>> (Huang, Ying)
>> 4. Dropped should_batch() callback from struct migrator. Reason-based
>> policy now lives in migrate_pages_batch(). Migrators can still skip
>> a batch they don't want (size based policy). (Huang, Ying)
>> 5. CONFIG_MIGRATION_COPY_OFFLOAD is now hidden and selected by the
>> migrator driver. CONFIG_DCBM_DMA is tristate. (Huang Ying, Gregory Price).
>> 6. Wrapped the SRCU + static_call dispatch in a small helper. (Huang, Ying)
>> 7. Require m->owner in migrate_offload_register(), SRCU sync at
>> unregister relies on it. Counters are atomic_long_t to avoid lock-order
>> issue.
>> 9. Moved DCBM sysfs from /sys/kernel/dcbm to /sys/module/dcbm (Huang, Ying)
>> 10. Rebased on v7.1-rc1.
>>
>
> [...]
>
>>
>> OPEN QUESTIONS:
>> ---------------
>>
>> 1. Should the batch path run without a registered migrator? Patches 1-4
>> are self-contained and use folios_mc_copy() (CPU). I have several
>> options like making batch path always-on for eligible folios, or
>> giving admin an option to flip the static branch, or keep the gate.
>> I'm leaning toward always-on.
>
> Hiding that detail from migrate.c sounds interesting.
>

Yes, will do that.

>> 2. Carrying already_copied via folio->migrate_info vs changing the
>> migrate_folio() callback signature (Huang, Ying). I went with the
>> field for now to avoid touching every fs callback before the design
>> settles. Happy to revisit.
>>
>> 3. Per-caller offload selection: Today eligibility is by migrate_reason
>> only. Some are latency-tolerant, others may be not. Is reason the
>> right granularity, or do we want a per-caller hint?
>
> Isn't it sufficient to just do it based on the #folios or sth like that?
>
> If someone migrates a handful of folios, latency is likely more important (and
> batching less beneficial).
>
> I'd assume when migrating many folios, batching could just always be done. Or
> what's the concern?
>

Some users may require that only specific use cases go through DMA offload.

That said, I agree with your point and will discuss it further.

>>
>> 4. Cgroup integration: How should per-cgroup be accounted for different
>> migrators (e.g.: any accounting for DMA-busy time)?
>
> Oh. Do we even have to mess with that?

Probably not for the initial series.
I will drop this question.

>>
>> 5. Tuning migrate_pages callers for offloading. For instance, in
>> compaction COMPACT_CLUSTER_MAX = 32 caps DMA's payoff for compaction
>> (V4 experiment).
>
> Is that HW dependent?
>
>>
>> 6. Where do batch-size thresholds live, and how are they tuned? Per
>> Huang Ying's split, that policy lives in the migrator. DCBM has no
>> threshold today. Open whether it should later be a per-migrator
>> sysfs knob or hard-coded; probably clearer once a second migrator
>> (SDXI, mtcopy) shows the trade-off.
>
> Again, sounds like being HW dependent, no?

Yes, both are HW dependent.
Batch-size gating fits naturally in the migrator.
For something like COMPACT_CLUSTER_MAX, would a callback from compaction
to the registered migrator be the right approach, or do you have something
else in mind?
For the initial series, I don't think I need to touch it.
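
As a rough illustration of size-based gating inside a migrator, here is a
userspace sketch only. The threshold value and all demo_* names are
hypothetical and do not come from the series; a real migrator would pick its
own HW-specific threshold.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, HW-dependent threshold; not a value from the series. */
#define DEMO_MIN_BATCH_FOLIOS 16

enum demo_copy_path { DEMO_COPY_CPU, DEMO_COPY_OFFLOAD };

/*
 * Hypothetical stand-in for a migrator's size-based policy: batches
 * smaller than min_batch stay on the CPU copy path, larger ones are
 * handed to the offload engine.
 */
static enum demo_copy_path demo_pick_copy_path(size_t nr_folios,
					       size_t min_batch)
{
	/* An empty batch is treated as a caller bug in this sketch. */
	assert(nr_folios > 0);

	if (nr_folios < min_batch)
		return DEMO_COPY_CPU;

	return DEMO_COPY_OFFLOAD;
}
```

With a threshold above 32 in such a scheme, a caller capped at
COMPACT_CLUSTER_MAX = 32 folios per call would never reach the offload path,
which is why the two questions are tied together.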

Thanks,
Shivank