Re: [PATCH 6/7] drivers/migrate_offload: add DMA batch copy driver (dcbm)

From: Garg, Shivank

Date: Mon Jun 22 2026 - 06:06:30 EST

On 6/19/2026 9:37 PM, Karim Manaouil wrote:
> Hi again Shivank,
>
> I just got some time to resume testing this on Intel Sapphire Rapids and
> something caught my attention; see below.
>
> On Tue, Apr 28, 2026 at 03:50:49PM +0000, Shivank Garg wrote:
>> +static int submit_dma_transfers(struct dma_work *work)
>> +{
>> + struct scatterlist *sg_src, *sg_dst;
>> + struct dma_async_tx_descriptor *tx;
>> + unsigned long flags = DMA_CTRL_ACK;
>> + dma_cookie_t cookie;
>> + int i;
>> +
>> + atomic_set(&work->pending, 1);
>> +
>> + sg_src = work->src_sgt->sgl;
>> + sg_dst = work->dst_sgt->sgl;
>> + for_each_sgtable_dma_sg(work->src_sgt, sg_src, i) {
>> + if (i == work->src_sgt->nents - 1)
>> + flags |= DMA_PREP_INTERRUPT;
>> +
>> + tx = dmaengine_prep_dma_memcpy(work->chan,
>> + sg_dma_address(sg_dst),
>> + sg_dma_address(sg_src),
>> + sg_dma_len(sg_src), flags);
>> + if (!tx) {
>> + atomic_set(&work->pending, 0);
>> + return -EIO;
>> + }
>> +
>> + if (i == work->src_sgt->nents - 1) {
>> + tx->callback = dma_completion_callback;
>> + tx->callback_param = work;
>> + }
>> +
>
> Here, you are submitting the descriptors one after the other and only
> the last descriptor has a callback, which in theory sounds correct as
> you expect the DMA engine to complete the descriptors in the same order
> they were submitted. However, in reality that's not really guaranteed.
>
> Intel DSA in particular can complete descriptors out of order. That
> means, the last descriptor submitted may not necessarily be the last
> descriptor that completes. In that case, you will return in
> folios_copy_dma() before the copy truly completes for all the folios.
>
> For correctness, we have to add a callback to every descriptor and
> initialize work->pending to the number of descriptors submitted. Then,
> every time a descriptor completes, you call atomic_dec(&work->pending)
> and signal the completion only once it reaches zero.
>
> Btw, waiting for an interrupt adds massive scheduling overhead. If we
> also add the logic above, it'll get even worse. In my measurements, this
> can easily add up to 6ms, by which time a CPU page copy would already
> have completed the entire copy, which again adds to the list of latency
> concerns I raised in my other reply.

Thanks, Karim, for catching this.
I was not aware that descriptor chaining was not applicable to DSA.

Going forward, implementing device_prep_dma_memcpy_sg() fixes this
broken assumption: the client will issue a single transaction and see a
single completion for the whole batch. Ordering and correctness then
become the provider's responsibility.
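
To illustrate the per-descriptor accounting described in the quoted text
above, here is a minimal userspace sketch in C11. It is not driver code:
struct sim_work, sim_desc_callback() and the two signalled_after_*()
helpers are hypothetical names made up for this example, and the DMA
engine is replaced by an array giving the order in which descriptors
complete. In the kernel the counterpart of the decrement would presumably
be atomic_dec_and_test() followed by complete().

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical userspace stand-in for the driver's struct dma_work. */
struct sim_work {
	atomic_int pending;	/* descriptors still in flight */
	bool done;		/* stands in for signalling the completion */
};

/* Per-descriptor callback: only the final completion signals the waiter. */
static void sim_desc_callback(void *param)
{
	struct sim_work *w = param;

	/* atomic_fetch_sub() returns the old value; 1 means this was the last. */
	if (atomic_fetch_sub(&w->pending, 1) == 1)
		w->done = true;
}

/*
 * Scheme from the patch: only the last submitted descriptor (index
 * ndesc - 1) carries a callback. order[i] is the index of the descriptor
 * that completes i-th. Returns how many descriptors had completed when
 * the waiter was signalled.
 */
static int signalled_after_last_only(const int *order, int ndesc)
{
	for (int i = 0; i < ndesc; i++)
		if (order[i] == ndesc - 1)
			return i + 1;
	return -1;
}

/*
 * Counting scheme: every descriptor carries the callback and pending
 * starts at ndesc, so the completion order does not affect the result.
 */
static int signalled_after_counting(int ndesc)
{
	struct sim_work w;

	assert(ndesc > 0);
	atomic_init(&w.pending, ndesc);
	w.done = false;

	for (int i = 0; i < ndesc; i++) {
		sim_desc_callback(&w);
		if (w.done)
			return i + 1;
	}
	return -1;
}
```

With four descriptors completing in the order {3, 0, 1, 2}, the last-only
scheme signals after the first completion, while three copies are still
in flight; the counting scheme signals only after the fourth.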

Thanks,
Shivank