[no subject]
From: Shivank Garg
Date: Wed Jun 10 2026 - 08:34:50 EST
...
> I'm still testing, but the initial implementation I wrote with
> DMAEngine had too much overhead because of the sgtable allocations
> and the conversion from kernel scatterlists to device descriptors.
> So I entirely bypassed the DMAEngine API by directly passing the folio
> lists to the driver.
>
> I know it depends on the use case. If you just want to offload with no
> latency requirements, then DMAEngine is fine, but if the goal is to
> achieve high bandwidth with minimal latency, then it's a problem.
>
> Another example, if you have to do several independent copies of 256 or
> 512 4KiB pages in a short period of time, there will be too much stress on
> sgtable allocations.
>
> Another problem for low latency is DMA mapping.
>
> Anyway, I need to collect more numbers. I will try to share my insights
> with idxd asap.
Thanks, looking forward to those insights and numbers.
An IDXD-specific implementation is fine for experimentation, but for the
upstream path I think it would be hard to maintain and would duplicate
logic. The cleanest approach is the DMA_MEMCPY_SG API, so that a single
offload driver can drive any engine that implements it.
dmaengine_prep_dma_memcpy_sg() submits a whole src/dst scatterlist as one
transaction, which cuts the per-descriptor setup overhead that dominates
for 4KiB pages.

I've added a patch for dmaengine_prep_dma_memcpy_sg(). Could you look into
wiring up the device_prep_dma_memcpy_sg hook in IDXD?

That would keep the offload path generic and address the bandwidth/latency
problem for small transfers.
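
To make the amortization argument concrete, here is a rough userspace
sketch. It is not kernel code: struct model_seg, struct model_stats,
model_memcpy_sg() and model_copy_per_page() are hypothetical names made up
for illustration, and the model only counts how many times the fixed
per-transaction cost (prep, submit, completion) is paid. It says nothing
about the real cost of sgtable allocation or DMA mapping.

```c
/* Userspace toy model only; none of these names are kernel APIs. */
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MODEL_PAGE_SIZE 4096u

/* Hypothetical stand-in for one scatterlist entry. */
struct model_seg {
	void *addr;
	size_t len;
};

/* Tracks how often the fixed per-transaction cost is paid. */
struct model_stats {
	unsigned int transactions;	/* prep + submit + completion */
	unsigned int segments;		/* segments actually copied */
};

/*
 * One "transaction": validate the whole src/dst list, copy every
 * segment pair, and account for the fixed cost exactly once.
 */
static int model_memcpy_sg(const struct model_seg *dst,
			   const struct model_seg *src,
			   size_t nents, struct model_stats *st)
{
	size_t i;

	assert(dst && src && st);

	/* The model requires src and dst segments to match in size. */
	for (i = 0; i < nents; i++)
		if (dst[i].len != src[i].len)
			return -1;

	for (i = 0; i < nents; i++) {
		memcpy(dst[i].addr, src[i].addr, src[i].len);
		st->segments++;
	}
	st->transactions++;
	return 0;
}

/* Per-page path: every segment becomes its own transaction. */
static int model_copy_per_page(const struct model_seg *dst,
			       const struct model_seg *src,
			       size_t nents, struct model_stats *st)
{
	size_t i;

	for (i = 0; i < nents; i++)
		if (model_memcpy_sg(&dst[i], &src[i], 1, st))
			return -1;
	return 0;
}
```

In this model, copying 256 pages through model_copy_per_page() pays the
fixed cost 256 times, while a single model_memcpy_sg() call over the same
lists pays it once. Whether the real saving is comparable depends on the
engine and on the mapping overhead mentioned above.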
Best Regards,
Shivank
---