Re: [PATCH] dma-buf: Split sgl by largest page-aligned chunk

From: David Hu

Date: Mon Jun 22 2026 - 17:27:18 EST

On Mon, Jun 22, 2026 at 4:13 AM David Laight
<david.laight.linux@xxxxxxxxx> wrote:
>

Hi David,

Thank you for the review; you raised several good points about
optimizing this. I'll switch the max entry size to 2G (`SZ_2G` from
`linux/sizes.h`) and remove the divisions and multiplications. I'll
also replace the `for()` loop with `while (length)`, and drop `min_t()`
in favor of `min()` by casting `SZ_2G` to `size_t`. I'll send out a v2
with these changes shortly.

Thanks,
David

> > Currently, `fill_sg_entry()` splits the scatterlist using `UINT_MAX`.
> > This creates a non-page-aligned DMA length (`0xFFFFFFFF`) for the
> > first entry, resulting in non-page-aligned DMA addresses for all
> > subsequent entries.
>
> How did you find this?
> It requires a single buffer over 4GB - seems highly unlikely.

It was observed during experiments with buffers over 8GB on an accelerator.
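
To make the arithmetic concrete, here is a minimal userspace sketch, not
kernel code. It assumes a 4 KiB page size, and the `demo_*` / `DEMO_*`
names are hypothetical helpers invented for this illustration. It computes
the DMA address of entry `i` as `base + i * chunk`, as the current loop
does, and checks page alignment:

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed page size for the illustration; real systems may differ. */
#define DEMO_PAGE_SIZE 4096ULL

/* Userspace stand-in for ALIGN_DOWN(UINT_MAX, PAGE_SIZE): 0xFFFFF000. */
#define DEMO_ALIGNED_CHUNK (UINT_MAX & ~(DEMO_PAGE_SIZE - 1))

/* DMA address of entry i when a buffer at 'base' is cut into 'chunk'-byte entries. */
static uint64_t demo_entry_addr(uint64_t base, uint64_t chunk, unsigned int i)
{
	assert(chunk != 0);
	return base + (uint64_t)i * chunk;
}

/* True if 'addr' sits on a DEMO_PAGE_SIZE boundary. */
static bool demo_page_aligned(uint64_t addr)
{
	return (addr & (DEMO_PAGE_SIZE - 1)) == 0;
}
```

With a page-aligned base of `0x100000000`, a `UINT_MAX` chunk puts the
second entry at `0x1FFFFFFFF`, which is not page aligned, while the
aligned-down chunk puts it at `0x1FFFFF000`.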

> >
> > While the underlying IOMMU mapping may be contiguous, hardware
> > DMA engines often require explicit address alignment (e.g., page,
> > cacheline, or storage sector boundaries). Passing unaligned
> > addresses and lengths can cause explicit failures in DMA descriptor
> > creation or silent data corruption if lower unaligned bits are
> > truncated.
> >
> > Fix this by splitting the scatterlist by the largest possible page
> > aligned chunk within `UINT_MAX` (`ALIGN_DOWN(UINT_MAX, PAGE_SIZE)`).
> > This ensures all scatterlist DMA addresses and lengths remain page
> > aligned and satisfy hardware constraints.
>
> It would almost certainly be better to split into 2G chunks.
> That removes any need for any divisions.

I agree. 2G naturally aligns with most hardware boundaries, while also
allowing compiler optimizations with simple bit shifts.
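
As a rough userspace illustration of the point about shifts, the sketch
below compares a generic round-up division with the power-of-two case.
`DEMO_SZ_2G` mirrors the value of `SZ_2G`; the `demo_*` helpers and
`DEMO_2G_SHIFT` are hypothetical names used only here, and whether a given
compiler emits a shift depends on the compiler and target:

```c
#include <assert.h>
#include <stdint.h>

/* Same value as SZ_2G, redefined here so the sketch builds in userspace. */
#define DEMO_SZ_2G    0x80000000ULL
#define DEMO_2G_SHIFT 31 /* 2G == 1 << 31 */

/* Generic round-up division, the same arithmetic as DIV_ROUND_UP(). */
static uint64_t demo_div_round_up(uint64_t n, uint64_t d)
{
	assert(d != 0);
	return (n + d - 1) / d;
}

/* Entry count for 2G chunks using only an add and a shift. */
static uint64_t demo_nents_2g(uint64_t length)
{
	return (length + DEMO_SZ_2G - 1) >> DEMO_2G_SHIFT;
}
```

For an 8 GiB buffer this gives four 2G entries, versus three entries when
dividing by `UINT_MAX`, so the simpler arithmetic costs at most a few extra
entries on very large buffers.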

>
> > Page-aligned entries allow the system to cleanly chunk payloads into
> > PCIe MaxPayloadSize (MPS) (e.g., 128 bytes, 256 bytes, 512 bytes).
> > As a result, this may help reduce TLP fragmentation in P2P transfers
> > and alleviate potential congestion within a logical PCIe switch
> > partition, especially when Relaxed Ordering is not possible due to
> > hardware constraints.
> >
> > Reported-by: sashiko-bot <sashiko-bot@xxxxxxxxxx>
> > Closes: https://lore.kernel.org/all/20260609165431.778061F00893@xxxxxxxxxxxxxxx/
> > Fixes: 3aa31a8bb11e ("dma-buf: provide phys_vec to scatter-gather mapping routine")
> > Cc: stable@xxxxxxxxxxxxxxx
> > Signed-off-by: David Hu <xuehaohu@xxxxxxxxxx>
> > ---
> > drivers/dma-buf/dma-buf-mapping.c | 13 ++++++++-----
> > 1 file changed, 8 insertions(+), 5 deletions(-)
> >
> > diff --git a/drivers/dma-buf/dma-buf-mapping.c b/drivers/dma-buf/dma-buf-mapping.c
> > index 794acff2546a..f2bde38fdb1f 100644
> > --- a/drivers/dma-buf/dma-buf-mapping.c
> > +++ b/drivers/dma-buf/dma-buf-mapping.c
> > @@ -5,6 +5,9 @@
> > */
> > #include <linux/dma-buf-mapping.h>
> > #include <linux/dma-resv.h>
> > +#include <linux/align.h>
> > +
> > +#define MAX_ENT_SZ ALIGN_DOWN(UINT_MAX, PAGE_SIZE)
>
> >
> > static struct scatterlist *fill_sg_entry(struct scatterlist *sgl, size_t length,
> > dma_addr_t addr)
> > @@ -12,9 +15,9 @@ static struct scatterlist *fill_sg_entry(struct scatterlist *sgl, size_t length,
> > unsigned int len, nents;
> > int i;
> >
> > - nents = DIV_ROUND_UP(length, UINT_MAX);
> > + nents = DIV_ROUND_UP(length, MAX_ENT_SZ);
> > for (i = 0; i < nents; i++) {
>
> Why not change that to 'while (length) {' to avoid the division above?

Sounds good, will do.

>
> > - len = min_t(size_t, length, UINT_MAX);
> > + len = min_t(size_t, length, MAX_ENT_SZ);
>
> I bet that doesn't need to be min_t()

Agreed.

>
> > length -= len;
> > /*
> > * DMABUF abuses scatterlist to create a scatterlist
> > @@ -24,7 +27,7 @@ static struct scatterlist *fill_sg_entry(struct scatterlist *sgl, size_t length,
> > * does not require the CPU list for mapping or unmapping.
> > */
> > sg_set_page(sgl, NULL, 0, 0);
> > - sg_dma_address(sgl) = addr + (dma_addr_t)i * UINT_MAX;
> > + sg_dma_address(sgl) = addr + (dma_addr_t)i * MAX_ENT_SZ;
> > sg_dma_len(sgl) = len;
>
> Replace the multiply with 'addr += len'.

Will update this as well.
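
Taken together, the suggestions above (a `while (length)` loop, a plain
minimum against 2G, and `addr += len`) could look roughly like the
following userspace sketch. This is not the v2 patch: `struct demo_sg_entry`
and `demo_fill_entries()` are hypothetical stand-ins for the scatterlist and
`fill_sg_entry()`, and the `max_ents` bound is an assumption added so the
sketch is safe to run on its own.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Same value as SZ_2G, redefined here so the sketch builds in userspace. */
#define DEMO_SZ_2G 0x80000000ULL

/* Hypothetical stand-in for the DMA fields of a scatterlist entry. */
struct demo_sg_entry {
	uint64_t dma_address;
	unsigned int dma_len;
};

/*
 * Split [addr, addr + length) into entries of at most 2G each.
 * Returns the number of entries written; 'max_ents' is the array capacity.
 */
static size_t demo_fill_entries(struct demo_sg_entry *ents, size_t max_ents,
				uint64_t addr, uint64_t length)
{
	size_t n = 0;

	while (length) {
		/* Plain minimum; no division or multiplication needed. */
		uint64_t len = length < DEMO_SZ_2G ? length : DEMO_SZ_2G;

		assert(n < max_ents);
		ents[n].dma_address = addr;
		ents[n].dma_len = (unsigned int)len; /* len <= 2G fits in 32 bits */

		addr += len;   /* replaces addr + i * chunk */
		length -= len;
		n++;
	}

	return n;
}
```

For a 5 GiB range starting at `0x100000000`, this yields entries of 2G, 2G
and 1G at `0x100000000`, `0x180000000` and `0x200000000`, so every entry
after a page-aligned start stays page aligned.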

>
> -- David
>
> > sgl = sg_next(sgl);
> > }
> > @@ -41,14 +44,14 @@ static unsigned int calc_sg_nents(struct dma_iova_state *state,
> >
> > if (!state || !dma_use_iova(state)) {
> > for (i = 0; i < nr_ranges; i++)
> > - nents += DIV_ROUND_UP(phys_vec[i].len, UINT_MAX);
> > + nents += DIV_ROUND_UP(phys_vec[i].len, MAX_ENT_SZ);
> > } else {
> > /*
> > * In IOVA case, there is only one SG entry which spans
> > * for whole IOVA address space, but we need to make sure
> > * that it fits sg->length, maybe we need more.
> > */
> > - nents = DIV_ROUND_UP(size, UINT_MAX);
> > + nents = DIV_ROUND_UP(size, MAX_ENT_SZ);
> > }
> >
> > return nents;
>