Re: [PATCH v2] dma-buf: Split sgl into page-aligned 2G chunks
From: Pranjal Shrivastava
Date: Tue Jun 23 2026 - 16:57:44 EST
On Tue, Jun 23, 2026 at 09:44:46AM +0100, David Laight wrote:
Hi David,
> On Tue, 23 Jun 2026 01:54:59 +0000
> David Hu <xuehaohu@xxxxxxxxxx> wrote:
>
> > Currently, `fill_sg_entry()` splits the scatterlist using `UINT_MAX`.
> > This creates a non-page-aligned DMA length (`0xFFFFFFFF`) for the
> > first entry, resulting in non-page-aligned DMA addresses for all
> > subsequent entries.
>
> There is a separate issue of whether this code is even needed at all.
> Where can transfers over 2G (never mind 4G) actually come from.
>
> The read, write and similar system calls limit transfers to INT_MAX
> (even on 64bit) and a lot of driver code will need fixing if longer
> lengths are allowed though.
> io_uring better enforce the same limits.
> So the transfers can come directly from userspace.
>
> Not only that but you also need a single physically contiguous buffer.
> Good luck allocating that!
>
> Now maybe there are some peer-to-peer places where the large buffer
> is device memory, but they will be unusual and probably need
> special treatment anyway.
>
I agree that traditional VFS read/write paths are capped at MAX_RW_COUNT
(~2GB), and that io_uring has its own limits, but I'm a little confused
by the push to enforce those limits here in the SGL code.
File I/O seems to be only one side of the picture. In my view, this fix
is necessary and certainly has a use-case:
For example, the RDMA subsystem has the capability to import dmabufs [1],
which gives rise to use cases for dmabuf beyond standard file ops
(via VFS/io_uring).
In these scenarios, GPU HBM can be exported as dmabufs. With recent GPUs,
HBM capacity can be on the order of hundreds of GB [2]. RDMA can employ
infrastructure like the vfio-dmabuf-exporter [3] or similar dmabuf
exporters to frequently move huge blocks of data via P2PDMA.
If we restrict incoming dmabuf transfers to fit within VFS-centric
limits (2GB), we impose unnecessary overhead on the RDMA stack, forcing
it to manage a significantly higher number of memory registrations. By
cleanly splitting these massive contiguous device buffers into
page-aligned SGL entries, we directly improve the efficiency of P2P
transfers and memory registration.
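
To make the alignment point concrete, here is a rough userspace sketch of
the arithmetic. It is not the kernel's `fill_sg_entry()`; the helper name,
the `SKETCH_*` constants and the 4 KiB page size are assumptions made
purely for illustration. It walks one contiguous DMA range with a given
per-entry cap and counts how many entries start off a page boundary:

```c
#include <assert.h>
#include <limits.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE      4096ULL             /* assumption: 4 KiB pages */
#define SKETCH_CAP_UINT_MAX   ((uint64_t)UINT_MAX) /* cap described in the commit message */
#define SKETCH_CAP_2G         (1ULL << 31)        /* page-aligned 2 GiB cap */

/*
 * Hypothetical helper: split the contiguous range [addr, addr + len)
 * into entries of at most max_chunk bytes and return how many entries
 * begin at an address that is not page-aligned.
 */
static unsigned int count_unaligned_starts(uint64_t addr, uint64_t len,
                                           uint64_t max_chunk)
{
    unsigned int unaligned = 0;

    assert(max_chunk > 0); /* a zero cap would never make progress */

    while (len) {
        uint64_t chunk = len < max_chunk ? len : max_chunk;

        /* An entry is misaligned if its start has low page bits set. */
        if (addr & (SKETCH_PAGE_SIZE - 1))
            unaligned++;

        addr += chunk;
        len -= chunk;
    }
    return unaligned;
}
```

For a page-aligned base and a 10 GiB region, the `UINT_MAX` cap leaves
every entry after the first starting off a page boundary, while the
2 GiB cap keeps every entry start aligned.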
Since this change doesn't seem to have a negative impact on standard file
I/O or break existing VFS constraints, I'm curious why we shouldn't
support splitting these >4GB P2P transfers? Am I missing something?
Thanks,
Praan
[1] https://elixir.bootlin.com/linux/v7.1.1/source/drivers/infiniband/core/umem_dmabuf.c#L174
[2] https://nvdam.widen.net/s/fdvdqvfvj2/hopper-h200-nvl-product-brief (Table 2-2)
[3] https://elixir.bootlin.com/linux/v7.1.1/source/drivers/vfio/pci/vfio_pci_dmabuf.c#L297