Re: [RFC v1] io_uring/rsrc: add fast path huge page handling in buffer registration

From: Swarna Prabhu

Date: Mon Jun 08 2026 - 22:19:27 EST

On Mon, Jun 08, 2026 at 09:57:03AM -0600, Jens Axboe wrote:
> On 6/8/26 12:29 AM, sw.prabhu6@xxxxxxxxx wrote:
> > From: Swarna Prabhu <sw.prabhu6@xxxxxxxxx>
> >
> > io_uring sqe buffer registration path returns pinned user pages in 4k
> > granularity. If the first pinned page is in a hugetlb folio and
> > pages[nr_pages - 1] is also in the same folio then store a single page
> > entry and report *npages = 1 while dropping nr_pages - 1 of the pin
> > references it took earlier.
> >
> > io_uring has support to identify and coalesce multi-hugepage-backed
> > fixed buffers from the function 'io_check_coalesce_buffer()'. However
> > we need to iterate over the entire page array and this patch bypasses
> > the additional checks for this case. The fast path reduces the overall
> > sqe buffer registration time that are backed by huge pages.
> >
> > Measured with fio on bare metal backed by 1024 boot-allocated 2MB hugetlb
> > pages and setting the cpu cores to governor for max performance.
> > (hugepages=1024,hugepage_size=2M):
> > fio --ioengine=io_uring --rw=randwrite --bs=1M --size=2G --iodepth=256
> > --direct=1 --numjobs=5 --fixedbufs=1 --registerfiles=1 --iomem=mmaphuge
> > --hugepage-size=2M.
> >
> > Avg across 3 runs:
> > Metric Upstream(7.1-rc1) Patched Delta
> > Reg time(io_sqe_buffer_register): 3797ns 2970ns -21.8%
> > Total reg for workload: 14.35ms 11.34ms -21.9%
> > fio write bandwidth: 1416MiB/s 1416MiB/s No regression
>
> This looks pretty reasonable. Curious what inspired this change though?
> Workloads that register and unregister huge page backed buffers at
> a rapid pace? The registration path should obviously not be slower than
> it needs to on purpose, but it should also not be part of the application
> fast path in general. I'd expect most users to register their IO memory
> pool upfront and then never really touch it.
>
> Can you expand on the background that led to this?

We started out looking at whether io_uring could get a bandwidth
improvement from hugetlb/THP-backed fixed buffers, i.e. by having the
kernel take better advantage of huge-page backing for the registered IO
memory. This was inspired by an RFC on the VFIO side [1], which
introduces an optimization for pinning pages backed by huge pages to
avoid the latency of pinning at 4k granularity.

io_uring already post-processes the pinned pages in the coalesce
check, so the bandwidth angle didn't pan out. However, we did find
registration-time savings from short-circuiting the page array walk in
'io_check_coalesce_buffer()' when the whole buffer lives in a single
hugetlb folio.
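
To make the short circuit concrete, here is a rough userspace model of
the check. It is only a sketch, not the patch itself: 'struct model_folio',
'struct model_page' and 'model_fast_path_npages()' are made-up names for
illustration and do not exist in the kernel. It assumes, as in the patch
description, that the pinned range is contiguous, so comparing the first
and last page is enough.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace stand-ins, not the kernel's struct page/folio. */
struct model_folio {
	bool hugetlb;		/* true if this folio is a hugetlb folio */
};

struct model_page {
	struct model_folio *folio;	/* folio this 4k page belongs to */
};

/*
 * Model of the fast path: if the first and the last pinned page sit in
 * the same hugetlb folio, the buffer can be described by a single entry,
 * so return 1. Otherwise return nr_pages unchanged, meaning the caller
 * falls back to the full page array walk.
 */
static size_t model_fast_path_npages(const struct model_page *pages,
				     size_t nr_pages)
{
	const struct model_folio *first;

	assert(pages != NULL || nr_pages == 0);

	if (nr_pages < 2)
		return nr_pages;

	first = pages[0].folio;
	if (first && first->hugetlb && pages[nr_pages - 1].folio == first)
		return 1;

	return nr_pages;
}
```

The model only looks at two array slots regardless of nr_pages, which is
where the registration-time saving comes from. In the real patch the
nr_pages - 1 extra pin references are dropped as well, which this model
leaves out.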

We don't have a workload that registers and unregisters huge page backed
buffers at a rapid pace, so this is a one-time registration cost saving
that seemed worth sending for feedback.

[1] https://lore.kernel.org/all/20251223230044.2617028-2-aaronlewis@xxxxxxxxxx/

> > Signed-off-by: Swarna Prabhu <s.prabhu@xxxxxxxxxxx>
>
> This doesn't match your From: in the patch, that would need to be
> corrected.

Noted.

Thank you
Swarna