Re: [RFC PATCH 0/5] vfio/pci: Support ZONE_DEVICE-backed P2P Registration

From: Pranjal Shrivastava

Date: Fri Jun 12 2026 - 10:50:42 EST


On Thu, Jun 11, 2026 at 07:14:47PM -0300, Jason Gunthorpe wrote:
> On Thu, Jun 11, 2026 at 02:40:17PM +0000, Pranjal Shrivastava wrote:
> > On Wed, Jun 10, 2026 at 01:28:48PM -0300, Jason Gunthorpe wrote:
> > > On Wed, Jun 10, 2026 at 03:18:48PM +0000, Pranjal Shrivastava wrote:
> > >
> > > > Users utilize the standard sysfs p2pmem/allocate interface for managing
> > > > memory slices once a BAR is registered.
> > >
> > > I'm shocked someone wants to use API, what are you expecting to do
> > > with it??
> >
> > Our primary use-case is PCIe BAR (DDR / HBM) -> NFS via P2PDMA while the
> > PCIe device is managed by a user-space driver based on vfio-pci. While
> > kernel drivers (e.g.drm) can register BARs with ZONE_DEVICE natively to
> > enable this, VFIO currently lacks an equivalent mechanism.
>
> I mean the weird sysfs mmap API. It is only useful if the device is
> basically pure memory with no functionality. You can't even learn what
> MMIO offset the returned allocation gives so it is almost completely
> useless.
>
> nvme could use it because CMB is pure memory and you reference it by
> its MMIO address, but that doesn't apply to VFIO..
>

Ack, I agree: the sysfs allocation interface gives no offset-level
control. I'll pivot entirely to the DMABUF approach.

> > > > An alternative implementation has been explored which integrates with the
> > > > ongoing VFIO DMABUF-mmap refactor [1]. In that approach, rather than
> > > > registering a BAR as a system-wide P2P provider, VFIO optionally
> > > > allocates ZONE_DEVICE pages only for specifically exported DMABUFs via a
> > > > new VFIO_DMA_BUF_FLAG_ALLOC_STRUCT_PAGES flag.
> > >
> > > That's probably more sensible but you can't have a DMABUF mmap
> > > actually install non-special memory. The native vfio mmap still can,
> > > but not mmap on the dmabuf fd. That's still workable, just keep in
> > > mind.
> >
> > Ack. I guess, we could have a separate mmap path in case of BARs that are
> > struct page backed which doesn't go through the dmabuf exporter.
>
> The dmabuf export is perfectly fine, you just have to think very
> carefully about the mmap path.
>
> I suppose if you build the proper revocation fence for zone device
> pages as part of the vfio implementation it would be OK for dmabuf
> mmap to expose them as well since it would have the right lifecycle
> model.
>

Ack, I'll move forward with adding a flag to request a ZONE_DEVICE-backed
DMABUF export (the 'Alternative Approach' mentioned in the cover letter).

And yes, I agree we need to ensure the mmap path is handled carefully
with the correct lifecycle in mind.

> That's the tricky thing with zone_device, you have to be careful to
> wait for all the page references to be put back at all the right
> times.

Yeah, that's going to be tricky. I'm wondering whether we could use a
zap model there: if the device is gone or going through a reset, we
handle the refcounts accordingly?

>
> Come to think of it, since the sysfs API cannot do that in the way
> VFIO wants I actually think you can't use it..

Ack. Baking this into the VFIO DMABUF allows us to enforce the right
lifecycle.

My plan for RFC v2 is to add a flag like VFIO_DMA_BUF_FLAG_ZONE_DEVICE
to struct vfio_device_feature_dma_buf, which lets the caller opt in to
ZONE_DEVICE backing specifically for that export.
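
To make the opt-in concrete, here is a rough userspace-side sketch of
how a caller might fill in the request. Everything in it is an
assumption for illustration: the sketch_* structs are local stand-ins
for struct vfio_device_feature_dma_buf and its range array (the layout
is assumed from the VFIO DMABUF work, not copied from a header), the
bit value chosen for the flag is arbitrary, and both helpers are
hypothetical. It only shows the flag being set per export and unknown
flag bits being rejected.

```c
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

/* Local stand-ins so the sketch builds without kernel headers.
 * Assumed layout; the real uAPI structs may differ. */
struct sketch_dma_range {
	uint64_t offset;	/* byte offset into the BAR */
	uint64_t length;	/* length of the slice in bytes */
};

struct sketch_feature_dma_buf {
	uint32_t region_index;	/* which BAR/region to export */
	uint32_t open_flags;	/* flags for the returned dmabuf fd */
	uint32_t flags;		/* per-export behaviour flags */
	uint32_t nr_ranges;	/* number of entries in dma_ranges[] */
	struct sketch_dma_range dma_ranges[];
};

/* Hypothetical flag; the bit position is not decided anywhere. */
#define SKETCH_DMA_BUF_FLAG_ZONE_DEVICE (1u << 0)

/* Build a single-range export request. The caller frees the result. */
struct sketch_feature_dma_buf *
sketch_build_request(uint32_t region_index, uint32_t open_flags,
		     uint64_t offset, uint64_t length,
		     int want_zone_device)
{
	struct sketch_feature_dma_buf *req;

	req = calloc(1, sizeof(*req) + sizeof(req->dma_ranges[0]));
	if (!req)
		return NULL;

	req->region_index = region_index;
	req->open_flags = open_flags;
	req->nr_ranges = 1;
	req->dma_ranges[0].offset = offset;
	req->dma_ranges[0].length = length;

	/* Opt in to struct-page backing for this export only. */
	if (want_zone_device)
		req->flags |= SKETCH_DMA_BUF_FLAG_ZONE_DEVICE;

	return req;
}

/* Model of the check a receiver would do: reject unknown flag bits
 * so that new flags can be added later without ambiguity. */
int sketch_validate_flags(uint32_t flags)
{
	if (flags & ~SKETCH_DMA_BUF_FLAG_ZONE_DEVICE)
		return -EINVAL;
	return 0;
}
```

Existing callers that leave flags at zero would keep today's behaviour,
which is the point of making this opt-in per export.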

Does this opt-in flag sound like a reasonable uAPI or do you see any
concerns with this direction?

Otherwise, as you noted, the lifecycle and the mmap path remain the main
problems to solve.

Thanks,
Praan