Re: [PATCH] vfio: Request THP-aligned mmap for device fds
From: Jason Gunthorpe
Date: Tue Jun 30 2026 - 08:56:52 EST
On Mon, Jun 22, 2026 at 04:42:13PM +0100, Lorenzo Stoakes wrote:
> On Fri, Jun 19, 2026 at 02:07:05PM -0300, Jason Gunthorpe wrote:
> > On Fri, Jun 19, 2026 at 05:11:50PM +0100, Matthew Wilcox wrote:
> > > On Thu, Jun 18, 2026 at 12:28:05PM -0300, Jason Gunthorpe wrote:
> > > > On Thu, Jun 18, 2026 at 03:55:58PM +0100, Lorenzo Stoakes wrote:
> > > > > Can't we figure this out from what the driver tells us when it invokes an
> > > > > mmap_prepare action?
> > > >
> > > > VFIO installs the pages via fault handler so there is not a naturally
> > > > existing way to pass in the pfn?
> > >
> > > Is there an advantage to doing it this way? I understand why we (eg)
> > > demand-page pagecache, that's obvious. But I've never really understood
> > > the advantage to taking page faults for PFNMAP areas where we don't
> > > really do anything, just figure out which PFN needs to be installed.
> > > It defers page table allocation, I suppose.
> >
> > VFIO has a model where the mapping can come and go, so it makes the
> > entire VMA SIGBUS from time to time. The only way to do this currently
> > is with faulting.
> >
> > The mm also had races around populating the mmap in the mmap callback
> > and using zap on the inode, faulting avoids those too. Lorenzo may
> > have fixed that with the new interface though
>
> Well, you can't populate the mmap in .mmap_prepare, we do it for you.
>
> I guess the issue there is a race with an rmap walker? I did add a (slightly
> hideous) hack^Woption that keeps things rmap-locked until after the 'mmap
> action' is complete (action->hold_rmap_lock).

Yeah, I think that is partially right. If something wants to use zap
then there must be some kind of locking that guarantees that after zap
there are never any stray PTEs. So if you race zap with mmap(), the
mmap must complete and the PTEs must always be non-present.

Certainly rmap locking is a part of this, but you also need locking to
not populate the VMA in the first place.

   Driver CPU 0                  Zap CPU 1
   ============                  ========
   mmap()
     driver lock
     if (zapping)
         do nothing
     else
         remap pfn
     driver unlock
                                 driver lock
                                 zapping = true
                                 unmap_mapping_range(inode)
                                 driver unlock
                                 <no present PTE may exist in inode>
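
To make the intended invariant concrete, here is a rough userspace model
of the pattern above. It is only a sketch and not kernel code: struct
model_dev, model_dev_init(), model_mmap() and model_zap() are made-up
names, a pthread mutex stands in for the driver lock, and a plain counter
stands in for the present PTEs of the inode. It does not model the rmap
ordering at all, only the driver-side locking.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical userspace stand-in for a driver's per-device state. */
struct model_dev {
	pthread_mutex_t lock;	/* plays the role of "driver lock" */
	bool zapping;		/* set once a zap has started */
	int present_ptes;	/* stand-in for present PTEs in the inode */
};

void model_dev_init(struct model_dev *d)
{
	pthread_mutex_init(&d->lock, NULL);
	d->zapping = false;
	d->present_ptes = 0;
}

/*
 * mmap() side: populate only when no zap has started.
 * Returns true if the mapping was populated, false if it was left empty.
 */
bool model_mmap(struct model_dev *d, int nr_ptes)
{
	bool populated = false;

	assert(nr_ptes > 0);

	pthread_mutex_lock(&d->lock);
	if (!d->zapping) {
		d->present_ptes += nr_ptes;	/* "remap pfn" */
		populated = true;
	}
	pthread_mutex_unlock(&d->lock);

	return populated;
}

/* Zap side: mark the device as zapping and drop everything populated. */
void model_zap(struct model_dev *d)
{
	pthread_mutex_lock(&d->lock);
	d->zapping = true;
	d->present_ptes = 0;	/* "unmap_mapping_range(inode)" */
	pthread_mutex_unlock(&d->lock);
	/* From here on no present PTE may exist, whatever the mmap side does. */
}
```

In this model, whichever side takes the lock first, present_ptes is zero
once model_zap() has returned, and a later model_mmap() leaves the mapping
empty instead of populating it.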

IIRC the current race is that the mm calls the above pattern's mmap()
prior to setting up the rmap, so the remap_pfn succeeds, the
unmap_mapping_range() is a NOP, and we leak mapped VMAs.

But the ideal thing would be to allow the driver to populate after the
rmap is set up, under the driver lock, and optionally, so that an
in-progress zap can be handled.

Jason