Re: [PATCH] vfio/pci: take mmap write lock for io_remap_pfn_range
From: Alex Williamson
Date: Wed May 22 2024 - 15:43:18 EST
On Wed, 22 May 2024 15:30:46 -0300
Jason Gunthorpe <jgg@xxxxxxxxxx> wrote:
> On Wed, May 22, 2024 at 11:50:06AM -0600, Alex Williamson wrote:
> > I'm not sure if there are any outstanding blockers on Peter's side, but
> > this seems like a good route from the vfio side. If we're seeing this
> > now without lockdep, we might need to bite the bullet and take the hit
> > with vmf_insert_pfn() while the pmd/pud paths learn about pfnmaps.
>
> There is another alternative...
>
> Ideally we wouldn't use the fault handler.
>
> Instead when the MMIO becomes available again we'd iterate over all
> the VMA's and do remap_pfn_range(). When the MMIO becomes unavailable
> we do unmap_mapping_range() and remove it. This whole thing is
> synchronous and the fault handler should simply trigger SIGBUS if
> userspace races things.
>
> unmap_mapping_range() is easy, but the remap side doesn't have a
> helper today..
>
> Something grotesque like this perhaps?
>
> while (1) {
> 	struct mm_struct *cur_mm = NULL;
>
> 	i_mmap_lock_read(mapping);
> 	vma_interval_tree_foreach(vma, &mapping->i_mmap, 0, ULONG_MAX) {
> 		if (vma_populated(vma))
> 			continue;
>
> 		cur_mm = vma->vm_mm;
> 		mmgrab(cur_mm);
> 		break;
> 	}
> 	i_mmap_unlock_read(mapping);
>
> 	if (!cur_mm)
> 		return;
>
> 	mmap_write_lock(cur_mm);
> 	i_mmap_lock_read(mapping);
> 	vma_interval_tree_foreach(vma, &mapping->i_mmap, 0, ULONG_MAX) {
> 		if (vma->vm_mm == cur_mm)
> 			vfio_remap_vma(vma);
> 	}
> 	i_mmap_unlock_read(mapping);
> 	mmap_write_unlock(cur_mm);
> 	mmdrop(cur_mm);
> }
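(I read vfio_remap_vma() there as little more than an io_remap_pfn_range()
over the whole VMA once we know the BAR is live, ie. roughly the below,
with the pfn lookup waved away behind a made-up helper.)

static int vfio_remap_vma(struct vm_area_struct *vma)
{
	/*
	 * my_vma_to_pfn() is a stand-in for however the driver turns
	 * this VMA back into the pfn of the BAR page backing vm_start.
	 */
	unsigned long pfn = my_vma_to_pfn(vma);

	return io_remap_pfn_range(vma, vma->vm_start, pfn,
				  vma->vm_end - vma->vm_start,
				  vma->vm_page_prot);
}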
Yes, I've played with something similar; it's a pretty ugly and painful
path versus lazily faulting.
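For contrast, the lazy scheme more or less collapses to the below
(hand-waving sketch with made-up names rather than the actual vfio-pci
code, and ignoring the pmd/pud huge_fault variants):

struct my_vfio_device {
	struct rw_semaphore memory_lock;	/* held for write while MMIO is disabled */
	unsigned long base_pfn;			/* pfn backing vma->vm_start */
	bool memory_enabled;
};

static vm_fault_t my_mmap_fault(struct vm_fault *vmf)
{
	struct vm_area_struct *vma = vmf->vma;
	struct my_vfio_device *mydev = vma->vm_private_data;
	unsigned long pgoff = (vmf->address - vma->vm_start) >> PAGE_SHIFT;
	vm_fault_t ret;

	down_read(&mydev->memory_lock);
	if (!mydev->memory_enabled)
		ret = VM_FAULT_SIGBUS;	/* userspace raced the MMIO disable */
	else
		ret = vmf_insert_pfn(vma, vmf->address, mydev->base_pfn + pgoff);
	up_read(&mydev->memory_lock);

	return ret;
}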
> I'm pretty sure we need to hold the mmap_write lock to do the
> remap_pfn..
>
> vma_populated() would have to do an RCU read of the page table to check
> if page 0 is present.
>
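Right, something like the below I'd guess (untested sketch, made-up name,
only looks at the pte level for page 0 and punts if a huge mapping is
already in place):

static bool vma_populated(struct vm_area_struct *vma)
{
	unsigned long addr = vma->vm_start;
	pgd_t *pgd;
	p4d_t *p4d;
	pud_t *pud;
	pmd_t *pmd;
	pte_t *pte;
	bool present;

	pgd = pgd_offset(vma->vm_mm, addr);
	if (pgd_none(*pgd))
		return false;
	p4d = p4d_offset(pgd, addr);
	if (p4d_none(*p4d))
		return false;
	pud = pud_offset(p4d, addr);
	if (pud_none(*pud))
		return false;
	if (pud_leaf(*pud))
		return true;
	pmd = pmd_offset(pud, addr);
	if (pmd_none(*pmd))
		return false;
	if (pmd_leaf(*pmd))
		return true;

	/* pte_offset_map() takes the RCU read lock for us these days */
	pte = pte_offset_map(pmd, addr);
	if (!pte)
		return false;
	present = pte_present(ptep_get(pte));
	pte_unmap(pte);
	return present;
}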
> Also there is a race in mmap if you call remap_pfn_range() from the
> mmap fops and also use unmap_mapping_range(). mmap_region() does
> call_mmap() before it does vma_link_file() which gives a window where
> the VMA is populated but invisible to unmap_mapping_range(). We'd need
> to add another fop to call after vma_link_file() to populate the mmap
> or rely exclusively on the fault handler to populate.
Thanks, the branch I linked added back remap_pfn_range() on mmap,
conditional on locking/enabled, but I've predominantly been testing
with that disabled to exercise the pmd/pud faults. I would have missed
the gap vs unmap_mapping_range().
It's hard to beat the faulting mechanism from a simplicity standpoint,
so long as the mm folks don't balk at pfnmap faults. I think the vma
address space walking ends up needing tracking flags to avoid redundant
mappings, which gets complicated and adds to the ugliness. The mm
handles all of that if we only use faulting, and we get a simple test
on fault to either proceed or trigger a SIGBUS. Using pmd/pud faults
should have some efficiency benefits as well. Thanks,
Alex