Re: [RFC PATCH v2 04/51] KVM: guest_memfd: Introduce KVM_GMEM_CONVERT_SHARED/PRIVATE ioctls

From: Vishal Annapurve
Date: Fri Jun 27 2025 - 11:18:03 EST


On Thu, Jun 26, 2025 at 9:50 PM Alexey Kardashevskiy <aik@xxxxxxx> wrote:
>
>
>
> On 25/6/25 00:10, Vishal Annapurve wrote:
> > On Tue, Jun 24, 2025 at 6:08 AM Jason Gunthorpe <jgg@xxxxxxxx> wrote:
> >>
> >> On Tue, Jun 24, 2025 at 06:23:54PM +1000, Alexey Kardashevskiy wrote:
> >>
> >>> Now, I am rebasing my RFC on top of this patchset and it fails in
> >>> kvm_gmem_has_safe_refcount() as IOMMU holds references to all these
> >>> folios in my RFC.
> >>>
> >>> So what is the expected sequence here? The userspace unmaps a DMA
> >>> page and maps it back right away, all from userspace? The end
> >>> result will be exactly the same, which seems useless. And IOMMU TLB
> >
> > As Jason described, ideally the IOMMU, just like KVM, should just:
> > 1) Directly rely on guest_memfd for pinning -> no page refcounts taken
> > by IOMMU stack
> > 2) Directly query pfns from guest_memfd for both shared/private ranges
> > 3) Implement an invalidation callback that guest_memfd can invoke on
> > conversions.

Conversions and truncations both.
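
Very roughly, I am imagining a single pair of begin/end callbacks that
guest_memfd invokes for both cases, with a reason so the backend can
tell them apart. Purely a sketch, none of these names exist today:

enum gmem_invalidate_reason {
	GMEM_INVALIDATE_CONVERT_TO_PRIVATE,
	GMEM_INVALIDATE_CONVERT_TO_SHARED,
	GMEM_INVALIDATE_TRUNCATE,
};

struct gmem_invalidate_ops {
	/*
	 * No new mappings may be created for [start, end) between
	 * invalidate_begin() and invalidate_end().
	 */
	void (*invalidate_begin)(void *priv, pgoff_t start, pgoff_t end,
				 enum gmem_invalidate_reason reason);
	void (*invalidate_end)(void *priv, pgoff_t start, pgoff_t end);
};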

> >
> > Current flow:
> > Private to Shared conversion via kvm_gmem_convert_range() -
> > 1) guest_memfd invokes kvm_gmem_invalidate_begin() for the ranges
> > on each bound memslot overlapping with the range
> > -> KVM has the concept of invalidation_begin() and end(),
> > which effectively ensures that between these function calls, no new
> > EPT/NPT entries can be added for the range.
> > 2) guest_memfd invokes kvm_gmem_convert_should_proceed() which
> > actually unmaps the KVM SEPT/NPT entries.
> > 3) guest_memfd invokes kvm_gmem_execute_work() which updates the
> > shareability and then splits the folios if needed
> >
> > Shared to private conversion via kvm_gmem_convert_range() -
> > 1) guest_memfd invokes kvm_gmem_invalidate_begin() for the ranges
> > on each bound memslot overlapping with the range
> > 2) guest_memfd invokes kvm_gmem_convert_should_proceed() which
> > actually unmaps the host mappings, which in turn unmaps the KVM
> > non-secure EPT/NPT entries.
> > 3) guest_memfd invokes kvm_gmem_execute_work() which updates the
> > shareability and then merges the folios if needed.
> >
> > ============================
> >
> > For IOMMU, could something like below work?
> >
> > * A new UAPI to bind IOMMU FDs with guest_memfd ranges
>
> Done that.
>
> > * VFIO_DMA_MAP/UNMAP operations modified to directly fetch pfns from
> > guest_memfd ranges using kvm_gmem_get_pfn()
>
> This API imho should drop the confusing kvm_ prefix.
>
> > -> kvm invokes kvm_gmem_is_private() to check the range's
> > shareability; the IOMMU could use the same, or we could add an API in
> > gmem that takes in the access type and checks the shareability before
> > returning the pfn.
>
> Right now I have cut-and-pasted kvm_gmem_get_folio() (which is essentially filemap_lock_folio()/filemap_alloc_folio()/__filemap_add_folio()) to avoid new links between iommufd.ko and kvm.ko. It is probably unavoidable though.

I don't think copy-pasting is the right way to avoid links between
iommufd.ko and kvm.ko. A cleaner way is probably to have the gmem logic
built-in and allow runtime registration of invalidation callbacks from
the KVM/IOMMU backends. Need to think about this more.
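
Roughly along these lines (again only a sketch on top of the
hypothetical ops above, nothing here exists today): the core gmem code
keeps a per-file list of invalidators, and the KVM/IOMMU backends
register themselves at bind time, so iommufd.ko never has to call into
kvm.ko at all:

struct gmem_invalidator {
	struct list_head			node;
	const struct gmem_invalidate_ops	*ops;
	void					*priv;
};

/* Lives in the core (non-modular) gmem code, not in kvm.ko. */
int gmem_register_invalidator(struct file *gmem_file,
			      struct gmem_invalidator *inv);
void gmem_unregister_invalidator(struct file *gmem_file,
				 struct gmem_invalidator *inv);

The conversion and truncation paths would then just walk the registered
invalidators and invoke ops->invalidate_begin()/end() on each one.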

>
>
> > * IOMMU stack exposes an invalidation callback that can be invoked by
> > guest_memfd.
> >
> > Private to Shared conversion via kvm_gmem_convert_range() -
> > 1) guest_memfd invokes kvm_gmem_invalidate_begin() for the ranges
> > on each bound memslot overlapping with the range
> > 2) guest_memfd invokes kvm_gmem_convert_should_proceed() which
> > actually unmaps the KVM SEPT/NPT entries.
> > -> guest_memfd invokes IOMMU invalidation callback to zap
> > the secure IOMMU entries.
> > 3) guest_memfd invokes kvm_gmem_execute_work() which updates the
> > shareability and then splits the folios if needed
> > 4) Userspace invokes IOMMU map operation to map the ranges in
> > non-secure IOMMU.
> >
> > Shared to private conversion via kvm_gmem_convert_range() -
> > 1) guest_memfd invokes kvm_gmem_invalidate_begin() for the ranges
> > on each bound memslot overlapping with the range
> > 2) guest_memfd invokes kvm_gmem_convert_should_proceed() which
> > actually unmaps the host mappings, which in turn unmaps the KVM
> > non-secure EPT/NPT entries.
> > -> guest_memfd invokes IOMMU invalidation callback to zap the
> > non-secure IOMMU entries.
> > 3) guest_memfd invokes kvm_gmem_execute_work() which updates the
> > shareability and then merges the folios if needed.
> > 4) Userspace invokes IOMMU map operation to map the ranges in secure IOMMU.
>
>
> Alright (although this zap+map is not necessary on the AMD hw).

IMO guest_memfd ideally should not directly interact with or cater to
arch-specific needs; it should implement a mechanism that works for all
archs. KVM/IOMMU implement the invalidation callbacks and have all the
architecture-specific knowledge to make the right decisions.
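
For example, with the reason carried in the hypothetical callback
sketched earlier, an AMD backend could decide on its own to skip the
zap for conversions if, as you say, RMPUPDATE already takes care of the
IOMMU TLB flush, while other backends zap unconditionally; guest_memfd
itself stays arch-agnostic. Made-up names again:

static void amd_iommu_gmem_invalidate_begin(void *priv, pgoff_t start,
					    pgoff_t end,
					    enum gmem_invalidate_reason reason)
{
	/*
	 * Backend-side policy that guest_memfd never sees: on this
	 * hardware a conversion does not need an explicit zap (RMPUPDATE
	 * already flushes), while a truncation still does.
	 */
	if (reason == GMEM_INVALIDATE_TRUNCATE)
		amd_iommu_gmem_zap_range(priv, start, end); /* made-up helper */
}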

>
>
> > There should be a way to block external IOMMU pagetable updates
> > while guest_memfd is performing a conversion, e.g. something like
> > kvm_invalidate_begin()/end().
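
One possible shape for that, borrowing the mmu_invalidate_seq idea from
KVM (just a sketch, names are made up): the backend bumps a sequence
count in its invalidate_begin() callback, and its map path samples the
count before looking up the pfn and rechecks it before installing the
IOMMU entry, retrying if a conversion raced in between.

struct gmem_iommu_binding {
	spinlock_t	lock;
	unsigned long	invalidate_seq;		/* bumped in invalidate_begin() */
	unsigned int	invalidations_in_progress;
};

/* Called with b->lock held, right before installing IOMMU entries. */
static bool gmem_iommu_map_allowed(struct gmem_iommu_binding *b,
				   unsigned long seq_snapshot)
{
	return !b->invalidations_in_progress &&
	       b->invalidate_seq == seq_snapshot;
}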
> >
> >>> is going to be flushed on a page conversion anyway (the RMPUPDATE
> >>> instruction does that). All this is about AMD's x86 though.
> >>
> >> The iommu should not be using the VMA to manage the mapping. It should
> >
> > +1.
>
> Yeah, not doing this already, because I physically cannot map gmemfd's memory in the IOMMU via the VMA (that path allocates memory via gup(), so the wrong memory ends up mapped in the IOMMU). Thanks,
>
>
> >> be directly linked to the guestmemfd in some way that does not disturb
> >> its operations. I imagine there would be some kind of invalidation
> >> callback directly to the iommu.
> >>
> >> Presumably that invalidation call back can include a reason for the
> >> invalidation (addr change, shared/private conversion, etc)
> >>
> >> I'm not sure how we will figure out which case is which but guestmemfd
> >> should allow the iommu to plug in either invalidation scheme..
> >>
> >> Probably invalidation should be a global to the FD thing, I imagine
> >> that once invalidation is established the iommu will not be
> >> incrementing page refcounts.
> >
> > +1.
>
> Alright. Thanks for the comments.
>
> >
> >>
> >> Jason
>
> --
> Alexey
>