Re: [RFC PATCH 00/42] Sharing KVM TDP to IOMMU

From: Yan Zhao
Date: Tue Dec 05 2023 - 01:24:34 EST


On Mon, Dec 04, 2023 at 11:22:49AM -0800, Sean Christopherson wrote:
> On Mon, Dec 04, 2023, Jason Gunthorpe wrote:
> > On Mon, Dec 04, 2023 at 09:00:55AM -0800, Sean Christopherson wrote:

> > > I'm not convinced that memory consumption is all that interesting. If a VM is
> > > mapping the majority of memory into a device, then odds are good that the guest
> > > is backed with at least 2MiB pages, if not 1GiB pages, at which point the memory
> > > overhead for page tables is quite small, especially relative to the total amount
> > > of memory overheads for such systems.
> >
> > AFAIK the main argument is performance. It is similar to why we want
> > to do IOMMU SVA with MM page table sharing.
> >
> > If IOMMU mirrors/shadows/copies a page table using something like HMM
> > techniques then the invalidations will mark ranges of IOVA as
> > non-present and faults will occur to trigger hmm_range_fault to do the
> > shadowing.
> >
> > This means that pretty much all IO will always encounter a non-present
> > fault, certainly at the start and maybe worse while ongoing.
> >
> > On the other hand, if we share the exact page table then natural CPU
> > touches will usually make the page present before an IO happens in
> > almost all cases and we don't have to take the horribly expensive IO
> > page fault at all.
>
> I'm not advocating mirroring/copying/shadowing page tables between KVM and the
> IOMMU. I'm suggesting managing IOMMU page tables mostly independently, but reusing
> KVM code to do so.
>
> I wouldn't even be opposed to KVM outright managing the IOMMU's page tables. E.g.
> add an "iommu" flag to "union kvm_mmu_page_role" and then the implementation looks
> rather similar to this series.
Yes, very similar to the current implementation, which added an "exported" flag to
"union kvm_mmu_page_role".
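
To illustrate why a role bit is enough to keep the two kinds of page tables
apart, here is a rough userspace sketch. It is not the code from this series:
the union below is a trimmed, hypothetical stand-in for
"union kvm_mmu_page_role", the field layout is made up, and the
sketch_make_role()/sketch_role_equal() helpers are names invented for this
mail. The only point is that roles are compared as a whole word, so a table
tagged "exported" never matches a lookup for an ordinary one.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical, heavily trimmed stand-in for union kvm_mmu_page_role.
 * The field layout is illustrative and does not match the kernel's.
 */
union sketch_mmu_page_role {
	uint32_t word;
	struct {
		uint32_t level    : 4;
		uint32_t direct   : 1;
		uint32_t exported : 1;	/* assumed flag: table is shared with the IOMMU */
	};
};

/* The bitfields must fit in the word that is used for comparisons. */
static_assert(sizeof(union sketch_mmu_page_role) == sizeof(uint32_t),
	      "role must fit in one 32-bit word");

/* Build a role from a zeroed word so unused bits never affect comparisons. */
union sketch_mmu_page_role sketch_make_role(unsigned int level, bool exported)
{
	union sketch_mmu_page_role role = { .word = 0 };

	role.level = level;
	role.direct = 1;
	role.exported = exported;
	return role;
}

/* Two tables are interchangeable only if every role bit matches. */
bool sketch_role_equal(union sketch_mmu_page_role a,
		       union sketch_mmu_page_role b)
{
	return a.word == b.word;
}
```

With this, a vCPU looking for a root with exported=0 would skip a root created
with exported=1 even when the level and the other bits are identical.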
>
> What terrifies me is sharing page tables between the CPU and the IOMMU verbatim.
>
> Yes, sharing page tables will Just Work for faulting in memory, but the downside
> is that _when_, not if, KVM modifies PTEs for whatever reason, those modifications
> will also impact the IO path. My understanding is that IO page faults are at least
> an order of magnitude more expensive than CPU page faults. That means that what's
> optimal for CPU page tables may not be optimal, or even _viable_, for IOMMU page
> tables.
>
> E.g. based on our conversation at LPC, write-protecting guest memory to do dirty
> logging is not a viable option for the IOMMU because the latency of the resulting
> IOPF is too high. Forcing KVM to use D-bit dirty logging for CPUs just because
> the VM has passthrough (mediated?) devices would be likely a non-starter.
>
> One of my biggest concerns with sharing page tables between KVM and IOMMUs is that
> we will end up having to revert/reject changes that benefit KVM's usage due to
> regressing the IOMMU usage.
>
Since KVM marks the TDP that is shared with the IOMMU, could we limit changes
that benefit KVM but regress the IOMMU to the TDPs that are not shared?

> If instead KVM treats IOMMU page tables as their own thing, then we can have
> divergent behavior as needed, e.g. different dirty logging algorithms, different
> software-available bits, etc. It would also allow us to define new ABI instead
> of trying to reconcile the many incompatibilities and warts in KVM's existing ABI.
> E.g. off the top of my head:
>
> - The virtual APIC page shouldn't be visible to devices, as it's not "real" guest
> memory.
>
> - Access tracking, i.e. page aging, by making PTEs !PRESENT because the CPU
> doesn't support A/D bits or because the admin turned them off via KVM's
> enable_ept_ad_bits module param.
>
> - Write-protecting GFNs for shadow paging when L1 is running nested VMs. KVM's
> ABI can be that device writes to L1's page tables are exempt.
>
> - KVM can exempt IOMMU page tables from KVM's awful "drop all page tables if
> any memslot is deleted" ABI.
>
> > We were not able to make bi-dir notifiers with the CPU mm, I'm
> > not sure that is "relatively easy" :(
>
> I'm not suggesting full blown mirroring, all I'm suggesting is a fire-and-forget
> notifier for KVM to tell IOMMUFD "I've faulted in GFN A, you might want to do the
> same".
>
> It wouldn't even necessarily need to be a notifier per se, e.g. if we taught KVM
> to manage IOMMU page tables, then KVM could simply install mappings for multiple
> sets of page tables as appropriate.
I'm not sure which of the approaches below you are referring to by
"fire-and-forget notifier" and "if we taught KVM to manage IOMMU page tables".

Approach A:
1. User space or IOMMUFD tells KVM which address space to share with IOMMUFD.
2. KVM creates a special TDP and maps into this page table whenever a GFN in the
specified address space is faulted to a PFN on the vCPU side.
3. IOMMUFD imports this special TDP and receives zap notifications from KVM.
KVM will only send the zap notification for memslot removal or for certain MMU
zap notifications.
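
To make approach A concrete, a rough userspace C sketch follows. Everything in
it is hypothetical: the "exported TDP" is a flat array rather than a real page
table, and all sketch_* names are invented for this mail rather than taken from
KVM or IOMMUFD. It only shows the data flow: KVM is the sole writer of one
table, the importer walks that same table, and the importer hears about zaps
through a callback.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_NR_GFNS 16
#define SKETCH_NO_PFN  UINT64_MAX

/* Hypothetical importer-side state; stands in for IOMMUFD. */
struct sketch_importer {
	unsigned int zaps_seen;
};

/* Hypothetical exported TDP: a flat GFN -> PFN array, not a real radix tree. */
struct sketch_exported_tdp {
	uint64_t pfn[SKETCH_NR_GFNS];
	struct sketch_importer *importer;
};

/* Step 2: KVM creates the special TDP, initially empty. */
void sketch_tdp_init(struct sketch_exported_tdp *tdp)
{
	for (uint64_t gfn = 0; gfn < SKETCH_NR_GFNS; gfn++)
		tdp->pfn[gfn] = SKETCH_NO_PFN;
	tdp->importer = NULL;
}

/* Step 3: the importer attaches and from now on receives zap notifications. */
void sketch_tdp_import(struct sketch_exported_tdp *tdp,
		       struct sketch_importer *imp)
{
	tdp->importer = imp;
}

/* Step 2: a vCPU fault makes KVM map the GFN in the exported table. */
void sketch_kvm_fault(struct sketch_exported_tdp *tdp, uint64_t gfn,
		      uint64_t pfn)
{
	assert(gfn < SKETCH_NR_GFNS);
	tdp->pfn[gfn] = pfn;
}

/* Step 3: KVM zaps [start, end) and notifies the importer, if any. */
void sketch_kvm_zap(struct sketch_exported_tdp *tdp, uint64_t start,
		    uint64_t end)
{
	assert(start <= end && end <= SKETCH_NR_GFNS);
	for (uint64_t gfn = start; gfn < end; gfn++)
		tdp->pfn[gfn] = SKETCH_NO_PFN;
	if (tdp->importer)
		tdp->importer->zaps_seen++;	/* real code would flush the IOTLB */
}

/* The IOMMU side walks the very same table; it keeps no copy of its own. */
uint64_t sketch_iommu_walk(const struct sketch_exported_tdp *tdp, uint64_t gfn)
{
	assert(gfn < SKETCH_NR_GFNS);
	return tdp->pfn[gfn];
}
```

Note that in this model the importer never maps anything itself. A GFN becomes
visible to the device only after a vCPU has faulted it in.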

Approach B:
1. User space or IOMMUFD tells KVM which address space to notify.
2. KVM notifies IOMMUFD whenever a GFN in the specified address space is faulted
to a PFN on the vCPU side.
3. IOMMUFD translates GFN to PFN in its own way (through VMA or through certain
new memfd interface), and maps IO PTEs by itself.
4. IOMMUFD zaps IO PTEs when a memslot is removed and interacts with the MMU
notifier for zap notifications in the primary MMU.
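
For comparison, approach B could be modeled roughly as below. Again this is
only a userspace sketch with made-up names: sketch_iommufd_resolve() stands in
for whatever VMA or memfd lookup IOMMUFD would really use (here just a fixed
offset), and the IO page table is a flat array. The difference from a shared
table is that KVM passes only the GFN as a hint, and IOMMUFD owns both the
translation and its own IO PTEs.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_NR_GFNS  16
#define SKETCH_NO_PFN   UINT64_MAX
#define SKETCH_PFN_BASE 0x1000ULL	/* assumed backing layout, for the demo */

/* Hypothetical IOMMUFD-side state: its own IO page table, separate from KVM. */
struct sketch_iommufd {
	uint64_t io_pfn[SKETCH_NR_GFNS];
};

void sketch_iommufd_init(struct sketch_iommufd *fd)
{
	for (uint64_t gfn = 0; gfn < SKETCH_NR_GFNS; gfn++)
		fd->io_pfn[gfn] = SKETCH_NO_PFN;
}

/* Step 3: IOMMUFD resolves GFN -> PFN by its own means (fixed offset here). */
uint64_t sketch_iommufd_resolve(uint64_t gfn)
{
	return SKETCH_PFN_BASE + gfn;
}

/* Step 3: on a hint, IOMMUFD maps the IO PTE by itself. */
void sketch_iommufd_fault_hint(struct sketch_iommufd *fd, uint64_t gfn)
{
	assert(gfn < SKETCH_NR_GFNS);
	fd->io_pfn[gfn] = sketch_iommufd_resolve(gfn);
}

/* Step 2: KVM handles the vCPU fault, then fires the hint and forgets it. */
void sketch_kvm_vcpu_fault(struct sketch_iommufd *listener, uint64_t gfn)
{
	/* KVM's own TDP update would happen here; it is not modeled. */
	if (listener)
		sketch_iommufd_fault_hint(listener, gfn);
}

/* Step 4: IOMMUFD zaps its own IO PTEs, e.g. on memslot removal. */
void sketch_iommufd_zap(struct sketch_iommufd *fd, uint64_t start,
			uint64_t end)
{
	assert(start <= end && end <= SKETCH_NR_GFNS);
	for (uint64_t gfn = start; gfn < end; gfn++)
		fd->io_pfn[gfn] = SKETCH_NO_PFN;
}
```

Here KVM never sees the IO page table, so KVM-side changes such as
write-protection would not leak into the IO path, at the cost of a second
GFN-to-PFN lookup per hint.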


If approach A is preferred, could vCPUs also be allowed to attach to this
special TDP in VMs that don't suffer from the NX hugepage mitigation, do not
want live migration with passthrough devices, and don't rely on write-protection
for nested VMs?