Re: [RFC PATCH 00/42] Sharing KVM TDP to IOMMU

From: Jason Gunthorpe
Date: Mon Dec 04 2023 - 14:51:06 EST


On Mon, Dec 04, 2023 at 11:22:49AM -0800, Sean Christopherson wrote:
> On Mon, Dec 04, 2023, Jason Gunthorpe wrote:
> > On Mon, Dec 04, 2023 at 09:00:55AM -0800, Sean Christopherson wrote:
> >
> > > There are more approaches beyond having IOMMUFD and KVM be
> > > completely separate entities. E.g. extract the bulk of KVM's "TDP
> > > MMU" implementation to common code so that IOMMUFD doesn't need to
> > > reinvent the wheel.
> >
> > We've pretty much done this already, it is called "hmm" and it is what
> > the IO world uses. Merging/splitting huge pages is just something that
> > needs some coding in the page table code, which people want for other
> > reasons anyhow.
>
> Not really. HMM is a wildly different implementation than KVM's TDP MMU. At a
> glance, HMM is basically a variation on the primary MMU, e.g. deals with VMAs,
> runs under mmap_lock (or per-VMA locks?), and faults memory into the primary MMU
> while walking the "secondary" HMM page tables.

hmm supports the essential idea of shadowing parts of the primary
MMU. This is a big chunk of what kvm is doing, just differently.
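
To make that concrete, the usual pattern a driver uses to mirror a
chunk of the primary MMU is roughly the below; mirror_insert_pfns()
and mirror_lock are made-up stand-ins for the driver's own page table
update and lock, the rest is the documented hmm_range_fault() flow:

#include <linux/hmm.h>
#include <linux/mm.h>
#include <linux/mmu_notifier.h>
#include <linux/mutex.h>

static DEFINE_MUTEX(mirror_lock);	/* made up, stands in for the driver lock */

static int mirror_populate(struct mm_struct *mm,
			   struct mmu_interval_notifier *notifier,
			   unsigned long start, unsigned long end,
			   unsigned long *pfns)
{
	struct hmm_range range = {
		.notifier	= notifier,
		.start		= start,
		.end		= end,
		.hmm_pfns	= pfns,
		.default_flags	= HMM_PFN_REQ_FAULT | HMM_PFN_REQ_WRITE,
	};
	int ret;

again:
	range.notifier_seq = mmu_interval_read_begin(notifier);

	mmap_read_lock(mm);
	ret = hmm_range_fault(&range);
	mmap_read_unlock(mm);
	if (ret == -EBUSY)
		goto again;
	if (ret)
		return ret;

	/*
	 * Publish the PFNs into the device page table under the driver
	 * lock, restarting if the primary MMU changed underneath us.
	 */
	mutex_lock(&mirror_lock);
	if (mmu_interval_read_retry(notifier, range.notifier_seq)) {
		mutex_unlock(&mirror_lock);
		goto again;
	}
	mirror_insert_pfns(pfns, start, end);	/* made up */
	mutex_unlock(&mirror_lock);
	return 0;
}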

> KVM's TDP MMU (and all of KVM's flavors of MMUs) is much more of a pure secondary
> MMU. The core of a KVM MMU maps GFNs to PFNs, the intermediate steps that involve
> the primary MMU are largely orthogonal. E.g. getting a PFN from guest_memfd
> instead of the primary MMU essentially boils down to invoking kvm_gmem_get_pfn()
> instead of __gfn_to_pfn_memslot(), the MMU proper doesn't care how the PFN was
> resolved. I.e. 99% of KVM's MMU logic has no interaction with the primary MMU.

Hopefully the memfd stuff will be generalized so we can use it in
iommufd too, without relying on kvm. At least the first basic stuff
should be doable fairly soon.
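
FWIW, the shape Sean describes above is roughly the below; if iommufd
can resolve PFNs from the memfd in the same leaf step then the layers
above it shouldn't need to care either. The resolve_*() and
tdp_mmu_map_pfn() names are invented for illustration, the real KVM
helpers are kvm_gmem_get_pfn() and __gfn_to_pfn_memslot():

/* Sketch only, not real KVM code. */
static int example_page_fault(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_private)
{
	kvm_pfn_t pfn;
	int r;

	if (is_private)
		r = resolve_pfn_from_gmem(vcpu->kvm, gfn, &pfn);	/* guest_memfd backed */
	else
		r = resolve_pfn_from_primary_mmu(vcpu->kvm, gfn, &pfn);	/* hva/VMA backed */
	if (r)
		return r;

	/* Everything from here down doesn't care which path ran. */
	return tdp_mmu_map_pfn(vcpu, gfn, pfn);
}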

> I'm not advocating mirroring/copying/shadowing page tables between KVM and the
> IOMMU. I'm suggesting managing IOMMU page tables mostly independently, but reusing
> KVM code to do so.

I guess from my POV, if KVM has two copies of what is logically the
same radix tree then that is fine too.

> Yes, sharing page tables will Just Work for faulting in memory, but the downside
> is that _when_, not if, KVM modifies PTEs for whatever reason, those modifications
> will also impact the IO path. My understanding is that IO page faults are at least
> an order of magnitude more expensive than CPU page faults. That means that what's
> optimal for CPU page tables may not be optimal, or even _viable_, for IOMMU page
> tables.

Yes, you wouldn't want to use some of today's KVM techniques in a
shared mode.

> E.g. based on our conversation at LPC, write-protecting guest memory to do dirty
> logging is not a viable option for the IOMMU because the latency of the resulting
> IOPF is too high. Forcing KVM to use D-bit dirty logging for CPUs just because
> the VM has passthrough (mediated?) devices would likely be a
> non-starter.

Yes

> One of my biggest concerns with sharing page tables between KVM and IOMMUs is that
> we will end up having to revert/reject changes that benefit KVM's usage due to
> regressing the IOMMU usage.

It is certainly a strong argument.

> I'm not suggesting full blown mirroring, all I'm suggesting is a fire-and-forget
> notifier for KVM to tell IOMMUFD "I've faulted in GFN A, you might want to do the
> same".

If we say the only thing this works with is the memfd version of KVM,
could we design the memfd stuff to not have the same challenges with
mirroring as normal VMAs?
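
Either way, to make the "fire and forget" idea concrete, I'd imagine
something as small as the below; this is purely a strawman, none of
these names exist today:

struct kvm_io_prefault_ops {
	/*
	 * Advisory only: KVM never waits on or cares about the outcome,
	 * iommufd may pre-fault the GFN range into the IO page table or
	 * ignore the hint entirely.
	 */
	void (*gfn_faulted)(struct kvm_io_prefault_ops *ops, gfn_t gfn,
			    unsigned int order, bool writable);
};

/* Called from KVM's fault path once its own mapping is installed. */
static inline void kvm_notify_io_prefault(struct kvm *kvm, gfn_t gfn,
					  unsigned int order, bool writable)
{
	struct kvm_io_prefault_ops *ops = READ_ONCE(kvm->io_prefault_ops);

	if (ops)
		ops->gfn_faulted(ops, gfn, order, writable);
}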

> It wouldn't even necessarily need to be a notifier per se, e.g. if we taught KVM
> to manage IOMMU page tables, then KVM could simply install mappings for multiple
> sets of page tables as appropriate.

This somehow feels more achievable to me since KVM already has all the
code to handle multiple TDPs; having two parallel ones is probably
much easier than trying to weld KVM to a different page table
implementation through some kind of loosely coupled notifier.
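
Roughly what I have in mind, with invented names and all locking
ignored, is that KVM's fault path just installs the translation into
both tables, and the IOMMU-format one gets its own IO-friendly
policies (no write-protect based dirty logging, its own huge page
rules, etc.):

/* Sketch only, every name here is made up. */
static int example_install_both(struct kvm_vcpu *vcpu, gfn_t gfn, kvm_pfn_t pfn)
{
	int r;

	/* CPU-visible TDP, normal KVM rules. */
	r = tdp_mmu_map_pfn(vcpu, gfn, pfn);
	if (r)
		return r;

	/* IOMMU-visible table: same GFN->PFN translation, different policies. */
	if (vcpu->kvm->io_tdp)
		r = io_tdp_map_pfn(vcpu->kvm->io_tdp, gfn, pfn);

	return r;
}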

Jason