Re: [RFC PATCH 00/42] Sharing KVM TDP to IOMMU

From: Yan Zhao
Date: Mon Dec 04 2023 - 21:01:09 EST


On Mon, Dec 04, 2023 at 08:38:17AM -0800, Sean Christopherson wrote:
> On Mon, Dec 04, 2023, Jason Gunthorpe wrote:
> > On Sat, Dec 02, 2023 at 05:12:11PM +0800, Yan Zhao wrote:
> > > In this series, term "exported" is used in place of "shared" to avoid
> > > confusion with terminology "shared EPT" in TDX.
> > >
> > > The framework contains 3 main objects:
> > >
> > > "KVM TDP FD" object - The interface of KVM to export TDP page tables.
> > > With this object, KVM allows external components to
> > > access a TDP page table exported by KVM.
> >
> > I don't know much about the internals of kvm, but why have this extra
> > user visible piece?
>
> That I don't know, I haven't looked at the gory details of this RFC.
>
> > Isn't there only one "TDP" per kvm fd?
>
> No. In steady state, with TDP (EPT) enabled and assuming homogeneous capabilities
> across all vCPUs, KVM will have 3+ sets of TDP page tables *active* at any given time:
>
> 1. "Normal"
> 2. SMM
> 3-N. Guest (for L2, i.e. nested, VMs)
Yes, the reason for introducing the KVM TDP FD is to let KVM know which TDP the
user wants to export (share).

For as_id=0 (which is currently the only supported as_id to share), a TDP with
smm=0, guest_mode=0 will be chosen.

Upon receiving the KVM_CREATE_TDP_FD ioctl, KVM will try to find an existing
TDP root with the role specified by as_id 0. If an existing TDP root with the
target role is found, KVM will just export that one; if none is found, KVM will
create a new TDP root in non-vCPU context.
KVM will then mark that TDP root as "exported".
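
A rough sketch of the find-or-create flow above, written as a toy userspace
model rather than kernel code. All names here (toy_vm, toy_root,
toy_create_tdp_fd, ...) are made up for illustration and are not taken from
the series; the fixed-size array only stands in for the tdp_mmu_roots list.

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy model only; none of these names come from the patches. */
struct toy_role {
	bool smm;
	bool guest_mode;
};

struct toy_root {
	struct toy_role role;
	bool in_use;
	bool exported;
};

#define TOY_MAX_ROOTS 4

struct toy_vm {
	struct toy_root roots[TOY_MAX_ROOTS];	/* stand-in for tdp_mmu_roots */
};

/* Return the root with a matching role, creating one if none exists yet. */
static struct toy_root *toy_get_or_create_root(struct toy_vm *vm,
					       struct toy_role role)
{
	struct toy_root *free_slot = NULL;

	for (size_t i = 0; i < TOY_MAX_ROOTS; i++) {
		struct toy_root *r = &vm->roots[i];

		if (!r->in_use) {
			if (!free_slot)
				free_slot = r;
			continue;
		}
		if (r->role.smm == role.smm &&
		    r->role.guest_mode == role.guest_mode)
			return r;
	}

	if (!free_slot)
		return NULL;

	free_slot->role = role;
	free_slot->in_use = true;
	free_slot->exported = false;
	return free_slot;
}

/* Model of the ioctl path: only as_id 0 (smm=0, guest_mode=0) is accepted. */
static struct toy_root *toy_create_tdp_fd(struct toy_vm *vm, int as_id)
{
	struct toy_role role = { .smm = false, .guest_mode = false };
	struct toy_root *root;

	if (as_id != 0)
		return NULL;

	root = toy_get_or_create_root(vm, role);
	if (root)
		root->exported = true;
	return root;
}
```

In this model a later vCPU fault with smm=0, guest_mode=0 goes through
toy_get_or_create_root() and gets back the same root that the TDP FD marked as
exported, which is the sharing the diagram below shows.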


                                       tdp_mmu_roots
                                             |
 role | smm | guest_mode              +------+-----------+----------+
------|-----------------              |      |           |          |
  0   |  0  |  0 ==> address space 0  |      v           v          v
  1   |  1  |  0                      |  .--------.  .--------. .--------.
  2   |  0  |  1                      |  |  root  |  |  root  | |  root  |
  3   |  1  |  1                      |  |(role 1)|  |(role 2)| |(role 3)|
                                      |  '--------'  '--------' '--------'
                                      |      ^
                                      |      |    create or get   .------.
                                      |      +--------------------| vCPU |
                                      |              fault        '------'
                                      |                            smm=1
                                      |                       guest_mode=0
                                      |
          (set root as exported)      v
.--------.    create or get   .---------------.  create or get    .------.
| TDP FD |------------------->| root (role 0) |<------------------| vCPU |
'--------'        fault       '---------------'      fault        '------'
                                      .                            smm=0
                                      .                       guest_mode=0
                                      .
                 non-vCPU context <---|---> vCPU context
                                      .
                                      .

Whether or not the TDP is exported, each vCPU just loads the TDP root that
matches its vCPU mode.
In this way, KVM is able to share the TDP in KVM address space 0 with the IOMMU side.
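
To make the role table in the diagram concrete, here is a small sketch of how a
vCPU mode maps to one of the four roles. The 0..3 numbering is just the
labelling used in the diagram, and the helper names are hypothetical.

```c
#include <stdbool.h>

/* Illustrative only: the 0..3 numbering follows the table in the diagram. */
static unsigned int toy_role_index(bool smm, bool guest_mode)
{
	return (smm ? 1u : 0u) | (guest_mode ? 2u : 0u);
}

/* True only for the role the TDP FD exports (smm=0, guest_mode=0). */
static bool toy_role_is_exportable(bool smm, bool guest_mode)
{
	return toy_role_index(smm, guest_mode) == 0;
}
```

A vCPU in SMM or in guest mode therefore never lands on the exported root; only
a vCPU with smm=0, guest_mode=0 shares it with the TDP FD.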

> The number of possible TDP page tables used for nested VMs is well bounded, but
> since devices obviously can't be nested VMs, I won't bother trying to explain the
> the various possibilities (nested NPT on AMD is downright ridiculous).
In the future, if possible, I wonder if we can export a TDP for a nested VM too.
E.g. in scenarios where the TDP is partitioned, and one piece is for the L2 VM.
Maybe we can specify that and tell KVM exactly which piece of the TDP to export.

> Nested virtualization aside, devices are obviously not capable of running in SMM
> and so they all need to use the "normal" page tables.
>
> I highlighted "active" above because if _any_ memslot is deleted, KVM will invalidate
> *all* existing page tables and rebuild new page tables as needed. So over the
> lifetime of a VM, KVM could theoretically use an infinite number of page tables.
Right. In patch 36, the TDP root which is marked as "exported" is exempted
from "invalidate". Instead, an "exported" TDP just zaps all leaf entries upon
memory slot removal.
That is to say, an exported TDP can stay "active" until it's unmarked as
exported.
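
As a toy model of that behaviour (not the code in patch 36; the struct and
function names are invented for illustration, and a simple counter stands in
for the real leaf entries):

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy model only; not the data structures used by KVM. */
struct toy_root {
	bool exported;			/* marked via the TDP FD */
	bool valid;			/* false once invalidated */
	unsigned int nr_leaf_entries;	/* stand-in for mapped leaf SPTEs */
};

/*
 * Model of memslot removal: a normal root is invalidated and will be rebuilt
 * later, while an exported root stays valid and only has its leaves zapped.
 */
static void toy_on_memslot_delete(struct toy_root *roots, size_t n)
{
	for (size_t i = 0; i < n; i++) {
		if (roots[i].exported)
			roots[i].nr_leaf_entries = 0;
		else
			roots[i].valid = false;
	}
}
```

After toy_on_memslot_delete(), an exported root is still valid but empty, so
whatever holds the TDP FD keeps referring to the same root and its mappings are
simply faulted back in.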