Re: [PATCH v11 018/113] KVM: TDX: create/destroy VM structure

From: Sean Christopherson
Date: Fri Jan 20 2023 - 19:12:21 EST


On Fri, Jan 20, 2023, David Matlack wrote:
> On Thu, Jan 19, 2023 at 4:16 PM Sean Christopherson <seanjc@xxxxxxxxxx> wrote:
> >
> > On Thu, Jan 19, 2023, Huang, Kai wrote:
> > > On Thu, 2023-01-19 at 21:36 +0000, Sean Christopherson wrote:
> > > > The least invasive idea I have is expand the TDP MMU's concept of "frozen" SPTEs
> > > > and freeze (a.k.a. lock) the SPTE (KVM's mirror) until the corresponding S-EPT
> > > > update completes.

Doh, the proposed patches already do a freeze+unfreeze. Sorry, I obviously didn't
read the most recent changelog and haven't looked at the code in a few versions.
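
For reference, the pattern I had in mind is roughly the below (completely
untested sketch; REMOVED_SPTE is the marker the TDP MMU already uses for
"frozen" SPTEs, and the set_private_spte hook name is an approximation of
what's in the series):

	/* Freeze the mirror SPTE so other walkers see it as in-flux. */
	if (!try_cmpxchg64(sptep, &old_spte, REMOVED_SPTE))
		return -EBUSY;	/* lost the race, caller retries */

	/* Do the S-EPT update via SEAMCALL while the SPTE is frozen. */
	ret = static_call(kvm_x86_set_private_spte)(kvm, gfn, level, pfn);

	/* Unfreeze: publish the new value on success, restore on failure. */
	__kvm_tdp_mmu_write_spte(sptep, ret ? old_spte : new_spte);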

> > > This will introduce another "having-to-wait while SPTE is frozen" problem I
> > > think, which IIUC means (one way is) you have to do some loop and retry, perhaps
> > > similar to yield_safe.
> >
> > Yes, but because the TDP MMU already freezes SPTEs (just for a shorter duration),
> > I'm 99% sure all of the affected flows already know how to yield/bail when necessary.
> >
> > The problem with the zero-step mitigation is that it could (theoretically) cause
> > a "busy" error on literally any accesses, which makes it infeasible for KVM to have
> > sane behavior. E.g. freezing SPTEs to avoid the ordering issues isn't necessary
> > when holding mmu_lock for write, whereas the zero-step madness brings everything
> > into play.
>
> (I'm still ramping up on TDX so apologies in advance if the following
> is totally off base.)
>
> The complexity, and to a lesser extent the memory overhead, of
> mirroring Secure EPT tables with the TDP MMU makes me wonder if it is
> really worth it. Especially since the initial TDX support has so many
> constraints that would seem to allow a simpler implementation: all
> private memory is pinned, no live migration support, no test/clear
> young notifiers, etc.
>
> For the initial version of KVM TDX support, what if we implemented the
> Secure EPT management entirely off to the side? i.e. Not on top of the
> TDP MMU. For example, write TDX-specific routines for:
>
> - Fully populating the Secure EPT tree some time during VM creation.

Fully populating S-EPT is likely not a viable option for performance reasons. The
number of SEAMCALLs needed for a large VM would be enormous, and to avoid faults
entirely the backing memory would also need to be pre-allocated.
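
Back-of-the-envelope, assuming one TDH.MEM.PAGE.ADD/AUG SEAMCALL per 4KiB page:

	256GiB / 4KiB = 2^26 pages => ~67 million SEAMCALLs

and that's ignoring the TDH.MEM.SEPT.ADD calls needed to build the non-leaf
S-EPT pages, and the cost of allocating and pinning the backing memory up front.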

> - Tearing down the Secure EPT tree during VM destruction.
> - Support for unmapping/mapping specific regions of the Secure EPT
> tree for private<->shared conversions.
>
> With that in place, KVM never would need to handle a fault on a Secure
> EPT mapping. Any fault (e.g. due to an in-progress private<->shared
> conversion) can just return back to the guest to retry the memory
> access until the operation is complete.

I don't think this will work. Or at least, it's not that simple. TDX and SNP
both require the host to support implicit conversions from shared => private,
i.e. KVM needs to be able to handle faults on private memory at any time. KVM
could rely on the attributes xarray to know whether it should resume the guest or
exit to userspace, but at some point KVM needs to know whether or not an entry has
been installed. We could work around that by adding a hook to "set attributes"
to populate S-EPT entries, but I'm not sure that would yield a simpler implementation
without sacrificing performance.
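
E.g. the fault path would need something along the lines of the below sketch
(kvm_mem_is_private() is from the UPM series, the -EFAULT is a placeholder for
a proper exit to userspace):

	/*
	 * Exit to userspace if the fault's shared vs. private state
	 * disagrees with the attributes xarray, i.e. if userspace needs
	 * to convert the page before the access can succeed.
	 */
	if (fault->is_private != kvm_mem_is_private(vcpu->kvm, fault->gfn))
		return -EFAULT;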

Thinking more about this, I do believe there is a simpler approach than freezing
S-EPT entries. If we tweak the TDP MMU rules for TDX to require mmu_lock be held
for write when overwriting a present SPTE (with a live VM), then the problematic
cases go away. As you point out, all of the restrictions on TDX private memory
mean we're most of the way there.

Explicit unmap flows already take mmu_lock for write, zapping/splitting for dirty
logging isn't a thing (yet), and the S-EPT root is fixed, i.e. kvm_tdp_mmu_put_root()
should never free S-EPT tables until the VM is dead.
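
Enforcing the rule could then be done directly in tdp_mmu_set_spte_atomic(),
e.g. (sketch only, kvm_has_mirrored_sept() is a made-up helper):

	/*
	 * tdp_mmu_set_spte_atomic() is used only with mmu_lock held for
	 * read, so for VMs with mirrored S-EPT, refuse to overwrite a
	 * present SPTE; such flows must take mmu_lock for write instead.
	 */
	if (kvm_has_mirrored_sept(kvm) && is_shadow_present_pte(iter->old_spte))
		return -EBUSY;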

In other words, the only problematic flow is kvm_tdp_mmu_map(). Specifically,
KVM will overwrite a present SPTE only to demote/promote an existing entry, i.e.
to split a hugepage or collapse into a hugepage. And I believe it's not too
difficult to ensure KVM never does in-place promotion/demotion (see the sketch
below the list):

- Private memory doesn't rely on host userspace page tables, i.e. KVM won't
try to promote to a hugepage because a hugepage was created in the host.

- Private memory doesn't (yet) support page migration, so again KVM won't try
to promote because a hugepage is suddenly possible.

- Shadow paging isn't compatible, i.e. disallow_lpage won't change dynamically
except for when there are mixed attributes, and toggling private/shared in
the attributes requires zapping with mmu_lock held for write.

- The idiotic non-coherent DMA + guest MTRR interaction can be disallowed (which
I assume is already the case for TDX).

- Disallow toggling the NX hugepage mitigation when TDX is allowed.
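
With those restrictions in place, a guard along these lines (again a sketch,
with the same made-up helper) should never fire, i.e. tripping it would mean
a KVM bug:

	/*
	 * In tdp_mmu_map_handle_target_level(): replacing a present SPTE
	 * with one of a different "shape" (leaf vs. non-leaf) is in-place
	 * promotion/demotion, which should never happen for S-EPT.
	 */
	if (WARN_ON_ONCE(kvm_has_mirrored_sept(vcpu->kvm) &&
			 is_shadow_present_pte(iter->old_spte) &&
			 is_last_spte(iter->old_spte, iter->level) !=
			 is_last_spte(new_spte, iter->level)))
		return RET_PF_RETRY;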

Races to install leaf SPTEs will yield a "loser" either on the spurious check in
tdp_mmu_map_handle_target_level():

	if (new_spte == iter->old_spte)
		ret = RET_PF_SPURIOUS;

or on the CMPXCHG in tdp_mmu_set_spte_atomic().
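
For reference, the latter is (paraphrasing the current code):

	/* Losers observe the CMPXCHG failing and retry the fault. */
	if (!try_cmpxchg64(rcu_dereference(iter->sptep), &iter->old_spte, new_spte))
		return -EBUSY;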

So I _think_ the race I described can't actually happen in practice, and if it
does happen, then we have a bug somewhere else. Though that raises the question
of why the proposed patches do the freeze+unfreeze. If it's just the VM destroy
case, then that should be easy enough to special case.


Intel folks,

Do you happen to know exactly what scenario prompted adding the freeze+unfreeze
code? Is there something I'm forgetting/missing, or is it possible we can go
with a simpler implementation?