Re: [PATCH v5 07/17] KVM: x86: add fields to struct kvm_arch for CoCo features

From: Sean Christopherson
Date: Mon Apr 08 2024 - 21:21:34 EST


On Fri, Apr 05, 2024, Rick P Edgecombe wrote:
> On Thu, 2024-04-04 at 08:13 -0400, Paolo Bonzini wrote:
> >  
> >  struct kvm_arch {
> > -       unsigned long vm_type;
> >         unsigned long n_used_mmu_pages;
> >         unsigned long n_requested_mmu_pages;
> >         unsigned long n_max_mmu_pages;
> >         unsigned int indirect_shadow_pages;
> >         u8 mmu_valid_gen;
> > +       u8 vm_type;
> > +       bool has_private_mem;
> > +       bool has_protected_state;
>
> I'm a little late to this conversation, so hopefully not just complicating
> things. But why not deduce has_private_mem and has_protected_state from the
> vm_type during runtime? Like if kvm.arch.vm_type was instead a bit mask with
> the bit position of the KVM_X86_*_VM set, kvm_arch_has_private_mem() could
> bitwise-and with a compile time mask of vm_types that have primate memory.
> This also prevents it from ever transitioning through non-nonsensical states
> like vm_type == KVM_X86_TDX_VM, but !has_private_memory, so would be a little
> more robust.

LOL, time is a circle, or something like that. Paolo actually did this in v2[*],
and I objected, vociferously.

KVM advertises VM types to userspace via a 32-bit field, one bit per type. So
without more uAPI changes, the VM type needs to be <=31. KVM could embed the
"has private memory" information into the type, but then we cut down on the number
of possible VM types *and* bleed has_private_memory into KVM's ABI.

While it's unlikely KVM will ever support TDX without has_private_memory, it's
entirely possible that KVM could add support for an existing VM "base" type that
doesn't currently support private memory. E.g. with some massaging, KVM could
support private memory for SEV and SEV-ES. And then we need to add an entirely
new VM type just so that KVM can let it use private memory.

Obviously KVM could shove in bits after the fact, e.g. store vm_type as a u64
instead of u32 (or u8 as in this patch), but then what's the point? Burning a
byte instead of a bit for per-VM flag is a complete non-issue, and booleans tend
to yield code that's easier to read and easier to maintain.

[*] https://lore.kernel.org/all/ZdjL783FazB6V6Cy@xxxxxxxxxx

> Partly why I ask is there is logic in the x86 MMU TDX changes that tries to
> be generic but still needs special handling for it. The current solution is
> to look at kvm_gfn_shared_mask() as TDX is the only vm type that sets it, but
> Isaku and I were discussing if we should check something else, that didn't
> appear to be tying together to unrelated concepts:
> https://lore.kernel.org/kvm/20240319235654.GC1994522@xxxxxxxxxxxxxxxxxxxxx/
>
> Since it's down the mail, the relevant snippet:
> "
> > > void kvm_arch_flush_shadow_memslot(struct kvm *kvm,
> > > struct kvm_memory_slot *slot)
> > > {
> > > - kvm_mmu_zap_all_fast(kvm);
> > > + if (kvm_gfn_shared_mask(kvm))

Whatever you do that is TDX specific and an internal KVM thing is likely the wrong
thing :-)

The main reason KVM doesn't do a targeted zap on memslot removal is because of ABI
baggage that we _think_ is limited to interaction with VFIO. Since KVM doesn't
have any ABI for TDX *or* SNP, I want to at least entertain the option of doing
a target zap for SNP as well as TDX, even though it's only truly "necessary" for
TDX, in quotes because it's not strictly necessary, e.g. KVM could BLOCK the S-EPT
entries without fully removing the mappings.

Whether or not targeted zapping is optimal for SNP (or any VM type) is very much
TBD, and likely highly dependent on use case, but at the same time it would be
nice to not rule it out completely.

E.g. ChromeOS currently has a use case where they frequently delete and recreate
a 2GiB (give or take) memslot. For that use case, zapping _just_ that memslot is
likely far superious than blasting and rebuilding the entire VM. But if userspace
deletes a 1TiB for some reason, e.g. for memory unplug?, then the fast zap is
probably better, even though it requires rebuilding all SPTEs.

> > There seems to be an attempt to abstract away the existence of Secure-
> > EPT in mmu.c, that is not fully successful. In this case the code
> > checks kvm_gfn_shared_mask() to see if it needs to handle the zapping
> > in a way specific needed by S-EPT. It ends up being a little confusing
> > because the actual check is about whether there is a shared bit. It
> > only works because only S-EPT is the only thing that has a
> > kvm_gfn_shared_mask().
> >
> > Doing something like (kvm->arch.vm_type == KVM_X86_TDX_VM) looks wrong,
> > but is more honest about what we are getting up to here. I'm not sure
> > though, what do you think?
>
> Right, I attempted and failed in zapping case. This is due to the restriction
> that the Secure-EPT pages must be removed from the leaves. the VMX case (also
> NPT, even SNP) heavily depends on zapping root entry as optimization.

As above, it's more nuanced than that. KVM has come to depend on the fast zap,
but it got that way *because* KVM has historical zapped everything, and userspace
has (unknowingly) relied on that behavior.

> I can think of
> - add TDX check. Looks wrong
> - Use kvm_gfn_shared_mask(kvm). confusing

Ya, even if we end up making it a hardcoded TDX thing, dress it up a bit. E.g.
even if KVM checks for a shared mask under the hood, add a helper to capture the
logic, e.g. kvm_zap_all_sptes_on_memslot_deletion(kvm).

> - Give other name for this check like zap_from_leafs (or better name?)
> The implementation is same to kvm_gfn_shared_mask() with comment.
> - Or we can add a boolean variable to struct kvm

If we _don't_ hardcode the behavior, a per-memslot flag or a per-VM capability
(and thus boolean) is likely the way to go. My off-the-cuff vote is probably for
a per-memslot flag.