RE: [RFC PATCH] clocksource: hyper-v: Enable the tsc_page for a TDX VM in TD mode

From: Michael Kelley
Date: Fri May 24 2024 - 18:44:22 EST


From: Dexuan Cui <decui@xxxxxxxxxxxxx> Sent: Friday, May 24, 2024 1:46 AM
>
> > From: Dave Hansen <dave.hansen@xxxxxxxxx>
> > Sent: Thursday, May 23, 2024 7:26 AM
> > [...]
> > On 5/22/24 19:24, Dexuan Cui wrote:
> > ...
> > > +static bool noinstr intel_cc_platform_td_l2(enum cc_attr attr)
> > > +{
> > > +	switch (attr) {
> > > +	case CC_ATTR_GUEST_MEM_ENCRYPT:
> > > +	case CC_ATTR_MEM_ENCRYPT:
> > > +		return true;
> > > +	default:
> > > +		return false;
> > > +	}
> > > +}
> > > +
> > >  static bool noinstr intel_cc_platform_has(enum cc_attr attr)
> > >  {
> > > +	if (tdx_partitioned_td_l2)
> > > +		return intel_cc_platform_td_l2(attr);
> > > +
> > >  	switch (attr) {
> > >  	case CC_ATTR_GUEST_UNROLL_STRING_IO:
> > >  	case CC_ATTR_HOTPLUG_DISABLED:
> >
> > On its face, this _looks_ rather troubling. It just hijacks all of the
> > attributes. It totally bifurcates the code. Anything that gets added
> > to intel_cc_platform_has() now needs to be considered for addition to
> > intel_cc_platform_td_l2().
>
> Maybe the bifurcation is necessary? TD mode is different from
> Partitioned TD mode (L2), after all. Another reason for the bifurcation
> is: currently CPU onlining/offlining is disallowed for a TD VM, but
> Hyper-V is actually able to support CPU onlining/offlining for a TD VM
> in Partitioned TD mode (L2) -- how can we allow onlining/offlining for
> such a VM?
>
> BTW, the bifurcation code is copied from amd_cc_platform_has(), where
> an AMD SNP VM may run in the vTOM mode.
>
> > > --- a/arch/x86/mm/mem_encrypt_amd.c
> > > +++ b/arch/x86/mm/mem_encrypt_amd.c
> > ...
> > > @@ -529,7 +530,7 @@ void __init mem_encrypt_free_decrypted_mem(void)
> > >  	 * CC_ATTR_MEM_ENCRYPT, aren't necessarily equivalent in a Hyper-V VM
> > >  	 * using vTOM, where sme_me_mask is always zero.
> > >  	 */
> > > -	if (sme_me_mask) {
> > > +	if (sme_me_mask || (cc_vendor == CC_VENDOR_INTEL && !tdx_partitioned_td_l2)) {

FWIW, the above won't work in a kernel built with CONFIG_TDX_GUEST=y
but CONFIG_AMD_MEM_ENCRYPT=n. mem_encrypt_free_decrypted_mem()
in arch/x86/mm/mem_encrypt_amd.c won't get built, and an empty stub is used.

> > >  		r = set_memory_encrypted(vaddr, npages);
> > >  		if (r) {
> > >  			pr_warn("failed to free unused decrypted pages\n");
> >
> > If _ever_ there were a place for a new CC_ attribute, this would be it.
> Not sure how to add a new CC attribute for the __bss_decrypted support.
>
> For the CPU onlining/offlining support, I'm not sure how to add a new
> CC attribute without introducing the bifurcation.
>
> > It's also a bit concerning that now we've got a (cc_vendor ==
> > CC_VENDOR_INTEL) check in an amd.c file.
> I agree my change here is ugly...
> Currently the __bss_decrypted support is only used for SNP.
> Not sure if we should get it to work for TDX as well.
>
> > So all of that on top of Kirill's "why do we need this in the first
> > place" questions leave me really scratching my head on this one.
> Probably I'll just use the local APIC timer in such a VM, or delay
> enabling the Hyper-V TSC page to a later point where
> set_memory_decrypted() works for me. However, I still would like to
> find out how to allow CPU onlining/offlining for a TDX VM in
> Partitioned TD mode (L2).
>

My thoughts:

__bss_decrypted is named as if it applies to any CoCo VM, but really
it is specific to AMD SEV. It was originally used for a GHCB page, which
is SEV-specific, and then it proved to be convenient for the Hyper-V TSC
page. Ideally, we could fix __bss_decrypted to work generally in a
TDX VM without any dependency on code specific to a hypervisor. But
looking at some of the details, that may be non-trivial.

A narrower solution is to remove the Hyper-V TSC page from
__bss_decrypted, and use Hyper-V specific code on both TDX and
SEV-SNP to decrypt just that page (not the entire __bss_decrypted),
based on whether the Hyper-V guest is running with a paravisor.
From Dexuan's patch, it looks like set_memory_decrypted()
works on TDX at the time that ms_hyperv_init_platform() runs.
Does it also work on SEV-SNP? The code in kvm_init_platform()
uses early_set_mem_enc_dec_hypercall() with
kvm_sev_hc_page_enc_status(), which is SEV only. So maybe
the normal set_memory_decrypted() doesn't work on SEV at
that point, though I'm not at all clear on what kvm_init_platform() is
trying to do. Shouldn't __bss_decrypted already be set up correctly?
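The narrower approach above might look roughly like the following.
This is a hedged sketch only: hv_tsc_page and hv_decrypt_tsc_page()
are made-up names, and it assumes set_memory_decrypted() is usable
by the time ms_hyperv_init_platform() runs on both TDX and SNP,
which is exactly the open question.

```c
/* Allocate the TSC page as ordinary kernel data instead of placing
 * it in __bss_decrypted, then decrypt just this one page. */
static struct ms_hyperv_tsc_page hv_tsc_page __aligned(PAGE_SIZE);

static int __init hv_decrypt_tsc_page(void)
{
	int ret;

	/* Only needed when running without a paravisor, where the page
	 * must be shared with the (untrusted) host hypervisor. */
	if (!hv_is_isolation_supported() || ms_hyperv.paravisor_present)
		return 0;

	ret = set_memory_decrypted((unsigned long)&hv_tsc_page, 1);
	if (ret)
		pr_err("Hyper-V: failed to decrypt TSC page: %d\n", ret);
	return ret;
}
```

That would remove the TSC page's dependency on the SEV-specific
__bss_decrypted machinery entirely.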

The issue of taking CPUs offline is separate. Is the inability to take
a CPU offline with TDX an architectural limitation? Or just a
current Linux implementation limitation? And what about in an
L2 TDX VM? If the existence of a limitation in an L2 TDX VM is
dependent on the hypervisor/paravisor, then can cc_platform_has()
check some architectural flag (that's independent of the host
hypervisor) to know if it is running in an L2 TDX VM and return false
for CC_ATTR_HOTPLUG_DISABLED? If a host/paravisor combo doesn't
allow taking an L2 TDX VM CPU offline, then it would be up to that
combo to implement the appropriate restriction. It's not hard to add
a CPUHP state that would prevent it.
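For that last point, one way a host/paravisor combo could implement the
restriction is a dynamically allocated hotplug state whose teardown
callback fails, aborting the offline attempt. A rough sketch (function
and state names here are made up for illustration):

```c
static int hv_td_l2_cpu_offline(unsigned int cpu)
{
	/* Failing the teardown callback rolls the CPU back online. */
	return -EBUSY;
}

static int __init hv_td_l2_restrict_offline(void)
{
	/* No startup callback; only block the offline path. */
	return cpuhp_setup_state(CPUHP_AP_ONLINE_DYN,
				 "x86/hyperv:td_l2_online",
				 NULL, hv_td_l2_cpu_offline);
}
```

Whether to register it would be decided by whatever architectural flag
identifies the L2 TDX configuration.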

Michael