Re: [PATCH v13 019/113] KVM: TDX: create/destroy VM structure

From: Zhi Wang
Date: Sun Apr 02 2023 - 04:42:11 EST


On Wed, 29 Mar 2023 18:01:53 -0700
Isaku Yamahata <isaku.yamahata@xxxxxxxxx> wrote:

> Hi, thanks for review.
>
> On Sun, Mar 26, 2023 at 02:09:36PM +0300,
> Zhi Wang <zhi.wang.linux@xxxxxxxxx> wrote:
>
> > On Sun, 12 Mar 2023 10:55:43 -0700
> > isaku.yamahata@xxxxxxxxx wrote:
>
> > > diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> > > index 8b02e605cfb5..3ede8a726b47 100644
> > > --- a/arch/x86/kvm/vmx/tdx.c
> > > +++ b/arch/x86/kvm/vmx/tdx.c
> > > @@ -5,8 +5,9 @@
> > >
> > > #include "capabilities.h"
> > > #include "x86_ops.h"
> > > -#include "x86.h"
> > > #include "tdx.h"
> > > +#include "tdx_ops.h"
> > > +#include "x86.h"
> > >
> > > #undef pr_fmt
> > > #define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
> > > @@ -46,11 +47,276 @@ int tdx_vm_enable_cap(struct kvm *kvm, struct kvm_enable_cap *cap)
> > > return r;
> > > }
> > >
> > > +struct tdx_info {
> > > + u8 nr_tdcs_pages;
> > > +};
> > > +
> > > +/* Info about the TDX module. */
> > > +static struct tdx_info tdx_info;
> > > +
> > > +/*
> > > + * Some TDX SEAMCALLs (TDH.MNG.CREATE, TDH.PHYMEM.CACHE.WB,
> > > + * TDH.MNG.KEY.RECLAIMID, TDH.MNG.KEY.FREEID etc) tries to acquire a global lock
> > > + * internally in TDX module. If failed, TDX_OPERAND_BUSY is returned without
> > > + * spinning or waiting due to a constraint on execution time. It's caller's
> > > + * responsibility to avoid race (or retry on TDX_OPERAND_BUSY). Use this mutex
> > > + * to avoid race in TDX module because the kernel knows better about scheduling.
> > > + */
> > > +static DEFINE_MUTEX(tdx_lock);
> > > +static struct mutex *tdx_mng_key_config_lock;
> > > +
> > > +static __always_inline hpa_t set_hkid_to_hpa(hpa_t pa, u16 hkid)
> > > +{
> > > + return pa | ((hpa_t)hkid << boot_cpu_data.x86_phys_bits);
> > > +}
> > > +
> > > +static inline bool is_td_created(struct kvm_tdx *kvm_tdx)
> > > +{
> > > + return kvm_tdx->tdr_pa;
> > > +}
> > > +
> > > +static inline void tdx_hkid_free(struct kvm_tdx *kvm_tdx)
> > > +{
> > > + tdx_guest_keyid_free(kvm_tdx->hkid);
> > > + kvm_tdx->hkid = 0;
> > > +}
> > > +
> > > +static inline bool is_hkid_assigned(struct kvm_tdx *kvm_tdx)
> > > +{
> > > + return kvm_tdx->hkid > 0;
> > > +}
> > > +
> > > int tdx_hardware_enable(void)
> > > {
> > > return tdx_cpu_enable();
> > > }
> > >
> > > +static void tdx_clear_page(unsigned long page_pa)
> > > +{
> > > + const void *zero_page = (const void *) __va(page_to_phys(ZERO_PAGE(0)));
> > > + void *page = __va(page_pa);
> > > + unsigned long i;
> > > +
> > > + if (!cpu_feature_enabled(X86_FEATURE_MOVDIR64B)) {
> > > + clear_page(page);
> > > + return;
> > > + }
> >
> > Is it possbile to have a TDX machine without MOVDIR64B support? I am not sure
> > if there is any other way for the kernel to clear the posioned cache line.
> >
> > If not, there should be a warn/bug at least and check if MOVDIR64B support
> > when initializing the TDX.
>
> Because the latest TDX specification uses movdir64b, it's safe for TDX
> to assume movdir64b.
> I'll add the check to TDX initialization part and drop it from here.
>
>
> > > +void tdx_mmu_release_hkid(struct kvm *kvm)
> > > +{
> > > + struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
> > > + cpumask_var_t packages;
> > > + bool cpumask_allocated;
> > > + u64 err;
> > > + int ret;
> > > + int i;
> > > +
> > > + if (!is_hkid_assigned(kvm_tdx))
> > > + return;
> > > +
> > > + if (!is_td_created(kvm_tdx))
> > > + goto free_hkid;
> > > +
> > > + cpumask_allocated = zalloc_cpumask_var(&packages, GFP_KERNEL);
> > > + cpus_read_lock();
> > > + for_each_online_cpu(i) {
> > > + if (cpumask_allocated &&
> > > + cpumask_test_and_set_cpu(topology_physical_package_id(i),
> > > + packages))
> > > + continue;
> >
> > Is this necessary to check cpumask_allocated in the while loop? if cpumask
> > is not succefully allocated, wouldn't it be better to bail out just after
> > it?
>
> No because we can't return error here. It's better to do in-efficiently freeing
> resources instead of leak.
>
> We can move the check out of loop. But it would be ugly
> (if () {cpu loop} else {cpu loop} ) and this function isn't performance
> critical. Also I think it's okay to depend on compiler optimization for loop
> invariant. My compiler didn't optimize it in this case, though.
>

Do you mean the tdh_mng_key_freeid() is still required if failing to allocate
the cpumask var and do TDH.PHYMEM_CACHE_WB(WBINVD) on each CPU?

Out of curiosity, I took a look on the TDX module source code [1], it seems TDX
module has an additional check in TDH.MNG.KEY.FREEID. TDH.MNG.VPFLUSHDONE [2]
will mark the pending wbinvd in a bitmap:

...
/**
* Create the WBINVD_BITMAP per-package.
* Set to 1 num_of_pkgs bits from the LSB
*/
global_data_ptr->kot.entries[curr_hkid].wbinvd_bitmap = global_data_ptr->pkg_config_bitmap; /* <----HERE */

// Set new TD life cycle state
tdr_ptr->management_fields.lifecycle_state = TD_BLOCKED;

// Set the proper new KOT entry state
global_data_ptr->kot.entries[curr_hkid].state = (uint8_t)KOT_STATE_HKID_FLUSHED;
...

And TDH.MNG.KEY.FREEID [3] will check if the pending WBINVD has been performed:

...
/**
* If TDH_PHYMEM_CACHE_WB was executed on all packages/cores,
* set the KOT entry, set the KOT entry state to HKID_FREE.
*/
curr_hkid = tdr_ptr->key_management_fields.hkid;
tdx_debug_assert(global_data_ptr->kot.entr/ies[curr_hkid].state == KOT_STATE_HKID_FLUSHED);
if (global_data_ptr->kot.entries[curr_hkid].wbinvd_bitmap != 0) /* HERE */
{
TDX_ERROR("CACHEWB is not complete for this HKID (=%x)\n", curr_hkid);
return_val = TDX_WBCACHE_NOT_COMPLETE;
goto EXIT;
}
...

Guess the conclusion is: if TDH.PHYMEM.CACHE.WB is not performed on each
required CPU correctly, TDH.MNG.KEY.FREEID will fail as well. A leak seems
the only option (none of us likes a leak, but...).

It would be better that:

1) Leave a comment about the finding above in the code to explain why leak
happens (It is always nice to explain the reason of a leak). One sentence
will be good enough.

2) If failing to allocate the cpumask, bail out (with the findings above)

...
cpumask_allocated = zalloc_cpumask_var(&packages, GFP_KERNEL);
if (!cpumask_allocated)
return;
...

3) As the reason of the leak will be explained in tdx_mmu_release_hkid(),
leaving a pointer in the comment in tdx_vm_free() to refer to
tdx_mmu_release_hkid() would be enough, like:

...
/*
* Can't reclaim or free TD pages if teardown failed in
* tdx_mmu_release_hkid().
*/
if (is_hkid_assigned(kvm_tdx))
return;
...

[1] https://downloadmirror.intel.com/738876/tdx-module-v1.0.01.01.zip
[2] tdx-module/src/vmm_dispatcher/api_calls/tdh_mng_vpflushdone.c
[3] tdx-module/src/vmm_dispatcher/api_calls/tdh_mng_key_freeid.c

>
> > > +
> > > + /*
> > > + * We can destroy multiple the guest TDs simultaneously.
> > > + * Prevent tdh_phymem_cache_wb from returning TDX_BUSY by
> > > + * serialization.
> > > + */
> > > + mutex_lock(&tdx_lock);
> > > + ret = smp_call_on_cpu(i, tdx_do_tdh_phymem_cache_wb, NULL, 1);
> > > + mutex_unlock(&tdx_lock);
> > > + if (ret)
> > > + break;
> > > + }
> > > + cpus_read_unlock();
> > > + free_cpumask_var(packages);
> > > +
> > > + mutex_lock(&tdx_lock);
> > > + err = tdh_mng_key_freeid(kvm_tdx->tdr_pa);
> > > + mutex_unlock(&tdx_lock);
> > > + if (WARN_ON_ONCE(err)) {
> > > + pr_tdx_error(TDH_MNG_KEY_FREEID, err, NULL);
> > > + pr_err("tdh_mng_key_freeid failed. HKID %d is leaked.\n",
> > > + kvm_tdx->hkid);
> > > + return;
> > > + }
> > > +
> > > +free_hkid:
> > > + tdx_hkid_free(kvm_tdx);
> > > +}
> > > +
> > > +void tdx_vm_free(struct kvm *kvm)
> > > +{
> > > + struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
> > > + int i;
> > > +
> > > + /* Can't reclaim or free TD pages if teardown failed. */
> > > + if (is_hkid_assigned(kvm_tdx))
> > > + return;
> > > +
> >
> > Better to explain why, as it is common to think even the teardown failed, we
> > should still try to reclaim the pages as many as we can.
>
> Ok, here is the updated comment.
> /*
> * tdx_mmu_release_hkid() failed to reclaim HKID. Something went wrong
> * heavily with TDX module. Give up freeing TD pages. As the function
> * already warned, don't warn it again.
> */