Re: [PATCH v19 038/130] KVM: TDX: create/destroy VM structure

From: Isaku Yamahata
Date: Wed Apr 03 2024 - 13:48:39 EST


On Thu, Mar 28, 2024 at 12:33:35PM +1300,
"Huang, Kai" <kai.huang@xxxxxxxxx> wrote:

> > + kvm_tdx->tdr_pa = tdr_pa;
> > +
> > + for_each_online_cpu(i) {
> > + int pkg = topology_physical_package_id(i);
> > +
> > + if (cpumask_test_and_set_cpu(pkg, packages))
> > + continue;
> > +
> > + /*
> > + * Program the memory controller in the package with an
> > + * encryption key associated to a TDX private host key id
> > + * assigned to this TDR. Concurrent operations on same memory
> > + * controller results in TDX_OPERAND_BUSY. Avoid this race by
> > + * mutex.
> > + */
>
> IIUC the race can only happen when you are creating multiple TDX guests
> simulatenously? Please clarify this in the comment.
>
> And I even don't think you need all these TDX module details:
>
> /*
> * Concurrent run of TDH.MNG.KEY.CONFIG on the same
> * package resluts in TDX_OPERAND_BUSY. When creating
> * multiple TDX guests simultaneously this can run
> * concurrently. Take the per-package lock to
> * serialize.
> */

As pointed by Chao, those mutex will be dropped.
https://lore.kernel.org/kvm/ZfpwIespKy8qxWWE@chao-email/
Also we would simplify cpu masks to track which package is online/offline,
which cpu to use for each package somehow.


> > + mutex_lock(&tdx_mng_key_config_lock[pkg]);
> > + ret = smp_call_on_cpu(i, tdx_do_tdh_mng_key_config,
> > + &kvm_tdx->tdr_pa, true);
> > + mutex_unlock(&tdx_mng_key_config_lock[pkg]);
> > + if (ret)
> > + break;
> > + }
> > + cpus_read_unlock();
> > + free_cpumask_var(packages);
> > + if (ret) {
> > + i = 0;
> > + goto teardown;
> > + }
> > +
> > + kvm_tdx->tdcs_pa = tdcs_pa;
> > + for (i = 0; i < tdx_info->nr_tdcs_pages; i++) {
> > + err = tdh_mng_addcx(kvm_tdx->tdr_pa, tdcs_pa[i]);
> > + if (err == TDX_RND_NO_ENTROPY) {
> > + /* Here it's hard to allow userspace to retry. */
> > + ret = -EBUSY;
> > + goto teardown;
> > + }
> > + if (WARN_ON_ONCE(err)) {
> > + pr_tdx_error(TDH_MNG_ADDCX, err, NULL);
> > + ret = -EIO;
> > + goto teardown;
> > + }
> > + }
> > +
> > + /*
> > + * Note, TDH_MNG_INIT cannot be invoked here. TDH_MNG_INIT requires a dedicated
> > + * ioctl() to define the configure CPUID values for the TD.
> > + */
>
> Then, how about renaming this function to __tdx_td_create()?

So do we want to rename also ioctl name for consistency?
i.e. KVM_TDX_INIT_VM => KVM_TDX_CREATE_VM.

I don't have strong opinion those names. Maybe
KVM_TDX_{INIT, CREATE, or CONFIG}_VM?
And we can rename the function name to match it.

> > + return 0;
> > +
> > + /*
> > + * The sequence for freeing resources from a partially initialized TD
> > + * varies based on where in the initialization flow failure occurred.
> > + * Simply use the full teardown and destroy, which naturally play nice
> > + * with partial initialization.
> > + */
> > +teardown:
> > + for (; i < tdx_info->nr_tdcs_pages; i++) {
> > + if (tdcs_pa[i]) {
> > + free_page((unsigned long)__va(tdcs_pa[i]));
> > + tdcs_pa[i] = 0;
> > + }
> > + }
> > + if (!kvm_tdx->tdcs_pa)
> > + kfree(tdcs_pa);
>
> The code to "free TDCS pages in a loop and free the array" is done below
> with duplicated code. I am wondering whether we have way to eliminate one.
>
> But I have lost track here, so perhaps we can review again after we split
> the patch to smaller pieces.

Surely we can simplify it. Originally we had a spin lock and I had to separate
blocking memory allocation from its usage with this error clean up path.
Now it's mutex, we mix page allocation with its usage.


> > + tdx_mmu_release_hkid(kvm);
> > + tdx_vm_free(kvm);
> > + return ret;
> > +
> > +free_packages:
> > + cpus_read_unlock();
> > + free_cpumask_var(packages);
> > +free_tdcs:
> > + for (i = 0; i < tdx_info->nr_tdcs_pages; i++) {
> > + if (tdcs_pa[i])
> > + free_page((unsigned long)__va(tdcs_pa[i]));
> > + }
> > + kfree(tdcs_pa);
> > + kvm_tdx->tdcs_pa = NULL;
> > +
> > +free_tdr:
> > + if (tdr_pa)
> > + free_page((unsigned long)__va(tdr_pa));
> > + kvm_tdx->tdr_pa = 0;
> > +free_hkid:
> > + if (is_hkid_assigned(kvm_tdx))
> > + tdx_hkid_free(kvm_tdx);
> > + return ret;
> > +}
> > +
> > int tdx_vm_ioctl(struct kvm *kvm, void __user *argp)
> > {
> > struct kvm_tdx_cmd tdx_cmd;
> > @@ -215,12 +664,13 @@ static int tdx_md_read(struct tdx_md_map *maps, int nr_maps)
> > static int __init tdx_module_setup(void)
> > {
> > - u16 num_cpuid_config;
> > + u16 num_cpuid_config, tdcs_base_size;
> > int ret;
> > u32 i;
> > struct tdx_md_map mds[] = {
> > TDX_MD_MAP(NUM_CPUID_CONFIG, &num_cpuid_config),
> > + TDX_MD_MAP(TDCS_BASE_SIZE, &tdcs_base_size),
> > };
> > struct tdx_metadata_field_mapping fields[] = {
> > @@ -273,6 +723,8 @@ static int __init tdx_module_setup(void)
> > c->edx = ecx_edx >> 32;
> > }
> > + tdx_info->nr_tdcs_pages = tdcs_base_size / PAGE_SIZE;
> > +
>
> Round up the 'tdcs_base_size' to make sure you have enough room, or put a
> WARN() here if not page aligned?

Ok, will add round up. Same for tdvps_base_size.
I can't find about those sizes and page size in the TDX spec. Although
TDH.MNG.ADDCX() and TDH.VP.ADDCX() imply that those sizes are multiple of PAGE
SIZE, the spec doesn't guarantee it. I think silent round up is better than
WARN() because we can do nothing about those values the TDX module provides.



> > return 0;
> > error_out:
> > @@ -319,13 +771,27 @@ int __init tdx_hardware_setup(struct kvm_x86_ops *x86_ops)
> > struct tdx_enabled enable = {
> > .err = ATOMIC_INIT(0),
> > };
> > + int max_pkgs;
> > int r = 0;
> > + int i;
>
> Nit: you can put the 3 into one line.
>
> > + if (!cpu_feature_enabled(X86_FEATURE_MOVDIR64B)) {
> > + pr_warn("MOVDIR64B is reqiured for TDX\n");
>
> It's better to make it more clear:
>
> "Disable TDX: MOVDIR64B is not supported or disabled by the kernel."
>
> Or, to match below:
>
> "Cannot enable TDX w/o MOVDIR64B".

Ok.


> > + return -EOPNOTSUPP;
> > + }
> > if (!enable_ept) {
> > pr_warn("Cannot enable TDX with EPT disabled\n");
> > return -EINVAL;
> > }
> > + max_pkgs = topology_max_packages();
> > + tdx_mng_key_config_lock = kcalloc(max_pkgs, sizeof(*tdx_mng_key_config_lock),
> > + GFP_KERNEL);
> > + if (!tdx_mng_key_config_lock)
> > + return -ENOMEM;
> > + for (i = 0; i < max_pkgs; i++)
> > + mutex_init(&tdx_mng_key_config_lock[i]);
> > +
>
> Using a per-socket lock looks a little bit overkill to me. I don't know
> whether we need to do in the initial version. Will leave to others.
>
> Please at least add a comment to explain this is for better performance when
> creating multiple TDX guests IIUC?

Will delete the mutex and simply the related logic.
--
Isaku Yamahata <isaku.yamahata@xxxxxxxxx>