Re: [PATCH RFC] KVM: TDX: Defer guest memory removal to decrease shutdown time

From: Adrian Hunter
Date: Fri Mar 28 2025 - 06:27:03 EST


On 27/03/25 17:54, Sean Christopherson wrote:
> On Thu, Mar 13, 2025, Adrian Hunter wrote:
>> Improve TDX shutdown performance by adding a more efficient shutdown
>> operation, at the cost of separate branches in the TDX MMU operations
>> for normal runtime versus shutdown. This method was
>> previously used in earlier versions of the TDX patches, but was removed to
>> simplify the initial upstreaming. This is an RFC, and still needs a proper
>> upstream commit log. It is intended to be an eventual follow up to base
>> support.
>
> ...
>
>> == Options ==
>>
>> 1. Start TD teardown earlier so that when pages are removed,
>> they can be reclaimed faster.
>> 2. Defer page removal until after TD teardown has started.
>> 3. A combination of 1 and 2.
>>
>> Option 1 is problematic because it means putting the TD into a non-runnable
>> state while it is potentially still active. Also, as mentioned above, Sean
>> effectively NAK'ed it.
>
> Option 2 is just as gross, arguably even worse. I NAK'd a flavor of option 1,
> not the base concept of initiating teardown before all references to the VM are
> put.
>
> AFAICT, nothing outright prevents adding a TDX sub-ioctl to terminate the VM.
> The locking is a bit heinous, but I would prefer heavy locking to deferring
> reclaim and pinning inodes.
>
> Oh FFS. This is also an opportunity to clean up RISC-V's insidious copy-paste of
> ARM. Because extracting (un)lock_all_vcpus() to common code would have been sooo
> hard. *sigh*
>
> Very roughly, something like the below (*completely* untested).

Thank you! I will give it a test.
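
FWIW, for the common (un)lock_all_vcpus() helpers, I imagine something
roughly like the below, modelled on ARM's lock_all_vcpus() (a sketch
only: the names, the kvm parameter, and using mutex_lock_killable()
instead of ARM's mutex_trylock() are my assumptions, not settled):

int kvm_lock_all_vcpus(struct kvm *kvm)
{
	struct kvm_vcpu *vcpu;
	unsigned long i, j;

	lockdep_assert_held(&kvm->lock);

	kvm_for_each_vcpu(i, vcpu, kvm) {
		if (mutex_lock_killable(&vcpu->mutex))
			goto out_unlock;
	}
	return 0;

out_unlock:
	/* Unwind: drop the vcpu->mutex of every vCPU locked so far. */
	kvm_for_each_vcpu(j, vcpu, kvm) {
		if (j == i)
			break;
		mutex_unlock(&vcpu->mutex);
	}
	return -EINTR;
}

void kvm_unlock_all_vcpus(struct kvm *kvm)
{
	struct kvm_vcpu *vcpu;
	unsigned long i;

	kvm_for_each_vcpu(i, vcpu, kvm)
		mutex_unlock(&vcpu->mutex);
}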

>
> An alternative to taking mmu_lock would be to lock all bound guest_memfds, but I
> think I prefer taking mmu_lock as it's easier to reason about the safety of freeing
> the HKID. Note, the truncation phase of a PUNCH_HOLE could still run in parallel,
> but that's a-ok. The only part of PUNCH_HOLE that needs to be blocked is the call
> to kvm_mmu_unmap_gfn_range().
>
> ---
> arch/x86/kvm/vmx/tdx.c | 61 ++++++++++++++++++++++++++++++------------
> 1 file changed, 44 insertions(+), 17 deletions(-)
>
> diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> index 87f188021cbd..6fb595c272ab 100644
> --- a/arch/x86/kvm/vmx/tdx.c
> +++ b/arch/x86/kvm/vmx/tdx.c
> @@ -472,7 +472,7 @@ static void smp_func_do_phymem_cache_wb(void *unused)
> pr_tdx_error(TDH_PHYMEM_CACHE_WB, err);
> }
>
> -void tdx_mmu_release_hkid(struct kvm *kvm)
> +static void __tdx_release_hkid(struct kvm *kvm, bool terminate)
> {
> bool packages_allocated, targets_allocated;
> struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
> @@ -485,10 +485,11 @@ void tdx_mmu_release_hkid(struct kvm *kvm)
> if (!is_hkid_assigned(kvm_tdx))
> return;
>
> + if (KVM_BUG_ON(refcount_read(&kvm->users_count) && !terminate, kvm))
> + return;
> +
> packages_allocated = zalloc_cpumask_var(&packages, GFP_KERNEL);
> targets_allocated = zalloc_cpumask_var(&targets, GFP_KERNEL);
> - cpus_read_lock();
> -
> kvm_for_each_vcpu(j, vcpu, kvm)
> tdx_flush_vp_on_cpu(vcpu);
>
> @@ -500,12 +501,8 @@ void tdx_mmu_release_hkid(struct kvm *kvm)
> */
> mutex_lock(&tdx_lock);
>
> - /*
> - * Releasing HKID is in vm_destroy().
> - * After the above flushing vps, there should be no more vCPU
> - * associations, as all vCPU fds have been released at this stage.
> - */
> err = tdh_mng_vpflushdone(&kvm_tdx->td);
> + /* Uh, what's going on here? */
> if (err == TDX_FLUSHVP_NOT_DONE)
> goto out;

Looks like it should not be needed.

First introduced here:

[PATCH v16 071/116] KVM: TDX: handle vcpu migration over logical processor
https://lore.kernel.org/af6cb5e6abae5a19d6d2eeb452d29a96233e5a44.1697471314.git.isaku.yamahata@xxxxxxxxx

Explained a bit:

Re: [PATCH v17 071/116] KVM: TDX: handle vcpu migration over logical processor
https://lore.kernel.org/20231117080804.GF1277973@xxxxxxxxxxxxxxxxxxxxx

And explained a bit more:

Re: [PATCH v19 087/130] KVM: TDX: handle vcpu migration over logical processor
https://lore.kernel.org/all/20240415224828.GS3039520@xxxxxxxxxxxxxxxxxxxxx/

> if (KVM_BUG_ON(err, kvm)) {
> @@ -515,6 +512,7 @@ void tdx_mmu_release_hkid(struct kvm *kvm)
> goto out;
> }
>
> + write_lock(&kvm->mmu_lock);
> for_each_online_cpu(i) {
> if (packages_allocated &&
> cpumask_test_and_set_cpu(topology_physical_package_id(i),
> @@ -539,14 +537,21 @@ void tdx_mmu_release_hkid(struct kvm *kvm)
> } else {
> tdx_hkid_free(kvm_tdx);
> }
> -
> + write_unlock(&kvm->mmu_lock);
> out:
> mutex_unlock(&tdx_lock);
> - cpus_read_unlock();
> free_cpumask_var(targets);
> free_cpumask_var(packages);
> }
>
> +void tdx_mmu_release_hkid(struct kvm *kvm)
> +{
> + cpus_read_lock();
> + __tdx_release_hkid(kvm, false);
> + cpus_read_unlock();
> +}
> +
> static void tdx_reclaim_td_control_pages(struct kvm *kvm)
> {
> struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
> @@ -1789,13 +1794,10 @@ int tdx_sept_remove_private_spte(struct kvm *kvm, gfn_t gfn,
> struct page *page = pfn_to_page(pfn);
> int ret;
>
> - /*
> - * HKID is released after all private pages have been removed, and set
> - * before any might be populated. Warn if zapping is attempted when
> - * there can't be anything populated in the private EPT.
> - */
> - if (KVM_BUG_ON(!is_hkid_assigned(to_kvm_tdx(kvm)), kvm))
> - return -EINVAL;
> + if (!is_hkid_assigned(to_kvm_tdx(kvm))) {
> + WARN_ON_ONCE(!kvm->vm_dead);
> + return tdx_reclaim_page(page);
> + }
>
> ret = tdx_sept_zap_private_spte(kvm, gfn, level, page);
> if (ret <= 0)
> @@ -2790,6 +2792,28 @@ static int tdx_td_finalize(struct kvm *kvm, struct kvm_tdx_cmd *cmd)
> return 0;
> }
>
> +static int tdx_td_terminate(struct kvm *kvm)
> +{
> + int r;
> +
> + cpus_read_lock();
> + guard(mutex)(&kvm->lock);
> +
> + r = kvm_lock_all_vcpus(kvm);
> + if (r)
> + goto out;
> +
> + kvm_vm_dead(kvm);
> + kvm_unlock_all_vcpus(kvm);
> +
> + __tdx_release_hkid(kvm, true);
> +out:
> + cpus_read_unlock();
> + return r;
> +}
> +
> int tdx_vm_ioctl(struct kvm *kvm, void __user *argp)
> {
> struct kvm_tdx_cmd tdx_cmd;
> @@ -2805,6 +2829,9 @@ int tdx_vm_ioctl(struct kvm *kvm, void __user *argp)
> if (tdx_cmd.hw_error)
> return -EINVAL;
>
> + if (tdx_cmd.id == KVM_TDX_TERMINATE_VM)
> + return tdx_td_terminate(kvm);
> +
> mutex_lock(&kvm->lock);
>
> switch (tdx_cmd.id) {
>
> base-commit: 2156c3c7d60c5be9c0d9ab1fedccffe3c55a2ca0
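
For testing the new sub-ioctl from userspace, I would expect to go
through KVM_MEMORY_ENCRYPT_OP on the VM fd, roughly like this (sketch
only; it assumes KVM_TDX_TERMINATE_VM gets added to enum
kvm_tdx_cmd_id, and error handling is elided):

#include <sys/ioctl.h>
#include <linux/kvm.h>

/*
 * Ask KVM to terminate the TD before the VM fd is closed, so that
 * private pages are removed after the HKID has been released.
 */
static int tdx_terminate_vm(int vm_fd)
{
	struct kvm_tdx_cmd cmd = {
		.id = KVM_TDX_TERMINATE_VM,
	};

	return ioctl(vm_fd, KVM_MEMORY_ENCRYPT_OP, &cmd);
}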