Re: [PATCH RFC] KVM: TDX: Defer guest memory removal to decrease shutdown time

From: Sean Christopherson
Date: Thu Mar 27 2025 - 11:55:05 EST

Next message: Steven Rostedt: "[for-next][PATCH] tracing: Use _text and the kernel offset in last_boot_info"
Previous message: Doug Anderson: "Re: [PATCH v2 1/4] dt-bindings: display: panel: samsung,atna40yk20: document ATNA40YK20"
In reply to: Adrian Hunter: "Re: [PATCH RFC] KVM: TDX: Defer guest memory removal to decrease shutdown time"
Next in thread: Adrian Hunter: "Re: [PATCH RFC] KVM: TDX: Defer guest memory removal to decrease shutdown time"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

On Thu, Mar 13, 2025, Adrian Hunter wrote:
> Improve TDX shutdown performance by adding a more efficient shutdown
> operation at the cost of adding separate branches for the TDX MMU
> operations for normal runtime and shutdown. This more efficient method was
> previously used in earlier versions of the TDX patches, but was removed to
> simplify the initial upstreaming. This is an RFC, and still needs a proper
> upstream commit log. It is intended to be an eventual follow up to base
> support.

...

> == Options ==
>
> 1. Start TD teardown earlier so that when pages are removed,
> they can be reclaimed faster.
> 2. Defer page removal until after TD teardown has started.
> 3. A combination of 1 and 2.
>
> Option 1 is problematic because it means putting the TD into a non-runnable
> state while it is potentially still active. Also, as mentioned above, Sean
> effectively NAK'ed it.

Option 2 is just as gross, arguably even worse. I NAK'd a flavor of option 1,
not the base concept of initiating teardown before all references to the VM are
put.

AFAICT, nothing outright prevents adding a TDX sub-ioctl to terminate the VM.
The locking is a bit heinous, but I would prefer heavy locking to deferring
reclaim and pinning inodes.

Oh FFS. This is also an opportunity to cleanup RISC-V's insidious copy-paste of
ARM. Because extracting (un)lock_all_vcpus() to common code would have been sooo
hard. *sigh*

Very roughly, something like the below (*completely* untested).

An alternative to taking mmu_lock would be to lock all bound guest_memfds, but I
think I prefer taking mmu_lock is it's easier to reason about the safety of freeing
the HKID. Note, the truncation phase of a PUNCH_HOLE could still run in parallel,
but that's a-ok. The only part of PUNCH_HOLE that needs to be blocked is the call
to kvm_mmu_unmap_gfn_range().

---
arch/x86/kvm/vmx/tdx.c | 61 ++++++++++++++++++++++++++++++------------
1 file changed, 44 insertions(+), 17 deletions(-)

diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 87f188021cbd..6fb595c272ab 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -472,7 +472,7 @@ static void smp_func_do_phymem_cache_wb(void *unused)
pr_tdx_error(TDH_PHYMEM_CACHE_WB, err);
}

-void tdx_mmu_release_hkid(struct kvm *kvm)
+static void __tdx_release_hkid(struct kvm *kvm, bool terminate)
{
bool packages_allocated, targets_allocated;
struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
@@ -485,10 +485,11 @@ void tdx_mmu_release_hkid(struct kvm *kvm)
if (!is_hkid_assigned(kvm_tdx))
return;

+ if (KVM_BUG_ON(refcount_read(&kvm->users_count) && !terminate))
+ return;
+
packages_allocated = zalloc_cpumask_var(&packages, GFP_KERNEL);
targets_allocated = zalloc_cpumask_var(&targets, GFP_KERNEL);
- cpus_read_lock();
-
kvm_for_each_vcpu(j, vcpu, kvm)
tdx_flush_vp_on_cpu(vcpu);

@@ -500,12 +501,8 @@ void tdx_mmu_release_hkid(struct kvm *kvm)
*/
mutex_lock(&tdx_lock);

- /*
- * Releasing HKID is in vm_destroy().
- * After the above flushing vps, there should be no more vCPU
- * associations, as all vCPU fds have been released at this stage.
- */
err = tdh_mng_vpflushdone(&kvm_tdx->td);
+ /* Uh, what's going on here? */
if (err == TDX_FLUSHVP_NOT_DONE)
goto out;
if (KVM_BUG_ON(err, kvm)) {
@@ -515,6 +512,7 @@ void tdx_mmu_release_hkid(struct kvm *kvm)
goto out;
}

+ write_lock(&kvm->mmu_lock);
for_each_online_cpu(i) {
if (packages_allocated &&
cpumask_test_and_set_cpu(topology_physical_package_id(i),
@@ -539,14 +537,21 @@ void tdx_mmu_release_hkid(struct kvm *kvm)
} else {
tdx_hkid_free(kvm_tdx);
}
-
+ write_unlock(&kvm->mmu_lock);
out:
mutex_unlock(&tdx_lock);
- cpus_read_unlock();
free_cpumask_var(targets);
free_cpumask_var(packages);
}

+void tdx_mmu_release_hkid(struct kvm *kvm)
+{
+ cpus_read_lock();
+ __tdx_release_hkid(kvm, false);
+ cpus_read_unlock();
+}
+
+
static void tdx_reclaim_td_control_pages(struct kvm *kvm)
{
struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
@@ -1789,13 +1794,10 @@ int tdx_sept_remove_private_spte(struct kvm *kvm, gfn_t gfn,
struct page *page = pfn_to_page(pfn);
int ret;

- /*
- * HKID is released after all private pages have been removed, and set
- * before any might be populated. Warn if zapping is attempted when
- * there can't be anything populated in the private EPT.
- */
- if (KVM_BUG_ON(!is_hkid_assigned(to_kvm_tdx(kvm)), kvm))
- return -EINVAL;
+ if (!is_hkid_assigned(to_kvm_tdx(kvm)), kvm) {
+ WARN_ON_ONCE(!kvm->vm_dead);
+ return tdx_reclaim_page(pfn_to_page(pfn));
+ }

ret = tdx_sept_zap_private_spte(kvm, gfn, level, page);
if (ret <= 0)
@@ -2790,6 +2792,28 @@ static int tdx_td_finalize(struct kvm *kvm, struct kvm_tdx_cmd *cmd)
return 0;
}

+static int tdx_td_terminate(struct kvm *kvm)
+{
+ struct kvm_memory_slot *slot;
+ struct kvm_memslots *slots;
+ int bkt;
+
+ cpus_read_lock();
+ guard(mutex)(&kvm->lock);
+
+ r = kvm_lock_all_vcpus();
+ if (r)
+ goto out;
+
+ kvm_vm_dead(kvm);
+ kvm_unlock_all_vcpus();
+
+ __tdx_release_hkid(kvm);
+out:
+ cpus_read_unlock();
+ return r;
+}
+
int tdx_vm_ioctl(struct kvm *kvm, void __user *argp)
{
struct kvm_tdx_cmd tdx_cmd;
@@ -2805,6 +2829,9 @@ int tdx_vm_ioctl(struct kvm *kvm, void __user *argp)
if (tdx_cmd.hw_error)
return -EINVAL;

+ if (tdx_cmd.id == KVM_TDX_TERMINATE_VM)
+ return tdx_td_terminate(kvm);
+
mutex_lock(&kvm->lock);

switch (tdx_cmd.id) {

base-commit: 2156c3c7d60c5be9c0d9ab1fedccffe3c55a2ca0
--

Next message: Steven Rostedt: "[for-next][PATCH] tracing: Use _text and the kernel offset in last_boot_info"
Previous message: Doug Anderson: "Re: [PATCH v2 1/4] dt-bindings: display: panel: samsung,atna40yk20: document ATNA40YK20"
In reply to: Adrian Hunter: "Re: [PATCH RFC] KVM: TDX: Defer guest memory removal to decrease shutdown time"
Next in thread: Adrian Hunter: "Re: [PATCH RFC] KVM: TDX: Defer guest memory removal to decrease shutdown time"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]