Re: [PATCH v3 22/28] KVM: x86/mmu: Zap defunct roots via asynchronous worker
From: Ben Gardon
Date: Tue Mar 01 2022 - 12:57:28 EST
On Fri, Feb 25, 2022 at 4:16 PM Sean Christopherson <seanjc@xxxxxxxxxx> wrote:
>
> Zap defunct roots, a.k.a. roots that have been invalidated after their
> last reference was initially dropped, asynchronously via the system work
> queue instead of forcing the work upon the unfortunate task that happened
> to drop the last reference.
>
> If a vCPU task drops the last reference, the vCPU is effectively blocked
> by the host for the entire duration of the zap. If the root being zapped
> happens be fully populated with 4kb leaf SPTEs, e.g. due to dirty logging
> being active, the zap can take several hundred seconds. Unsurprisingly,
> most guests are unhappy if a vCPU disappears for hundreds of seconds.
>
> E.g. running a synthetic selftest that triggers a vCPU root zap with
> ~64tb of guest memory and 4kb SPTEs blocks the vCPU for 900+ seconds.
> Offloading the zap to a worker drops the block time to <100ms.
>
> Signed-off-by: Sean Christopherson <seanjc@xxxxxxxxxx>
Reviewed-by: Ben Gardon <bgardon@xxxxxxxxxx>
> ---
> arch/x86/kvm/mmu/mmu_internal.h | 8 +++-
> arch/x86/kvm/mmu/tdp_mmu.c | 65 ++++++++++++++++++++++++++++-----
> 2 files changed, 63 insertions(+), 10 deletions(-)
>
> diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
> index be063b6c91b7..1bff453f7cbe 100644
> --- a/arch/x86/kvm/mmu/mmu_internal.h
> +++ b/arch/x86/kvm/mmu/mmu_internal.h
> @@ -65,7 +65,13 @@ struct kvm_mmu_page {
> struct kvm_rmap_head parent_ptes; /* rmap pointers to parent sptes */
> tdp_ptep_t ptep;
> };
> - DECLARE_BITMAP(unsync_child_bitmap, 512);
> + union {
> + DECLARE_BITMAP(unsync_child_bitmap, 512);
> + struct {
> + struct work_struct tdp_mmu_async_work;
> + void *tdp_mmu_async_data;
> + };
> + };
At some point (probably not in this series since it's so long already)
it would be good to organize kvm_mmu_page. It looks like we've got
quite a few anonymous unions in there for TDP / Shadow MMU fields.
>
> struct list_head lpage_disallowed_link;
> #ifdef CONFIG_X86_32
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index ec28a88c6376..4151e61245a7 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -81,6 +81,38 @@ static void tdp_mmu_free_sp_rcu_callback(struct rcu_head *head)
> static void tdp_mmu_zap_root(struct kvm *kvm, struct kvm_mmu_page *root,
> bool shared);
>
> +static void tdp_mmu_zap_root_async(struct work_struct *work)
> +{
> + struct kvm_mmu_page *root = container_of(work, struct kvm_mmu_page,
> + tdp_mmu_async_work);
> + struct kvm *kvm = root->tdp_mmu_async_data;
> +
> + read_lock(&kvm->mmu_lock);
> +
> + /*
> + * A TLB flush is not necessary as KVM performs a local TLB flush when
> + * allocating a new root (see kvm_mmu_load()), and when migrating vCPU
> + * to a different pCPU. Note, the local TLB flush on reuse also
> + * invalidates any paging-structure-cache entries, i.e. TLB entries for
> + * intermediate paging structures, that may be zapped, as such entries
> + * are associated with the ASID on both VMX and SVM.
> + */
> + tdp_mmu_zap_root(kvm, root, true);
> +
> + /*
> + * Drop the refcount using kvm_tdp_mmu_put_root() to test its logic for
> + * avoiding an infinite loop. By design, the root is reachable while
> + * it's being asynchronously zapped, thus a different task can put its
> + * last reference, i.e. flowing through kvm_tdp_mmu_put_root() for an
> + * asynchronously zapped root is unavoidable.
> + */
> + kvm_tdp_mmu_put_root(kvm, root, true);
> +
> + read_unlock(&kvm->mmu_lock);
> +
> + kvm_put_kvm(kvm);
> +}
> +
> void kvm_tdp_mmu_put_root(struct kvm *kvm, struct kvm_mmu_page *root,
> bool shared)
> {
> @@ -142,15 +174,26 @@ void kvm_tdp_mmu_put_root(struct kvm *kvm, struct kvm_mmu_page *root,
> refcount_set(&root->tdp_mmu_root_count, 1);
>
> /*
> - * Zap the root, then put the refcount "acquired" above. Recursively
> - * call kvm_tdp_mmu_put_root() to test the above logic for avoiding an
> - * infinite loop by freeing invalid roots. By design, the root is
> - * reachable while it's being zapped, thus a different task can put its
> - * last reference, i.e. flowing through kvm_tdp_mmu_put_root() for a
> - * defunct root is unavoidable.
> + * Attempt to acquire a reference to KVM itself. If KVM is alive, then
> + * zap the root asynchronously in a worker, otherwise it must be zapped
> + * directly here. Wait to do this check until after the refcount is
> + * reset so that tdp_mmu_zap_root() can safely yield.
> + *
> + * In both flows, zap the root, then put the refcount "acquired" above.
> + * When putting the reference, use kvm_tdp_mmu_put_root() to test the
> + * above logic for avoiding an infinite loop by freeing invalid roots.
> + * By design, the root is reachable while it's being zapped, thus a
> + * different task can put its last reference, i.e. flowing through
> + * kvm_tdp_mmu_put_root() for a defunct root is unavoidable.
> */
> - tdp_mmu_zap_root(kvm, root, shared);
> - kvm_tdp_mmu_put_root(kvm, root, shared);
> + if (kvm_get_kvm_safe(kvm)) {
> + root->tdp_mmu_async_data = kvm;
> + INIT_WORK(&root->tdp_mmu_async_work, tdp_mmu_zap_root_async);
> + schedule_work(&root->tdp_mmu_async_work);
> + } else {
> + tdp_mmu_zap_root(kvm, root, shared);
> + kvm_tdp_mmu_put_root(kvm, root, shared);
> + }
> }
>
> enum tdp_mmu_roots_iter_type {
> @@ -954,7 +997,11 @@ void kvm_tdp_mmu_zap_all(struct kvm *kvm)
>
> /*
> * Zap all roots, including invalid roots, as all SPTEs must be dropped
> - * before returning to the caller.
> + * before returning to the caller. Zap directly even if the root is
> + * also being zapped by a worker. Walking zapped top-level SPTEs isn't
> + * all that expensive and mmu_lock is already held, which means the
> + * worker has yielded, i.e. flushing the work instead of zapping here
> + * isn't guaranteed to be any faster.
> *
> * A TLB flush is unnecessary, KVM zaps everything if and only the VM
> * is being destroyed or the userspace VMM has exited. In both cases,
> --
> 2.35.1.574.g5d30c73bfb-goog
>