[PATCH 0/8] KVM: x86/mmu: Allow TDP MMU (un)load to run in parallel

From: Sean Christopherson
Date: Wed Jan 10 2024 - 21:01:09 EST


This series is the result of digging into why deleting a memslot, which on
x86 forces all vCPUs to reload a new MMU root, causes noticeably more jitter
in vCPUs and other tasks when running with the TDP MMU than the Shadow MMU
(with TDP enabled).

Patch 1 addresses the most obvious issue by simply zapping at a finer
granularity so that if a different task, e.g. a vCPU, wants to run on the
pCPU doing the zapping, it doesn't have to wait for KVM to zap an entire
1GiB region, which can take hundreds of microseconds (or more). The
shadow MMU checks for need_resched() (and mmu_lock contention, see below)
every 10 zaps, which is why the shadow MMU doesn't induce the same level
of jitter.
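
To make the granularity trade-off concrete, here is a rough userspace
model, not KVM code. All names (zap_stats, zap_range_model) are
hypothetical, and it flattens the page-table walk into a plain count of
4KiB pages. It only shows how the chunk size bounds the amount of work
done between reschedule checks.

```c
#include <stddef.h>

/* Hypothetical userspace model; none of these names are KVM functions. */
struct zap_stats {
	size_t resched_checks;     /* how often the loop looked for a waiter */
	size_t max_pages_between;  /* longest run of 4KiB pages zapped without a check */
};

/*
 * "Zap" total_pages 4KiB pages, checking for a pending reschedule after
 * every chunk_pages pages. A larger chunk means fewer checks, but a longer
 * worst-case wait for whatever else wants to run on this CPU.
 */
static struct zap_stats zap_range_model(size_t total_pages, size_t chunk_pages)
{
	struct zap_stats stats = { 0, 0 };
	size_t done = 0;

	if (chunk_pages == 0)
		chunk_pages = 1;

	while (done < total_pages) {
		size_t n = total_pages - done;

		if (n > chunk_pages)
			n = chunk_pages;

		done += n;               /* stand-in for the actual zap work */
		stats.resched_checks++;  /* stand-in for a need_resched() check */
		if (n > stats.max_pages_between)
			stats.max_pages_between = n;
	}

	return stats;
}
```

With 262144 pages (1GiB of 4KiB pages), a chunk size of 262144 gives a
single check after the whole region, while a chunk size of 1 gives a check
after every page.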

On preemptible kernels, zapping at 4KiB granularity will also cause the
zapping task to yield mmu_lock much more aggressively if a writer comes
along. That _sounds_ like a good thing, and most of the time it is, but
sometimes bouncing mmu_lock can be a big net negative:
https://lore.kernel.org/all/20240110012045.505046-1-seanjc@xxxxxxxxxx

While trying to figure out whether or not frequently yielding mmu_lock
would be a negative or positive, I ran into extremely high latencies for
loading TDP MMU roots on VMs with large-ish numbers of vCPUs, e.g. a vCPU
could end up taking more than a second to load a root.

Long story short, the issue is that the TDP MMU acquires mmu_lock for
write when unloading roots, and again when loading a "new" root (in quotes
because most vCPUs end up loading an existing root). With a decent number
of vCPUs, that results in a _lot_ of mmu_lock contention, as every vCPU will
take and release mmu_lock for write to unload its roots, and then again to
load a new root. Due to rwlock's fairness (waiting writers block new
readers), the contention can result in rather nasty worst case scenarios.
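
The fairness rule can be illustrated with a tiny single-threaded model of
the admission logic. This is only a sketch of "waiting writers block new
readers", not the kernel's rwlock implementation, and every name in it is
hypothetical.

```c
#include <stdbool.h>

/* Hypothetical single-threaded model of a writer-fair rwlock's admission rules. */
struct fair_rwlock_model {
	int active_readers;
	int waiting_writers;
	bool writer_active;
};

/* A new reader is admitted only if no writer holds the lock or is queued for it. */
static bool model_read_trylock(struct fair_rwlock_model *l)
{
	if (l->writer_active || l->waiting_writers > 0)
		return false;
	l->active_readers++;
	return true;
}

/* A writer takes the lock if it is idle; otherwise it queues behind the holders. */
static bool model_write_lock_or_queue(struct fair_rwlock_model *l)
{
	if (!l->writer_active && l->active_readers == 0) {
		l->writer_active = true;
		return true;
	}
	l->waiting_writers++;
	return false;
}

/* When the last reader leaves, the lock is handed to a queued writer, if any. */
static void model_read_unlock(struct fair_rwlock_model *l)
{
	l->active_readers--;
	if (l->active_readers == 0 && l->waiting_writers > 0) {
		l->waiting_writers--;
		l->writer_active = true;
	}
}
```

In this model, one queued writer is enough to turn away every later reader
until the writer has run, which is how a stream of short write-side
critical sections can stall readers.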

Patches 6-8 fix the issues by taking mmu_lock for read. The free path is
very straightforward and doesn't require any new protection (IIRC, the only
reason we didn't pursue this when reworking the TDP MMU zapping back at the
end of 2021 was because we had bigger issues to solve). Allocating a new
root with mmu_lock held for read is a little harder, but still fairly easy.
KVM only needs to ensure that it doesn't create duplicate roots, because
everything that needs mmu_lock to ensure ordering must take mmu_lock for
write, i.e. is still mutually exclusive with new roots coming along.

Patches 2-5 are small cleanups to avoid doing work for invalid roots, e.g.
when zapping SPTEs purely to affect guest behavior, there's no need to zap
invalid roots because they are unreachable from the guest.
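
As a hedged illustration of skipping invalid roots, and of using a negative
address space ID to mean "all", the filter could look roughly like this.
The types and the function are hypothetical stand-ins, not the TDP MMU
iterators.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a root; nr_sptes models how much there is to zap. */
struct sketch_root {
	int as_id;
	bool invalid;
	size_t nr_sptes;
};

/*
 * Zap SPTEs to affect guest behavior. Invalid roots are unreachable from
 * the guest, so they are skipped. as_id < 0 means "all address spaces".
 * Returns the number of SPTEs zapped.
 */
static size_t sketch_zap_for_guest(struct sketch_root *roots, size_t nr_roots,
				   int as_id)
{
	size_t zapped = 0;
	size_t i;

	for (i = 0; i < nr_roots; i++) {
		if (as_id >= 0 && roots[i].as_id != as_id)
			continue;
		if (roots[i].invalid)
			continue;

		zapped += roots[i].nr_sptes;
		roots[i].nr_sptes = 0;
	}

	return zapped;
}
```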

All told, this significantly reduces mmu_lock contention when doing a fast
zap, i.e. when deleting memslots, and takes the worst case latency for a
vCPU to load a new root from >3ms to <100us for large-ish VMs (100+ vCPUs).
For small and medium sized VMs (<24 vCPUs), the vast majority of loads
take less than 1us, with the worst case being <10us, versus >200us without
this series.

Note, I did all of the latency testing before the holidays, and then
managed to lose almost all of my notes, which is why I don't have more
precise data on the exact setups and latency bins. /facepalm

Sean Christopherson (8):
  KVM: x86/mmu: Zap invalidated TDP MMU roots at 4KiB granularity
  KVM: x86/mmu: Don't do TLB flush when zapping SPTEs in invalid roots
  KVM: x86/mmu: Allow passing '-1' for "all" as_id for TDP MMU iterators
  KVM: x86/mmu: Skip invalid roots when zapping leaf SPTEs for GFN range
  KVM: x86/mmu: Skip invalid TDP MMU roots when write-protecting SPTEs
  KVM: x86/mmu: Check for usable TDP MMU root while holding mmu_lock for
    read
  KVM: x86/mmu: Alloc TDP MMU roots while holding mmu_lock for read
  KVM: x86/mmu: Free TDP MMU roots while holding mmu_lock for read

 arch/x86/kvm/mmu/mmu.c     |  33 +++++++---
 arch/x86/kvm/mmu/tdp_mmu.c | 124 ++++++++++++++++++++++++++-----------
 arch/x86/kvm/mmu/tdp_mmu.h |   2 +-
 3 files changed, 111 insertions(+), 48 deletions(-)


base-commit: 1c6d984f523f67ecfad1083bb04c55d91977bb15
--
2.43.0.275.g3460e3d667-goog