[PATCH 00/12] KVM: MMU: do not unload MMU roots on all role changes

From: Paolo Bonzini
Date: Wed Feb 09 2022 - 12:00:40 EST


The TDP MMU has a performance regression compared to the legacy MMU
when CR0 changes often. This was reported for the grsecurity kernel,
which uses CR0.WP to implement kernel W^X. In that case, each change to
CR0.WP unloads the MMU and causes a lot of unnecessary work. When running
nested, this can even cause the L1 to hardly make progress, as the L0
hypervisor is overwhelmed by the amount of MMU work that is needed.

Initially, my plan for this was to pull kvm_mmu_unload from
kvm_mmu_reset_context into kvm_init_mmu. Therefore I started by separating
the CPU setup (CR0/CR4/EFER, SMM, guest mode, etc.) from the shadow
page table format. Right now the "MMU role" is a messy mix of the two
and, whenever something is different between the MMU and the CPU, it is
stored as an extra field in struct kvm_mmu; for extra bonus complication,
sometimes the same thing is stored in both the role and an extra field.
The aim was to keep kvm_mmu_unload only if the MMU role changed, and
drop it if the CPU role changed.
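
As an illustration of that initial plan only, here is a minimal standalone
sketch in C. It is a toy model, not kernel code: every toy_* name and field
below is hypothetical, and it assumes a TDP-like setup where CR0.WP is part
of the CPU role but does not affect the page table format. The point is just
that roots are dropped when the page table format changes, and kept when only
the CPU state changes.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical toy types; the kernel's real role structures differ. */
struct toy_cpu_role {
	bool cr0_pg, cr0_wp, cr4_pae, efer_lma, smm, guest_mode;
};

struct toy_mmu_role {
	uint8_t root_level;	/* depth of the page table hierarchy */
	bool direct;		/* TDP-style direct map vs. shadowing */
};

struct toy_mmu {
	struct toy_cpu_role cpu_role;
	struct toy_mmu_role mmu_role;
	bool roots_loaded;
	unsigned int unloads;	/* counts the expensive operation */
};

static bool toy_mmu_role_equal(const struct toy_mmu_role *a,
			       const struct toy_mmu_role *b)
{
	/* Compare field by field; memcmp() could trip over padding. */
	return a->root_level == b->root_level && a->direct == b->direct;
}

static void toy_load(struct toy_mmu *mmu)
{
	mmu->roots_loaded = true;
}

static void toy_unload(struct toy_mmu *mmu)
{
	if (mmu->roots_loaded) {
		mmu->roots_loaded = false;
		mmu->unloads++;
	}
}

/* Unload only if the page table format (toy "MMU role") changed. */
static void toy_init_mmu(struct toy_mmu *mmu,
			 const struct toy_cpu_role *cpu,
			 const struct toy_mmu_role *role)
{
	assert(mmu && cpu && role);

	if (!toy_mmu_role_equal(&mmu->mmu_role, role))
		toy_unload(mmu);

	mmu->cpu_role = *cpu;
	mmu->mmu_role = *role;
}
```

In this model, toggling cr0_wp and calling toy_init_mmu() again leaves
roots_loaded set, while changing root_level drops the roots.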

I even posted that cleanup, but it occurred to me later that even
a conditional kvm_mmu_unload in kvm_init_mmu would be overkill.
kvm_mmu_unload is only needed in the rare cases where a TLB flush is
needed (e.g. CR0.PG changing from 1 to 0) or where the guest page table
interpretation changes in a way not captured by the role (that is, CPUID
changes). But the implementation of fast PGD switching is subtle
and requires a call to kvm_mmu_new_pgd (and therefore knowing the
new MMU role) before kvm_init_mmu, therefore kvm_mmu_reset_context
chickens out and drops all the roots.

Therefore, the meat of this series is a reorganization of fast PGD
switching; it makes it possible to call kvm_mmu_new_pgd *after*
the MMU has been set up, just using the MMU role instead of
kvm_mmu_calc_root_page_role.
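
To make the idea of fast PGD switching concrete, here is a rough standalone
sketch in C of a root cache keyed by (PGD, role). It is a toy model under
stated assumptions, not the code in this series: all toy_* names are
hypothetical, the cache size of 3 is arbitrary, and the role is stored
directly next to each root for simplicity. A lookup takes the role the MMU
is already configured with, so it can run after the MMU has been set up.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TOY_NUM_PREV_ROOTS	3		/* arbitrary for this sketch */
#define TOY_INVALID		(~(uint64_t)0)

/* Hypothetical cached root: guest PGD, host root address, role word. */
struct toy_root {
	uint64_t pgd;
	uint64_t hpa;
	uint32_t role;
};

struct toy_root_cache {
	struct toy_root root;				/* current root */
	struct toy_root prev[TOY_NUM_PREV_ROOTS];	/* recently used */
};

static void toy_cache_init(struct toy_root_cache *c)
{
	int i;

	c->root = (struct toy_root){ 0, TOY_INVALID, 0 };
	for (i = 0; i < TOY_NUM_PREV_ROOTS; i++)
		c->prev[i] = c->root;
}

static bool toy_root_matches(const struct toy_root *r, uint64_t pgd,
			     uint32_t role)
{
	return r->hpa != TOY_INVALID && r->pgd == pgd && r->role == role;
}

/* Install a freshly built root after a cache miss. */
static void toy_install_root(struct toy_root_cache *c, uint64_t pgd,
			     uint64_t hpa, uint32_t role)
{
	assert(hpa != TOY_INVALID);
	c->root = (struct toy_root){ pgd, hpa, role };
}

/*
 * Switch to @pgd using the role the MMU already has.  Returns true if a
 * usable root is now in c->root; false if the caller must build one.
 */
static bool toy_new_pgd(struct toy_root_cache *c, uint64_t pgd, uint32_t role)
{
	struct toy_root tmp;
	int i;

	if (toy_root_matches(&c->root, pgd, role))
		return true;

	for (i = 0; i < TOY_NUM_PREV_ROOTS; i++) {
		if (toy_root_matches(&c->prev[i], pgd, role)) {
			/* Hit: swap the cached root with the current one. */
			tmp = c->root;
			c->root = c->prev[i];
			c->prev[i] = tmp;
			return true;
		}
	}

	/* Miss: keep the current root as most recent, evict the oldest. */
	for (i = TOY_NUM_PREV_ROOTS - 1; i > 0; i--)
		c->prev[i] = c->prev[i - 1];
	c->prev[0] = c->root;
	c->root.hpa = TOY_INVALID;
	return false;
}
```

In this model a role mismatch is simply a cache miss, so a root built for a
different page table format is never reused for the same PGD.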

Patches 1 to 3 are bugfixes found while working on the series.

Patches 4 and 5 add more sanity checks that triggered a lot during
development.

Patches 6 and 7 are related cleanups. In particular patch 7 makes
the cache lookup code a bit more pleasant.

Patches 8 and 9 rework the fast PGD switching. Patches 10 and
11 are cleanups enabled by the rework, and the only survivors
of the CPU role patchset.

Finally, patch 12 optimizes kvm_mmu_reset_context.

Paolo


Paolo Bonzini (12):
  KVM: x86: host-initiated EFER.LME write affects the MMU
  KVM: MMU: move MMU role accessors to header
  KVM: x86: do not deliver asynchronous page faults if CR0.PG=0
  KVM: MMU: WARN if PAE roots linger after kvm_mmu_unload
  KVM: MMU: avoid NULL-pointer dereference on page freeing bugs
  KVM: MMU: rename kvm_mmu_reload
  KVM: x86: use struct kvm_mmu_root_info for mmu->root
  KVM: MMU: do not consult levels when freeing roots
  KVM: MMU: look for a cached PGD when going from 32-bit to 64-bit
  KVM: MMU: load new PGD after the shadow MMU is initialized
  KVM: MMU: remove kvm_mmu_calc_root_page_role
  KVM: x86: do not unload MMU roots on all role changes

arch/x86/include/asm/kvm_host.h | 3 +-
arch/x86/kvm/mmu.h | 28 +++-
arch/x86/kvm/mmu/mmu.c | 253 ++++++++++++++++----------------
arch/x86/kvm/mmu/mmu_audit.c | 4 +-
arch/x86/kvm/mmu/paging_tmpl.h | 2 +-
arch/x86/kvm/mmu/tdp_mmu.c | 2 +-
arch/x86/kvm/mmu/tdp_mmu.h | 2 +-
arch/x86/kvm/svm/nested.c | 6 +-
arch/x86/kvm/vmx/nested.c | 8 +-
arch/x86/kvm/vmx/vmx.c | 2 +-
arch/x86/kvm/x86.c | 39 +++--
11 files changed, 190 insertions(+), 159 deletions(-)

--
2.31.1