Re: [PATCH 00/22] Introduce the TDP MMU

From: Paolo Bonzini
Date: Fri Sep 25 2020 - 21:14:13 EST


On 25/09/20 23:22, Ben Gardon wrote:
> Over the years, the needs for KVM's x86 MMU have grown from running small
> guests to live migrating multi-terabyte VMs with hundreds of vCPUs. Where
> we previously depended on shadow paging to run all guests, we now have
> two-dimensional paging (TDP). This patch set introduces a new
> implementation of much of the KVM MMU, optimized for running guests with
> TDP. We have re-implemented many of the MMU functions to take advantage of
> the relative simplicity of TDP and eliminate the need for an rmap.
> Building on this simplified implementation, a future patch set will change
> the synchronization model for this "TDP MMU" to enable more parallelism
> than the monolithic MMU lock. A TDP MMU is currently in use at Google
> and has given us the performance necessary to live migrate our 416 vCPU,
> 12TiB m2-ultramem-416 VMs.
>
> This work was motivated by the need to handle page faults in parallel for
> very large VMs. When VMs have hundreds of vCPUs and terabytes of memory,
> KVM's MMU lock suffers extreme contention, resulting in soft-lockups and
> long latency on guest page faults. This contention can be easily seen
> running the KVM selftests demand_paging_test with a couple hundred vCPUs.
> Over a 1-second profile of the demand_paging_test, with 416 vCPUs and 4G
> per vCPU, 98% of the time was spent waiting for the MMU lock. At Google,
> the TDP MMU reduced the test duration by 89% and the execution was
> dominated by get_user_pages and the user fault FD ioctl instead of the
> MMU lock.
>
> This series is the first of two. In this series we add a basic
> implementation of the TDP MMU. In the next series we will improve the
> performance of the TDP MMU and allow it to execute MMU operations
> in parallel.
>
> The overall purpose of the KVM MMU is to program paging structures
> (CR3/EPT/NPT) to encode the mapping of guest addresses to host physical
> addresses (HPA), and to provide utilities for other KVM features, for
> example dirty logging. The definition of the L1 guest physical address
> (GPA) to HPA mapping comes in two parts: KVM's memslots map GPA to HVA,
> and the kernel MM/x86 host page tables map HVA -> HPA. Without TDP, the
> MMU must program the x86 page tables to encode the full translation of
> guest virtual addresses (GVA) to HPA. This requires "shadowing" the
> guest's page tables to create a composite x86 paging structure. This
> solution is complicated, requires separate paging structures for each
> guest CR3, and requires emulating guest page table changes. The TDP case
> is much simpler. In this case, KVM lets the guest control CR3 and programs
> the EPT/NPT paging structures with the GPA -> HPA mapping. The guest has
> no way to change this mapping and only one version of the paging structure
> is needed per L1 paging mode. In this case the paging mode is some
> combination of the number of levels in the paging structure, the address
> space (normal execution or system management mode, on x86), and other
> attributes. Most VMs only ever use 1 paging mode and so only ever need one
> TDP structure.
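
A minimal sketch of the first half of the translation described above (memslots mapping GPA to HVA), written as standalone C rather than kernel code. The `struct memslot` here is a simplified stand-in that borrows the field names `base_gfn`, `npages` and `userspace_addr`; `gpa_to_hva()` and the use of 0 as a "no slot" value are assumptions made for illustration only. The second half (HVA -> HPA) is done by the host page tables and is not modeled.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_SHIFT 12 /* 4KiB pages */

/* Simplified, hypothetical memslot: a contiguous guest physical range
 * backed by a contiguous range of host virtual memory. */
struct memslot {
    uint64_t base_gfn;       /* first guest frame number in the slot */
    uint64_t npages;         /* number of pages covered by the slot */
    uint64_t userspace_addr; /* HVA that backs base_gfn */
};

/* Translate a GPA to an HVA by linear search over the slots.
 * Returns 0 when no slot covers the address (illustrative sentinel). */
uint64_t gpa_to_hva(const struct memslot *slots, size_t nslots, uint64_t gpa)
{
    uint64_t gfn = gpa >> PAGE_SHIFT;
    uint64_t offset = gpa & ((1ULL << PAGE_SHIFT) - 1);

    assert(slots != NULL || nslots == 0);

    for (size_t i = 0; i < nslots; i++) {
        const struct memslot *s = &slots[i];

        if (gfn >= s->base_gfn && gfn - s->base_gfn < s->npages)
            return s->userspace_addr +
                   ((gfn - s->base_gfn) << PAGE_SHIFT) + offset;
    }
    return 0;
}
```

With TDP, the result of composing this lookup with the host page tables is what gets programmed into EPT/NPT; without TDP, the same GPA -> HPA step has to be folded into the shadowed GVA -> HPA tables.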
>
> This series implements a "TDP MMU" through alternative implementations of
> MMU functions for running L1 guests with TDP. The TDP MMU falls back to
> the existing shadow paging implementation when TDP is not available, and
> interoperates with the existing shadow paging implementation for nesting.
> The use of the TDP MMU can be controlled by a module parameter which is
> snapshotted at VM creation and follows the life of the VM. This snapshot
> is used in many functions to decide whether or not to use TDP MMU handlers
> for a given operation.
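
The snapshot behaviour described above can be sketched roughly as follows. This is a standalone model, not the code from the series: `tdp_mmu_param`, `struct vm`, `vm_init()` and `vm_fault_handler()` are hypothetical names chosen for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Stand-in for a module parameter that can be changed at runtime
 * (hypothetical name). */
bool tdp_mmu_param = true;

enum mmu_handler {
    HANDLER_LEGACY_MMU, /* existing shadow paging implementation */
    HANDLER_TDP_MMU,    /* new TDP MMU handlers */
};

struct vm {
    bool tdp_mmu_enabled; /* snapshot taken once, at creation */
};

/* Take the snapshot: the TDP MMU is used only if the hardware supports
 * TDP and the parameter was set when the VM was created. */
void vm_init(struct vm *vm, bool tdp_supported)
{
    assert(vm != NULL);
    vm->tdp_mmu_enabled = tdp_supported && tdp_mmu_param;
}

/* Later operations consult the per-VM snapshot, never the parameter,
 * so flipping the parameter does not affect already-running VMs. */
enum mmu_handler vm_fault_handler(const struct vm *vm)
{
    return vm->tdp_mmu_enabled ? HANDLER_TDP_MMU : HANDLER_LEGACY_MMU;
}
```

Reading the parameter once per VM means a single VM never mixes the two sets of handlers for its direct roots, even if the parameter is toggled while the VM is running.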
>
> This series can also be viewed in Gerrit here:
> https://linux-review.googlesource.com/c/virt/kvm/kvm/+/2538
> (Thanks to Dmitry Vyukov <dvyukov@xxxxxxxxxx> for setting up the
> Gerrit instance)
>
> Ben Gardon (22):
>   kvm: mmu: Separate making SPTEs from set_spte
>   kvm: mmu: Introduce tdp_iter
>   kvm: mmu: Init / Uninit the TDP MMU
>   kvm: mmu: Allocate and free TDP MMU roots
>   kvm: mmu: Add functions to handle changed TDP SPTEs
>   kvm: mmu: Make address space ID a property of memslots
>   kvm: mmu: Support zapping SPTEs in the TDP MMU
>   kvm: mmu: Separate making non-leaf sptes from link_shadow_page
>   kvm: mmu: Remove disallowed_hugepage_adjust shadow_walk_iterator arg
>   kvm: mmu: Add TDP MMU PF handler
>   kvm: mmu: Factor out allocating a new tdp_mmu_page
>   kvm: mmu: Allocate struct kvm_mmu_pages for all pages in TDP MMU
>   kvm: mmu: Support invalidate range MMU notifier for TDP MMU
>   kvm: mmu: Add access tracking for tdp_mmu
>   kvm: mmu: Support changed pte notifier in tdp MMU
>   kvm: mmu: Add dirty logging handler for changed sptes
>   kvm: mmu: Support dirty logging for the TDP MMU
>   kvm: mmu: Support disabling dirty logging for the tdp MMU
>   kvm: mmu: Support write protection for nesting in tdp MMU
>   kvm: mmu: NX largepage recovery for TDP MMU
>   kvm: mmu: Support MMIO in the TDP MMU
>   kvm: mmu: Don't clear write flooding count for direct roots
>
>  arch/x86/include/asm/kvm_host.h |   17 +
>  arch/x86/kvm/Makefile           |    3 +-
>  arch/x86/kvm/mmu/mmu.c          |  437 ++++++----
>  arch/x86/kvm/mmu/mmu_internal.h |   98 +++
>  arch/x86/kvm/mmu/paging_tmpl.h  |    3 +-
>  arch/x86/kvm/mmu/tdp_iter.c     |  198 +++++
>  arch/x86/kvm/mmu/tdp_iter.h     |   55 ++
>  arch/x86/kvm/mmu/tdp_mmu.c      | 1315 +++++++++++++++++++++++++++++++
>  arch/x86/kvm/mmu/tdp_mmu.h      |   52 ++
>  include/linux/kvm_host.h        |    2 +
>  virt/kvm/kvm_main.c             |    7 +-
>  11 files changed, 2022 insertions(+), 165 deletions(-)
>  create mode 100644 arch/x86/kvm/mmu/tdp_iter.c
>  create mode 100644 arch/x86/kvm/mmu/tdp_iter.h
>  create mode 100644 arch/x86/kvm/mmu/tdp_mmu.c
>  create mode 100644 arch/x86/kvm/mmu/tdp_mmu.h
>

Ok, I've not finished reading the code but I already have an idea of
what it's like. I really think we should fast-track this as the basis
for more 5.11 work. I'll finish reviewing it and, if you don't mind, I
might make some of the changes myself so I have the chance to play with
and get accustomed to the code; speak up if you disagree with them though!
Another thing I'd like to add is a few tracepoints.

Paolo