Re: [PATCH 3/4] KVM: x86/mmu: Shrink pte_list_desc size when KVM is using TDP

From: Sean Christopherson
Date: Tue Jul 12 2022 - 18:53:59 EST


On Tue, Jul 12, 2022, Peter Xu wrote:
> On Fri, Jun 24, 2022 at 11:27:34PM +0000, Sean Christopherson wrote:
> > Dynamically size struct pte_list_desc's array of sptes based on whether
> > or not KVM is using TDP. Commit dc1cff969101 ("KVM: X86: MMU: Tune
> > PTE_LIST_EXT to be bigger") bumped the number of entries in order to
> > improve performance when using shadow paging, but its analysis that the
> > larger size would not affect TDP was wrong. Consuming pte_list_desc
> > objects for nested TDP is indeed rare, but _allocating_ objects is not,
> > as KVM allocates 40 objects for each per-vCPU cache. Reducing the size
> > from 128 bytes to 32 bytes reduces that per-vCPU cost from 5120 bytes to
> > 1280, and also provides similar savings when eager page splitting for
> > nested MMUs kicks in.
> >
> > The per-vCPU overhead could be further reduced by using a custom, smaller
> > capacity for the per-vCPU caches, but that's more of an "and" than
> > an "or" change, e.g. it wouldn't help the eager page split use case.
> >
> > Set the list size to the bare minimum without completely defeating the
> > purpose of an array (and because pte_list_add() assumes the array is at
> > least two entries deep). A larger size, e.g. 4, would reduce the number
> > of "allocations", but those "allocations" only become allocations in
> > truth if a single vCPU depletes its cache to where a topup is needed,
> > i.e. if a single vCPU "allocates" 30+ lists. Conversely, those 2 extra
> > entries consume 16 bytes * 40 * nr_vcpus in the caches the instant nested
> > TDP is used.
> >
> > In the unlikely event that performance of aliased gfns for nested TDP
> > really is (or becomes) a priority for oddball workloads, KVM could add a
> > knob to let the admin tune the array size for their environment.
> >
> > Note, KVM also unnecessarily tops up the per-vCPU caches even when not
> > using rmaps; this can also be addressed separately.
>
> The only possible way pte_list_desc gets used when tdp=1 is when the
> hypervisor tries to map the same host pages at different GPAs?

Yes, if by "host pages" you mean L1 GPAs. It happens if the L1 VMM maps multiple
L2 GFNs to a single L1 GFN, in which case KVM's nTDP shadow MMU needs to rmap
that single L1 GFN to every SPTE (one per aliased L2 GFN) that maps it.
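
To make that concrete, the bookkeeping in play is roughly the below (a very
rough sketch, not the actual mmu.c code; the layout and helper are
illustrative only):

#include <linux/errno.h>
#include <linux/slab.h>
#include <linux/types.h>

/* 14 entries in the current code; this patch shrinks it when using TDP. */
#define PTE_LIST_EXT	14

/*
 * Each L1 GFN that is mapped more than once gets a chain of descriptors;
 * every entry points at an SPTE (here, an nTDP page table entry) that maps
 * that GFN.
 */
struct pte_list_desc {
	struct pte_list_desc *more;	/* overflow to another descriptor */
	u64 spte_count;			/* entries used in this descriptor */
	u64 *sptes[PTE_LIST_EXT];
};

struct rmap_head {
	struct pte_list_desc *desc;	/* simplified: NULL or a desc chain */
};

/*
 * Record that @spte also maps the L1 GFN tracked by @head, i.e. that yet
 * another L2 GFN has been aliased to the same L1 GFN.  The real code packs
 * the single-SPTE case directly into the rmap head, and pulls descriptors
 * from the per-vCPU kvm_mmu_memory_cache (topped up before mmu_lock is
 * taken) instead of allocating here, which is why the caches get topped up
 * even though descriptors are rarely consumed for nTDP.
 */
static int rmap_add_spte(struct rmap_head *head, u64 *spte)
{
	struct pte_list_desc *desc = head->desc;

	if (!desc || desc->spte_count == PTE_LIST_EXT) {
		desc = kzalloc(sizeof(*desc), GFP_KERNEL_ACCOUNT);
		if (!desc)
			return -ENOMEM;
		desc->more = head->desc;
		head->desc = desc;
	}

	desc->sptes[desc->spte_count++] = spte;
	return 0;
}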

> And we don't really have a real use case of that, or.. do we?

QEMU does it during boot/pre-boot, when the BIOS remaps the flash region into
the lower 1MiB, i.e. aliases high GPAs to low GPAs.

> Sorry to start by asking questions, it's just that if we know
> pte_list_desc is probably not going to be used, could we simply skip the
> cache layer as a whole? IOW, instead of making the "array size of pte
> list desc" dynamic, we make the whole "pte list desc cache layer"
> dynamic. Is that possible?

Not really? It's theoretically possible, but it'd require pre-checking that
there are no aliases, and to do that race-free we'd have to do it under
mmu_lock, which means having to support bailing out of the page fault to top
up the cache. The memory overhead for the cache isn't significant enough to
justify that level of complexity.
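
For comparison, the direction of this patch is just to shrink the descriptor
when KVM is using TDP, roughly along the lines of the below (an illustrative
sketch with made-up names, not the literal diff):

#include <linux/types.h>

/*
 * The header is 16 bytes, so 14 entries gives 16 + 14 * 8 = 128 bytes per
 * object and 2 entries gives 16 + 2 * 8 = 32 bytes.  At 40 objects per
 * per-vCPU cache (KVM_ARCH_NR_OBJS_PER_MEMORY_CACHE), that's 5120 versus
 * 1280 bytes per vCPU.
 */
#define PTE_LIST_EXT_SHADOW	14	/* legacy shadow paging */
#define PTE_LIST_EXT_TDP	2	/* (nested) TDP: the bare minimum */

struct pte_list_desc {
	struct pte_list_desc *more;
	u32 spte_count;		/* entries used in this descriptor */
	u32 capacity;		/* entries available, fixed at alloc time */
	u64 *sptes[];		/* flexible array, 'capacity' entries */
};

static inline size_t pte_list_desc_size(bool tdp_enabled)
{
	u32 nr = tdp_enabled ? PTE_LIST_EXT_TDP : PTE_LIST_EXT_SHADOW;

	return sizeof(struct pte_list_desc) + nr * sizeof(u64 *);
}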