Re: [Patch v3 5/9] KVM: x86/mmu: Allocate TDP page table's page on correct NUMA node on split
From: Vipin Sharma
Date: Wed Dec 28 2022 - 17:08:41 EST
On Tue, Dec 27, 2022 at 11:02 AM Ben Gardon <bgardon@xxxxxxxxxx> wrote:
>
> On Wed, Dec 21, 2022 at 6:35 PM Vipin Sharma <vipinsh@xxxxxxxxxx> wrote:
> >
> > When dirty log is enabled, huge pages are split. Page table's pages
>
> Nit: Suggest "When huge pages are split for dirty log" since this can
> happen at various points during dirty logging.
> Same below.
>
Yeah, this should be updated.
> > during the split are allocated based on the current thread NUMA node or
> > mempolicy. This causes inefficient page table accesses if underlying
> > page is on a different NUMA node
> >
> > Allocate page table's pages on the same NUMA node as the underlying huge
> > page when dirty log is enabled and huge pages are split.
> >
> > The performance gain during the pre-copy phase of live migrations of a
> > 416 vCPUs and 11 TiB memory VM on a 8 node host was seen in the range
> > of 130% to 150%.
> >
> > Suggested-by: David Matlack <dmatlack@xxxxxxxxxx>
> > Signed-off-by: Vipin Sharma <vipinsh@xxxxxxxxxx>
> > ---
> > arch/x86/kvm/mmu/tdp_mmu.c | 12 ++++++++----
> > include/linux/kvm_host.h | 18 ++++++++++++++++++
> > 2 files changed, 26 insertions(+), 4 deletions(-)
> >
> > diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> > index 4974fa96deff..376b8dceb3f9 100644
> > --- a/arch/x86/kvm/mmu/tdp_mmu.c
> > +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> > @@ -1403,7 +1403,7 @@ bool kvm_tdp_mmu_wrprot_slot(struct kvm *kvm,
> > return spte_set;
> > }
> >
> > -static struct kvm_mmu_page *__tdp_mmu_alloc_sp_for_split(gfp_t gfp)
> > +static struct kvm_mmu_page *__tdp_mmu_alloc_sp_for_split(int nid, gfp_t gfp)
> > {
> > struct kvm_mmu_page *sp;
> >
> > @@ -1413,7 +1413,8 @@ static struct kvm_mmu_page *__tdp_mmu_alloc_sp_for_split(gfp_t gfp)
> > if (!sp)
> > return NULL;
> >
> > - sp->spt = (void *)__get_free_page(gfp);
> > + sp->spt = kvm_mmu_get_free_page(nid, gfp);
> > +
>
> Just so that kvm_mmu_get_free_page isn't dead code in the previous
> commit, I'd do this refactor there and just pass NUMA_NO_NODE here.
>
Agreed.
> > if (!sp->spt) {
> > kmem_cache_free(mmu_page_header_cache, sp);
> > return NULL;
> > @@ -1427,6 +1428,9 @@ static struct kvm_mmu_page *tdp_mmu_alloc_sp_for_split(struct kvm *kvm,
> > bool shared)
> > {
> > struct kvm_mmu_page *sp;
> > + int nid;
> > +
> > + nid = kvm_pfn_to_page_table_nid(spte_to_pfn(iter->old_spte));
> >
> > /*
> > * Since we are allocating while under the MMU lock we have to be
> > @@ -1437,7 +1441,7 @@ static struct kvm_mmu_page *tdp_mmu_alloc_sp_for_split(struct kvm *kvm,
> > * If this allocation fails we drop the lock and retry with reclaim
> > * allowed.
> > */
> > - sp = __tdp_mmu_alloc_sp_for_split(GFP_NOWAIT | __GFP_ACCOUNT);
> > + sp = __tdp_mmu_alloc_sp_for_split(nid, GFP_NOWAIT | __GFP_ACCOUNT);
> > if (sp)
> > return sp;
> >
> > @@ -1449,7 +1453,7 @@ static struct kvm_mmu_page *tdp_mmu_alloc_sp_for_split(struct kvm *kvm,
> > write_unlock(&kvm->mmu_lock);
> >
> > iter->yielded = true;
> > - sp = __tdp_mmu_alloc_sp_for_split(GFP_KERNEL_ACCOUNT);
> > + sp = __tdp_mmu_alloc_sp_for_split(nid, GFP_KERNEL_ACCOUNT);
> >
> > if (shared)
> > read_lock(&kvm->mmu_lock);
> > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> > index d48064503b88..a262e15ebd19 100644
> > --- a/include/linux/kvm_host.h
> > +++ b/include/linux/kvm_host.h
> > @@ -1583,6 +1583,24 @@ void kvm_arch_sync_events(struct kvm *kvm);
> > int kvm_cpu_has_pending_timer(struct kvm_vcpu *vcpu);
> >
> > struct page *kvm_pfn_to_refcounted_page(kvm_pfn_t pfn);
> > +
> > +/*
> > + * Tells the appropriate NUMA node location of the page table's page based on
> > + * pfn it will point to.
> > + *
> > + * Return the nid of the page if pfn is valid and backed by a refcounted page,
> > + * otherwise, return the nearest memory node for the current CPU.
>
> Nit: Should this be "current thread"?
I will say "current thread CPU". As memory nodes are near to CPUs
whereas threads can execute on multiple CPUs throughout its lifetime.
>
> > + */
> > +static inline int kvm_pfn_to_page_table_nid(kvm_pfn_t pfn)
>
> This could just be kvm_pfn_nid (or even better kvm_pfn_node_id) since
> this really has nothing to do with page tables. We just want to know
> which NUMA node backs the given PFN.
Apart from NUMA node backing the given PFN, it can also return the
nearest NUMA node via numa_mem_id(). So, it is actually telling which
NUMA node is the best one for the page table's page, given a PFN.
>
> > +{
> > + struct page *page = kvm_pfn_to_refcounted_page(pfn);
> > +
> > + if (page)
> > + return page_to_nid(page);
> > + else
> > + return numa_mem_id();
> > +}
> > +
> > bool kvm_is_zone_device_page(struct page *page);
> >
> > struct kvm_irq_ack_notifier {
> > --
> > 2.39.0.314.g84b9a713c41-goog
> >