Re: [Patch v3 7/9] KVM: x86/mmu: Allocate page table's pages on NUMA node of the underlying pages

From: Ben Gardon
Date: Thu Dec 29 2022 - 13:21:00 EST


On Wed, Dec 28, 2022 at 2:08 PM Vipin Sharma <vipinsh@xxxxxxxxxx> wrote:
>
> On Tue, Dec 27, 2022 at 11:34 AM Ben Gardon <bgardon@xxxxxxxxxx> wrote:
> >
> > On Wed, Dec 21, 2022 at 6:35 PM Vipin Sharma <vipinsh@xxxxxxxxxx> wrote:
> > >
> > > Page table pages of a VM are currently allocated based on the current
> > > task's NUMA node or its mempolicy. This can cause suboptimal remote
> > > accesses by the vCPU if it is accessing physical pages local to its NUMA
> > > node but the page table pages mapping those physical pages were created
> > > by some other vCPU which was on a different NUMA node or had a different
> > > policy.
> > >
> > > Allocate page table pages on the same NUMA node where the underlying
> > > physical page exists. Page tables at levels 5, 4, and 3 might not end up
> > > on the same NUMA node as they can span multiple NUMA nodes.
> >
> > A page table at any level could map memory spanning multiple NUMA
> > nodes; it just becomes more likely at higher levels.
> > We're only guaranteed that a page table maps memory all on the same
> > node if it's a split hugepage.
>
> Even in this case, it is a best effort.
>
> > This change can only guarantee that the page table pages are allocated
> > on the same node as at least some of the memory they map.
> > Of course in practice, the above is absolutely correct since we'd
> > expect to have multi-GB contiguous ranges of GFNs allocated on the
> > same node via huge pages.
> >
> > And since the root pages are allocated based only on where the thread
> > allocating them is running, they're not actually guaranteed to be on
> > the same node as any of the memory they map. (Though they probably
> > will be.)
> >
>
> I will add more details in the commit message in the next version.
>
> > >
> > > Signed-off-by: Vipin Sharma <vipinsh@xxxxxxxxxx>
> > > ---
> > > arch/x86/include/asm/kvm_host.h | 2 +-
> > > arch/x86/kvm/mmu/mmu.c | 63 ++++++++++++++++++++++-----------
> > > arch/x86/kvm/mmu/paging_tmpl.h | 4 +--
> > > arch/x86/kvm/mmu/tdp_mmu.c | 11 +++---
> > > virt/kvm/kvm_main.c | 2 +-
> > > 5 files changed, 53 insertions(+), 29 deletions(-)
> > >
> > > diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> > > index 293994fabae3..b1f319ad6f89 100644
> > > --- a/arch/x86/include/asm/kvm_host.h
> > > +++ b/arch/x86/include/asm/kvm_host.h
> > > @@ -782,7 +782,7 @@ struct kvm_vcpu_arch {
> > > struct kvm_mmu *walk_mmu;
> > >
> > > struct kvm_mmu_memory_cache mmu_pte_list_desc_cache;
> > > - struct kvm_mmu_memory_cache mmu_shadow_page_cache;
> > > + struct kvm_mmu_memory_cache mmu_shadow_page_cache[MAX_NUMNODES];
> > > struct kvm_mmu_memory_cache mmu_shadowed_info_cache;
> > > struct kvm_mmu_memory_cache mmu_page_header_cache;
> > >
> > > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > > index 23a3b82b2384..511c6ef265ee 100644
> > > --- a/arch/x86/kvm/mmu/mmu.c
> > > +++ b/arch/x86/kvm/mmu/mmu.c
> > > @@ -677,24 +677,29 @@ static int mmu_topup_sp_memory_cache(struct kvm_mmu_memory_cache *cache,
> > >
> > > static int mmu_topup_memory_caches(struct kvm_vcpu *vcpu, bool maybe_indirect)
> > > {
> > > - int r;
> > > + int r, nid;
> > >
> > > /* 1 rmap, 1 parent PTE per level, and the prefetched rmaps. */
> > > r = kvm_mmu_topup_memory_cache(&vcpu->arch.mmu_pte_list_desc_cache,
> > > 1 + PT64_ROOT_MAX_LEVEL + PTE_PREFETCH_NUM);
> > > if (r)
> > > return r;
> > > - r = mmu_topup_sp_memory_cache(&vcpu->arch.mmu_shadow_page_cache,
> > > - &vcpu->arch.mmu_shadow_page_cache_lock,
> > > - PT64_ROOT_MAX_LEVEL);
> > > - if (r)
> > > - return r;
> > > +
> > > + for_each_online_node(nid) {
> > > + r = mmu_topup_sp_memory_cache(&vcpu->arch.mmu_shadow_page_cache[nid],
> > > + &vcpu->arch.mmu_shadow_page_cache_lock,
> > > + PT64_ROOT_MAX_LEVEL);
> > > + if (r)
> > > + return r;
> > > + }
> > > +
> > > if (maybe_indirect) {
> > > r = kvm_mmu_topup_memory_cache(&vcpu->arch.mmu_shadowed_info_cache,
> > > PT64_ROOT_MAX_LEVEL);
> > > if (r)
> > > return r;
> > > }
> > > +
> > > return kvm_mmu_topup_memory_cache(&vcpu->arch.mmu_page_header_cache,
> > > PT64_ROOT_MAX_LEVEL);
> > > }
> > > @@ -715,9 +720,14 @@ static void mmu_free_sp_memory_cache(struct kvm_mmu_memory_cache *cache,
> > >
> > > static void mmu_free_memory_caches(struct kvm_vcpu *vcpu)
> > > {
> > > + int nid;
> > > +
> > > kvm_mmu_free_memory_cache(&vcpu->arch.mmu_pte_list_desc_cache);
> > > - mmu_free_sp_memory_cache(&vcpu->arch.mmu_shadow_page_cache,
> > > - &vcpu->arch.mmu_shadow_page_cache_lock);
> > > +
> > > + for_each_node(nid)
> > > + mmu_free_sp_memory_cache(&vcpu->arch.mmu_shadow_page_cache[nid],
> > > + &vcpu->arch.mmu_shadow_page_cache_lock);
> > > +
> >
> > Was just trying to think if there could be any issue with memory
> > leakage if the online nodes changed, though IDK if any hardware does
> > that.
> > Still, it might be more robust to use ARRAY_SIZE and cover the whole array.
>
> for_each_node() goes through all of the possible nodes on the system,
> whereas for_each_online_node() goes through only the online nodes.
> The current code seems right to me; let me know if I am overlooking
> something.

Ah okay, I didn't see the distinction. That sounds good to me.
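
For anyone else following along: as far as I can tell the two macros differ
only in which node state they walk, roughly (paraphrasing
include/linux/nodemask.h):

        #define for_each_node(node)        for_each_node_state(node, N_POSSIBLE)
        #define for_each_online_node(node) for_each_node_state(node, N_ONLINE)

so for_each_node() covers every possible node ID and won't skip a cache whose
node happens to be offline at free time.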

>
> >
> > > kvm_mmu_free_memory_cache(&vcpu->arch.mmu_shadowed_info_cache);
> > > kvm_mmu_free_memory_cache(&vcpu->arch.mmu_page_header_cache);
> > > }
> > > @@ -2256,11 +2266,12 @@ static struct kvm_mmu_page *__kvm_mmu_get_shadow_page(struct kvm *kvm,
> > >
> > > static struct kvm_mmu_page *kvm_mmu_get_shadow_page(struct kvm_vcpu *vcpu,
> > > gfn_t gfn,
> > > - union kvm_mmu_page_role role)
> > > + union kvm_mmu_page_role role,
> > > + int nid)
> > > {
> > > struct shadow_page_caches caches = {
> > > .page_header_cache = &vcpu->arch.mmu_page_header_cache,
> > > - .shadow_page_cache = &vcpu->arch.mmu_shadow_page_cache,
> > > + .shadow_page_cache = &vcpu->arch.mmu_shadow_page_cache[nid],
> > > .shadowed_info_cache = &vcpu->arch.mmu_shadowed_info_cache,
> > > .shadow_page_cache_lock = &vcpu->arch.mmu_shadow_page_cache_lock
> > > };
> > > @@ -2316,15 +2327,19 @@ static union kvm_mmu_page_role kvm_mmu_child_role(u64 *sptep, bool direct,
> > >
> > > static struct kvm_mmu_page *kvm_mmu_get_child_sp(struct kvm_vcpu *vcpu,
> > > u64 *sptep, gfn_t gfn,
> > > - bool direct, unsigned int access)
> > > + bool direct, unsigned int access,
> > > + kvm_pfn_t pfn)
> > > {
> > > union kvm_mmu_page_role role;
> > > + int nid;
> > >
> > > if (is_shadow_present_pte(*sptep) && !is_large_pte(*sptep))
> > > return ERR_PTR(-EEXIST);
> > >
> > > role = kvm_mmu_child_role(sptep, direct, access);
> > > - return kvm_mmu_get_shadow_page(vcpu, gfn, role);
> > > + nid = kvm_pfn_to_page_table_nid(pfn);
> > > +
> > > + return kvm_mmu_get_shadow_page(vcpu, gfn, role, nid);
> > > }
> > >
> > > static void shadow_walk_init_using_root(struct kvm_shadow_walk_iterator *iterator,
> > > @@ -3208,7 +3223,8 @@ static int direct_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
> > > if (it.level == fault->goal_level)
> > > break;
> > >
> > > - sp = kvm_mmu_get_child_sp(vcpu, it.sptep, base_gfn, true, ACC_ALL);
> > > + sp = kvm_mmu_get_child_sp(vcpu, it.sptep, base_gfn, true,
> > > + ACC_ALL, fault->pfn);
> > > if (sp == ERR_PTR(-EEXIST))
> > > continue;
> > >
> > > @@ -3636,7 +3652,7 @@ static hpa_t mmu_alloc_root(struct kvm_vcpu *vcpu, gfn_t gfn, int quadrant,
> > > WARN_ON_ONCE(quadrant && !role.has_4_byte_gpte);
> > > WARN_ON_ONCE(role.direct && role.has_4_byte_gpte);
> > >
> > > - sp = kvm_mmu_get_shadow_page(vcpu, gfn, role);
> > > + sp = kvm_mmu_get_shadow_page(vcpu, gfn, role, numa_mem_id());
> > > ++sp->root_count;
> > >
> > > return __pa(sp->spt);
> > > @@ -5952,7 +5968,7 @@ static int __kvm_mmu_create(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu)
> > >
> > > int kvm_mmu_create(struct kvm_vcpu *vcpu)
> > > {
> > > - int ret;
> > > + int ret, nid;
> > >
> > > INIT_KVM_MMU_MEMORY_CACHE(&vcpu->arch.mmu_pte_list_desc_cache,
> > > pte_list_desc_cache, NUMA_NO_NODE);
> > > @@ -5960,8 +5976,9 @@ int kvm_mmu_create(struct kvm_vcpu *vcpu)
> > > INIT_KVM_MMU_MEMORY_CACHE(&vcpu->arch.mmu_page_header_cache,
> > > mmu_page_header_cache, NUMA_NO_NODE);
> > >
> > > - INIT_KVM_MMU_MEMORY_CACHE(&vcpu->arch.mmu_shadow_page_cache,
> > > - NULL, NUMA_NO_NODE);
> > > + for_each_node(nid)
> > > + INIT_KVM_MMU_MEMORY_CACHE(&vcpu->arch.mmu_shadow_page_cache[nid],
> > > + NULL, nid);
> > > spin_lock_init(&vcpu->arch.mmu_shadow_page_cache_lock);
> > >
> > > vcpu->arch.mmu = &vcpu->arch.root_mmu;
> > > @@ -6692,13 +6709,17 @@ void kvm_mmu_invalidate_mmio_sptes(struct kvm *kvm, u64 gen)
> > > }
> > >
> > > static unsigned long mmu_shrink_cache(struct kvm_mmu_memory_cache *cache,
> > > + int cache_count,
> > > spinlock_t *cache_lock)
> > > {
> > > unsigned long freed = 0;
> > > + int nid;
> > >
> > > spin_lock(cache_lock);
> > > - if (cache->nobjs)
> > > - freed = kvm_mmu_empty_memory_cache(cache);
> > > + for (nid = 0; nid < cache_count; nid++) {
> > > + if (node_online(nid) && cache[nid].nobjs)
> >
> > Is there any reason to keep the cache if !node_online(nid)?
> > Actually, I'd also just drop the cache_count argument and always
> > iterate over the entire array, only checking nobjs. There's no
> > guarantee I'm aware of that the set of nodes has a sequential series
> > of IDs starting at 0, and you'd get a bug if that wasn't the case,
> > since this only iterates up to nid < cache_count but some of the
> > earlier nids might not have been online.
> >
>
> This is just temporary and will be removed in the next patch in the series.
>
> mmu_shrink_cache() is used for both split_shadow_page_cache (single
> object) and mmu_shadow_page_cache[MAX_NUMNODES].
>
> In the next patch of this series I used for_each_online_node(nid); I
> will change it to for_each_node() in the next version.
>
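
Right, I'd missed that it's shared with the split cache. For the per-node
array case, what I was picturing was just walking every possible node and
only checking nobjs, something like (sketch only):

        for_each_node(nid) {
                if (cache[nid].nobjs)
                        freed += kvm_mmu_empty_memory_cache(&cache[nid]);
        }

but if the next patch reworks this function anyway, whatever shape it takes
there works for me.
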
> > > + freed += kvm_mmu_empty_memory_cache(&cache[nid]);
> > > + }
> > > spin_unlock(cache_lock);
> > > return freed;
> > > }
> > > @@ -6721,13 +6742,15 @@ mmu_shrink_scan(struct shrinker *shrink, struct shrink_control *sc)
> > > list_move_tail(&kvm->vm_list, &vm_list);
> > >
> > > freed += mmu_shrink_cache(&kvm->arch.split_shadow_page_cache,
> > > + 1,
> >
> > So lonely.
> > One.
> > All by itself,
> > with only a comma for company.
> >
> > NIT: This could be merged to the previous or subsequent lines.
>
> This is a strong and independent '1'.
>
> >
> > > &kvm->arch.split_shadow_page_cache_lock);
> > >
> > > if (freed >= sc->nr_to_scan)
> > > break;
> > >
> > > kvm_for_each_vcpu(i, vcpu, kvm) {
> > > - freed += mmu_shrink_cache(&vcpu->arch.mmu_shadow_page_cache,
> > > + freed += mmu_shrink_cache(vcpu->arch.mmu_shadow_page_cache,
> > > + MAX_NUMNODES,
> > > &vcpu->arch.mmu_shadow_page_cache_lock);
> > > }
> > >
> > > diff --git a/arch/x86/kvm/mmu/paging_tmpl.h b/arch/x86/kvm/mmu/paging_tmpl.h
> > > index e5662dbd519c..1ceca62ec4cf 100644
> > > --- a/arch/x86/kvm/mmu/paging_tmpl.h
> > > +++ b/arch/x86/kvm/mmu/paging_tmpl.h
> > > @@ -652,7 +652,7 @@ static int FNAME(fetch)(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault,
> > > table_gfn = gw->table_gfn[it.level - 2];
> > > access = gw->pt_access[it.level - 2];
> > > sp = kvm_mmu_get_child_sp(vcpu, it.sptep, table_gfn,
> > > - false, access);
> > > + false, access, fault->pfn);
> > >
> > > if (sp != ERR_PTR(-EEXIST)) {
> > > /*
> > > @@ -708,7 +708,7 @@ static int FNAME(fetch)(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault,
> > > validate_direct_spte(vcpu, it.sptep, direct_access);
> > >
> > > sp = kvm_mmu_get_child_sp(vcpu, it.sptep, base_gfn,
> > > - true, direct_access);
> > > + true, direct_access, fault->pfn);
> > > if (sp == ERR_PTR(-EEXIST))
> > > continue;
> > >
> > > diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> > > index 376b8dceb3f9..b5abae2366dd 100644
> > > --- a/arch/x86/kvm/mmu/tdp_mmu.c
> > > +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> > > @@ -259,12 +259,12 @@ static struct kvm_mmu_page *tdp_mmu_next_root(struct kvm *kvm,
> > > kvm_mmu_page_as_id(_root) != _as_id) { \
> > > } else
> > >
> > > -static struct kvm_mmu_page *tdp_mmu_alloc_sp(struct kvm_vcpu *vcpu)
> > > +static struct kvm_mmu_page *tdp_mmu_alloc_sp(struct kvm_vcpu *vcpu, int nid)
> > > {
> > > struct kvm_mmu_page *sp;
> > >
> > > sp = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_page_header_cache);
> > > - sp->spt = kvm_mmu_sp_memory_cache_alloc(&vcpu->arch.mmu_shadow_page_cache,
> > > + sp->spt = kvm_mmu_sp_memory_cache_alloc(&vcpu->arch.mmu_shadow_page_cache[nid],
> > > &vcpu->arch.mmu_shadow_page_cache_lock);
> > >
> > > return sp;
> > > @@ -317,7 +317,7 @@ hpa_t kvm_tdp_mmu_get_vcpu_root_hpa(struct kvm_vcpu *vcpu)
> > > goto out;
> > > }
> > >
> > > - root = tdp_mmu_alloc_sp(vcpu);
> > > + root = tdp_mmu_alloc_sp(vcpu, numa_mem_id());
> >
> > Might be worth calling out somewhere that the root page is just
> > allocated based on where the thread allocating it runs.
> >
>
> How about a comment just up here, or do you prefer one at tdp_mmu_roots
> in struct kvm_arch{}?

Here or just in the commit description or cover letter.
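
If it helps, something along these lines above the call site is all I had in
mind (wording is just a suggestion):

        /*
         * The root is allocated on whichever node the current thread
         * happens to be running on; it is not guaranteed to be on the
         * same node as any of the memory it will end up mapping.
         */
        root = tdp_mmu_alloc_sp(vcpu, numa_mem_id());
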
Thanks!

>
> > > tdp_mmu_init_sp(root, NULL, 0, role);
> > >
> > > refcount_set(&root->tdp_mmu_root_count, 1);
> > > @@ -1149,7 +1149,7 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
> > > struct kvm *kvm = vcpu->kvm;
> > > struct tdp_iter iter;
> > > struct kvm_mmu_page *sp;
> > > - int ret = RET_PF_RETRY;
> > > + int ret = RET_PF_RETRY, nid;
> > >
> > > kvm_mmu_hugepage_adjust(vcpu, fault);
> > >
> > > @@ -1178,11 +1178,12 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
> > > !is_large_pte(iter.old_spte))
> > > continue;
> > >
> > > + nid = kvm_pfn_to_page_table_nid(fault->pfn);
> > > /*
> > > * The SPTE is either non-present or points to a huge page that
> > > * needs to be split.
> > > */
> > > - sp = tdp_mmu_alloc_sp(vcpu);
> > > + sp = tdp_mmu_alloc_sp(vcpu, nid);
> > > tdp_mmu_init_child_sp(sp, &iter);
> > >
> > > sp->nx_huge_page_disallowed = fault->huge_page_disallowed;
> > > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> > > index d96c8146e9ba..4f3db7ffeba8 100644
> > > --- a/virt/kvm/kvm_main.c
> > > +++ b/virt/kvm/kvm_main.c
> > > @@ -415,7 +415,7 @@ static inline void *mmu_memory_cache_alloc_obj(struct kvm_mmu_memory_cache *mc,
> > > if (mc->kmem_cache)
> > > return kmem_cache_alloc(mc->kmem_cache, gfp_flags);
> > > else
> > > - return (void *)__get_free_page(gfp_flags);
> > > + return kvm_mmu_get_free_page(mc->node, gfp_flags);
> >
> > You could do part of this change in the commit that introduced
> > kvm_mmu_get_free_page too.
>
> Yeah, I can do it there as well. No strong opinions. I will update it in
> the next version.
>
> > > }
> > >
> > > int __kvm_mmu_topup_memory_cache(struct kvm_mmu_memory_cache *mc, int capacity, int min)
> > > --
> > > 2.39.0.314.g84b9a713c41-goog
> > >