Re: [Patch v4 11/18] KVM: x86/mmu: Add documentation of NUMA aware page table capability

From: Vipin Sharma
Date: Tue Mar 28 2023 - 12:48:21 EST


On Thu, Mar 23, 2023 at 2:59 PM David Matlack <dmatlack@xxxxxxxxxx> wrote:
>
> On Mon, Mar 06, 2023 at 02:41:20PM -0800, Vipin Sharma wrote:
> > Add documentation for KVM_CAP_NUMA_AWARE_PAGE_TABLE capability and
> > explain why it is needed.
> >
> > Signed-off-by: Vipin Sharma <vipinsh@xxxxxxxxxx>
> > ---
> > Documentation/virt/kvm/api.rst | 29 +++++++++++++++++++++++++++++
> > 1 file changed, 29 insertions(+)
> >
> > diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
> > index 62de0768d6aa..7e3a1299ca8e 100644
> > --- a/Documentation/virt/kvm/api.rst
> > +++ b/Documentation/virt/kvm/api.rst
> > @@ -7669,6 +7669,35 @@ This capability is aimed to mitigate the threat that malicious VMs can
> > cause CPU stuck (due to event windows don't open up) and make the CPU
> > unavailable to host or other VMs.
> >
> > +7.34 KVM_CAP_NUMA_AWARE_PAGE_TABLE
> > +------------------------------
> > +
> > +:Architectures: x86
> > +:Target: VM
> > +:Returns: 0 on success, -EINVAL if vCPUs are already created.
> > +
> > +This capability allows userspace to enable NUMA aware page tables allocations.
>
> Call out that this capability overrides task mempolicies. e.g.
>
> This capability causes KVM to use a custom NUMA memory policy when
> allocating page tables. Specifically, KVM will attempt to co-locate
> page table pages with the memory that they map, rather than following
> the mempolicy of the current task.
>
> > +NUMA aware page tables are disabled by default. Once enabled, prior to vCPU
> > +creation, any page table allocated during the life of a VM will be allocated
>
> The "prior to vCPU creation" part here is confusing because it sounds
> like you're talking about any page tables allocated before vCPU
> creation. Just delete that part and put it in a separate paragraph.
>
> KVM_CAP_NUMA_AWARE_PAGE_TABLE must be enabled before any vCPU is
> created, otherwise KVM will return -EINVAL.
>
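Will do. For reference, the intended usage from userspace is roughly the
following (untested sketch; KVM_CAP_NUMA_AWARE_PAGE_TABLE is the capability
number added earlier in this series, everything else is the standard
KVM_ENABLE_CAP flow on the VM fd):

  #include <err.h>
  #include <fcntl.h>
  #include <sys/ioctl.h>
  #include <linux/kvm.h>

  int main(void)
  {
          int kvm = open("/dev/kvm", O_RDWR);
          int vm = ioctl(kvm, KVM_CREATE_VM, 0);
          struct kvm_enable_cap cap = {
                  .cap = KVM_CAP_NUMA_AWARE_PAGE_TABLE,
          };

          if (kvm < 0 || vm < 0)
                  err(1, "KVM setup");

          /* Must happen before any KVM_CREATE_VCPU, otherwise -EINVAL. */
          if (ioctl(vm, KVM_ENABLE_CAP, &cap))
                  err(1, "KVM_ENABLE_CAP(NUMA_AWARE_PAGE_TABLE)");

          /* vCPU creation and the rest of VM setup go below this point. */
          return 0;
  }
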
> > +preferably from the NUMA node of the leaf page.
> > +
> > +Without this capability, default feature is to use current thread mempolicy and
>
> s/default feature is to/KVM will/
>
> > +allocate page table based on that.
>
> s/and allocate page table based on that./to allocate page tables./
>
> > +
> > +This capability is useful to improve page accesses by a guest. For example, an
>
> nit: Be more specific about how.
>
> This capability aims to minimize the cost of TLB misses when a vCPU is
> accessing NUMA-local memory, by reducing the number of remote memory
> accesses needed to walk KVM's page tables.
>
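Agreed, that is clearer. It might also be worth one sentence on the
magnitude: each walk of KVM's 4-level TDP page tables is up to four
dependent memory reads, so when all four levels sit on a remote node every
TLB miss pays the remote-access penalty four times before the data access
itself. With purely illustrative latencies of ~80 ns local vs ~140 ns remote
DRAM, that is about 4 * (140 - 80) = 240 ns of avoidable overhead per miss.
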
> > +initialization thread which access lots of remote memory and ends up creating
> > +page tables on local NUMA node, or some service thread allocates memory on
> > +remote NUMA nodes and later worker/background threads accessing that memory
> > +will end up accessing remote NUMA node page tables.
>
> It's not clear if these examples are talking about what happens when
> KVM_CAP_NUMA_AWARE_PAGE_TABLE is enabled or disabled.
>
> Also it's important to distinguish virtual NUMA nodes from physical NUMA
> nodes and where these "threads" are running. How about this:
>
> For example, when KVM_CAP_NUMA_AWARE_PAGE_TABLE is disabled and a vCPU
> accesses memory on a remote NUMA node and triggers a KVM page fault,
> KVM will allocate page tables to handle that fault on the node where
> the vCPU is running rather than the node where the memory is allocated.
> When KVM_CAP_NUMA_AWARE_PAGE_TABLE is enabled, KVM will allocate the
> page tables on the node where the memory is located.
>
> This is intended to be used in VM configurations that properly
> virtualize NUMA. i.e. VMs with one or more virtual NUMA nodes, each of
> which is mapped to a physical NUMA node. With this capability enabled
> on such VMs, any guest memory access to virtually-local memory will be
> translated through mostly[*] physically-local page tables, regardless
> of how the memory was faulted in.
>
> [*] KVM will fall back to allocating from remote NUMA nodes if the
> preferred node is out of memory. Also, in VMs with 2 or more NUMA
> nodes, higher level page tables will necessarily map memory across
> multiple physical nodes.
>
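I will document all of this. For reference, the allocation side is roughly
the following (simplified, hypothetical sketch; the helper names are made up
here, it assumes the usual kvm_host.h/gfp.h context, and it is not the exact
code in this series): take the node of the pfn being mapped as a preference,
and keep the existing behavior when that node cannot be determined.

  /* Preferred node for a new page-table page, from the pfn it will map. */
  static int kvm_mmu_pgtable_nid(kvm_pfn_t pfn)
  {
          /* No usable struct page (e.g. MMIO/device memory). */
          if (!pfn_valid(pfn))
                  return NUMA_NO_NODE;

          return page_to_nid(pfn_to_page(pfn));
  }

  static void *kvm_mmu_alloc_pgtable_page(kvm_pfn_t pfn)
  {
          int nid = kvm_mmu_pgtable_nid(pfn);
          struct page *page;

          if (nid == NUMA_NO_NODE) {
                  /* Unknown node: allocate as before, per the task mempolicy. */
                  page = alloc_page(GFP_KERNEL_ACCOUNT | __GFP_ZERO);
          } else {
                  /*
                   * @nid is only a preference; the allocator can still fall
                   * back to a remote node if @nid is out of memory.
                   */
                  page = alloc_pages_node(nid, GFP_KERNEL_ACCOUNT | __GFP_ZERO, 0);
          }

          return page ? page_address(page) : NULL;
  }
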
> > So, a multi NUMA node
> > +guest, can with high confidence access local memory faster instead of going
> > +through remote page tables first.
> > +
> > +This capability is also helpful for host to reduce live migration impact when
> > +splitting huge pages during dirty log operations. If the thread splitting huge
> > +page is on remote NUMA node it will create page tables on remote node. Even if
> > +guest is careful in making sure that it only access local memory they will end
> > +up accessing remote page tables.
>
> Please also cover the limitations of this feature:
>
> - Impact on remote memory accesses (more expensive).
> - How KVM handles NUMA node exhaustion.
> - How high-level page tables can span multiple nodes.
> - What KVM does if it can't determine the NUMA node of the pfn.
> - What KVM does for faults on GPAs that aren't backed by a pfn.
>

Thanks for the suggestions, I will incorporate them in the next version.