Re: [RFC:kvm] export host NUMA info to guest & make emulated deviceNUMA attr

From: Michael S. Tsirkin
Date: Wed May 23 2012 - 11:29:12 EST


On Wed, May 23, 2012 at 09:52:15AM -0500, Andrew Theurer wrote:
> On 05/22/2012 04:28 AM, Liu ping fan wrote:
> >On Sat, May 19, 2012 at 12:14 AM, Shirley Ma<mashirle@xxxxxxxxxx> wrote:
> >>On Thu, 2012-05-17 at 17:20 +0800, Liu Ping Fan wrote:
> >>>Currently, the guest can not know the NUMA info of the vcpu, which
> >>>will
> >>>result in performance drawback.
> >>>
> >>>This is the discovered and experiment by
> >>> Shirley Ma<xma@xxxxxxxxxx>
> >>> Krishna Kumar<krkumar2@xxxxxxxxxx>
> >>> Tom Lendacky<toml@xxxxxxxxxx>
> >>>Refer to -
> >>>http://www.mail-archive.com/kvm@xxxxxxxxxxxxxxx/msg69868.html
> >>>we can see the big perfermance gap between NUMA aware and unaware.
> >>>
> >>>Enlightened by their discovery, I think, we can do more work -- that
> >>>is to
> >>>export NUMA info of host to guest.
> >>
> >>There three problems we've found:
> >>
> >>1. KVM doesn't support NUMA load balancer. Even there are no other
> >>workloads in the system, and the number of vcpus on the guest is smaller
> >>than the number of cpus per node, the vcpus could be scheduled on
> >>different nodes.
> >>
> >>Someone is working on in-kernel solution. Andrew Theurer has a working
> >>user-space NUMA aware VM balancer, it requires libvirt and cgroups
> >>(which is default for RHEL6 systems).
> >>
> >Interesting, and I found that "sched/numa: Introduce
> >sys_numa_{t,m}bind()" committed by Peter and Ingo may help.
> >But I think from the guest view, it can not tell whether the two vcpus
> >are on the same host node. For example,
> >vcpu-a in node-A is not vcpu-b in node-B, the guest lb will be more
> >expensive if it pull_task from vcpu-a and
> >choose vcpu-b to push. And my idea is to export such info to guest,
> >still working on it.
>
> The long term solution is to two-fold:
> 1) Guests that are quite large (in that they cannot fit in a host
> NUMA node) must have static mulit-node NUMA topology implemented by
> Qemu. That is here today, but we do not do it automatically, which
> is probably going to be a VM management responsibility.
> 2) Host scheduler and NUMA code must be enhanced to get better
> placement of Qemu memory and threads. For single-node vNUMA guests,
> this is easy, put it all in one node. For mulit-node vNUMA guests,
> the host must understand that some Qemu memory belongs with certain
> vCPU threads (which make up one of the guests vNUMA nodes), and then
> place that memory/threads in a specific host node (and continue for
> other memory/threads for each Qemu vNUMA node).

And for IO, we need multiqueue devices such that each
node can have its own queue in its local memory.

--
MST
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/