Re: [PATCH RFC] xen: if on Xen, "flatten" the scheduling domain hierarchy
From: Konrad Rzeszutek Wilk
Date: Tue Aug 18 2015 - 12:53:55 EST
On August 18, 2015 8:55:32 AM PDT, Dario Faggioli <dario.faggioli@xxxxxxxxxx> wrote:
>Hey everyone,
>
>So, as a followup of what we were discussing in this thread:
>
> [Xen-devel] PV-vNUMA issue: topology is misinterpreted by the guest
>http://lists.xenproject.org/archives/html/xen-devel/2015-07/msg03241.html
>
>I started looking in more detail at scheduling domains in the Linux
>kernel. Now, that thread was about CPUID and vNUMA, and their weird way
>of interacting, while what I'm proposing here is completely
>independent of them both.
>
>In fact, no matter whether vNUMA is supported and enabled, and no
>matter whether CPUID is reporting accurate, random, meaningful or
>completely misleading information, I think we should do something
>about how scheduling domains are built.
>
>Fact is, unless we use 1:1, immutable (across the whole guest
>lifetime) pinning, scheduling domains should not be constructed, in
>Linux, by looking at *any* topology information, because that just
>does not make any sense when vCPUs move around.
>
>Let me state this again (hoping to make myself as clear as possible):
>no matter how good a shape we put CPUID support in, no matter how
>beautifully and consistently it will interact with vNUMA, licensing
>requirements and whatever else, it will always be possible for
>vCPU #0 and vCPU #3 to be scheduled on two SMT threads at time t1, and
>on two different NUMA nodes at time t2. Hence, the Linux scheduler
>really should not skew its load balancing logic toward either of those
>two situations, as neither of them can be considered correct (since
>nothing is!).
What about Windows guests?
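(For context on the "skew" being described here: the bias comes from the
per-level flags in the scheduler's topology table. From memory, the default
table in kernel/sched/core.c of that era looked roughly like the sketch
below; the exact flags, helpers and config guards are illustrative, not
authoritative.)

static struct sched_domain_topology_level default_topology[] = {
#ifdef CONFIG_SCHED_SMT
	/* SMT level: balancing assumes these CPUs are sibling threads */
	{ cpu_smt_mask, cpu_smt_flags, SD_INIT_NAME(SMT) },
#endif
#ifdef CONFIG_SCHED_MC
	/* MC level: assumes these CPUs share a last-level cache */
	{ cpu_coregroup_mask, cpu_core_flags, SD_INIT_NAME(MC) },
#endif
	/* DIE level: all the CPUs in the package/node */
	{ cpu_cpu_mask, SD_INIT_NAME(DIE) },
	{ NULL, },
};

If vCPU #0 and vCPU #3 stop being SMT siblings on the host at time t2, the
SMT and MC levels built from a table like that encode assumptions that no
longer hold, which is exactly the skew Dario is pointing at.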
>
>For now, this only covers the PV case. The HVM case shouldn't be any
>different, but I haven't looked at how to make the same thing happen
>there as well.
>
>OVERALL DESCRIPTION
>===================
>What this RFC patch does is, in the Xen PV case, configure scheduling
>domains in such a way that there is only one of them, spanning all the
>vCPUs of the guest.
Wow. That is a pretty simple patch!!
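For readers without the patch in front of them: the core of it can be as
small as registering a flat, single-level topology table with the scheduler
through the generic set_sched_topology() hook, called from Xen's PV setup
code. A minimal sketch, assuming that hook and with illustrative xen_*
names (not necessarily the exact code of the patch):

/* One scheduling domain level, spanning every CPU the guest sees. */
static struct sched_domain_topology_level xen_sched_domain_topology[] = {
	{ cpu_cpu_mask, SD_INIT_NAME(VCPU) },
	{ NULL, },
};

static void __init xen_set_sched_topology(void)
{
	/* Replace the default SMT/MC/DIE hierarchy with the flat one. */
	set_sched_topology(xen_sched_domain_topology);
}

Calling something like xen_set_sched_topology() early enough in the PV boot
path (before the scheduler builds its domains) is what produces the single
'VCPU' domain described below.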
>
>Note that the patch deals directly with scheduling domains, and there
>is no need to alter the masks that will then be used for building and
>reporting the topology (via CPUID, /proc/cpuinfo, sysfs, etc.). That
>is the main difference between it and the patch proposed by Juergen
>here:
>http://lists.xenproject.org/archives/html/xen-devel/2015-07/msg05088.html
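(To make "the masks" concrete: the topology reported to userspace via sysfs
comes from the per-cpu sibling/core maps, themselves derived from CPUID at
boot, exposed on x86 through macros roughly like the ones below. I'm quoting
them from memory, and the exact names have changed across releases. The
point is that this patch leaves them alone and only changes what the
scheduler builds its domains from, whereas Juergen's patch adjusted the
masks themselves.)

/* sysfs topology reporting is built from these per-cpu maps -- untouched here */
#define topology_core_cpumask(cpu)	(per_cpu(cpu_core_map, cpu))
#define topology_sibling_cpumask(cpu)	(per_cpu(cpu_sibling_map, cpu))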
>
>This means that when, in future, we fix CPUID handling and make it
>comply with whatever logic or requirements we want, that won't have
>any unexpected side effects on scheduling domains.
>
>Information about how the scheduling domains are constructed during
>boot is available in `dmesg', if the kernel is booted with the
>'sched_debug' parameter. It is also possible to look
>at /proc/sys/kernel/sched_domain/cpu*, and at /proc/schedstat.
>
>With the patch applied, only one scheduling domain is created, called
>the 'VCPU' domain, spanning all the guest's (or Dom0's) vCPUs. You can
>tell that from the fact that every cpu* folder
>in /proc/sys/kernel/sched_domain/ has only one subdirectory
>('domain0'), with all the tweaks and tunables for our scheduling
>domain.
>
...
>
>REQUEST FOR COMMENTS
>====================
>Basically, the kind of feedback I'd be really glad to hear is:
> - what you guys think of the approach,
> - whether you think, looking at this preliminary set of numbers, that
>   this is something worth investigating further,
> - if yes, what other workloads and benchmarks it would make sense to
>   throw at it.
>
The thing I was worried about was that we would have to modify generic code, but your changes are all in Xen code!
Woot!
In terms of workloads, I am CCing Herbert, who I hope can provide advice on this.
Herbert, the full email is here:
http://lists.xen.org/archives/html/xen-devel/2015-08/msg01691.html
>Thanks and Regards,
>Dario
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/