Re: [Xen-devel] [PATCH RFC] xen: if on Xen, "flatten" the scheduling domain hierarchy

From: Juergen Gross
Date: Mon Sep 21 2015 - 01:49:14 EST


On 09/15/2015 06:50 PM, Dario Faggioli wrote:
On Thu, 2015-08-20 at 20:16 +0200, Juergen Groß wrote:
On 08/18/2015 05:55 PM, Dario Faggioli wrote:
Hey everyone,

So, as a followup of what we were discussing in this thread:

[Xen-devel] PV-vNUMA issue: topology is misinterpreted by the guest
http://lists.xenproject.org/archives/html/xen-devel/2015-07/msg03241.html

I started looking in more details at scheduling domains in the Linux
kernel. Now, that thread was about CPUID and vNUMA, and their weird way
of interacting, while this thing I'm proposing here is completely
independent from them both.

In fact, no matter whether vNUMA is supported and enabled, and no matter
whether CPUID is reporting accurate, random, meaningful or completely
misleading information, I think that we should do something about how
scheduling domains are built.

Fact is, unless we use 1:1, and immutable (across all the guest
lifetime) pinning, scheduling domains should not be constructed, in
Linux, by looking at *any* topology information, because that just does
not make any sense, when vcpus move around.

Let me state this again (hoping to make myself as clear as possible): no
matter in how much good shape we put CPUID support, no matter how
beautifully and consistently that will interact with vNUMA, licensing
requirements and whatever else, it will always be possible for
vCPU #0 and vCPU #3 to be scheduled on two SMT threads at time t1, and
on two different NUMA nodes at time t2. Hence, the Linux scheduler
should really not skew its load balancing logic toward either of those
two situations, as neither of them could be considered correct (since
nothing is!).
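
To make the t1/t2 argument concrete, here is a minimal sketch. The host
layout and the two placements are made up for illustration; they are not
taken from any real machine or from the patch.

```python
from typing import Dict, Tuple

# Hypothetical host layout: pCPU id -> (NUMA node, core, SMT thread).
# Core ids are unique across the whole host in this toy layout.
PCPU_TOPOLOGY: Dict[int, Tuple[int, int, int]] = {
    0: (0, 0, 0), 1: (0, 0, 1),
    2: (0, 1, 0), 3: (0, 1, 1),
    4: (1, 2, 0), 5: (1, 2, 1),
    6: (1, 3, 0), 7: (1, 3, 1),
}


def relation(pcpu_a: int, pcpu_b: int) -> str:
    """Describe how two pCPUs relate in the (hypothetical) host topology."""
    if pcpu_a == pcpu_b:
        return "same pCPU"
    node_a, core_a, _ = PCPU_TOPOLOGY[pcpu_a]
    node_b, core_b, _ = PCPU_TOPOLOGY[pcpu_b]
    if node_a != node_b:
        return "different NUMA nodes"
    if core_a != core_b:
        return "same node, different cores"
    return "SMT siblings"


# vCPU -> pCPU placement, as picked by the hypervisor at two points in time.
placement_t1 = {0: 0, 3: 1}
placement_t2 = {0: 2, 3: 6}

print(relation(placement_t1[0], placement_t1[3]))  # SMT siblings
print(relation(placement_t2[0], placement_t2[3]))  # different NUMA nodes
```

Whatever relationship the guest assumed between vCPU #0 and vCPU #3 at
boot, it holds for at most one of the two placements.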

For now, this only covers the PV case. HVM case shouldn't be any
different, but I haven't looked at how to make the same thing happen in
there as well.

OVERALL DESCRIPTION
===================
What this RFC patch does is, in the Xen PV case, configure scheduling
domains in such a way that there is only one of them, spanning all the
vCPUs of the guest.

Note that the patch deals directly with scheduling domains, and there is
no need to alter the masks that will then be used for building and
reporting the topology (via CPUID, /proc/cpuinfo, /sysfs, etc.). That is
the main difference between it and the patch proposed by Juergen here:
http://lists.xenproject.org/archives/html/xen-devel/2015-07/msg05088.html

This means that when, in future, we will fix CPUID handling and make it
comply with whatever logic or requirements we want, that won't have any
unexpected side effects on scheduling domains.

Information about how the scheduling domains are being constructed
during boot is available in `dmesg', if the kernel is booted with the
'sched_debug' parameter. It is also possible to look
at /proc/sys/kernel/sched_domain/cpu*, and at /proc/schedstat.

With the patch applied, only one scheduling domain is created, called
the 'VCPU' domain, spanning all the guest's (or Dom0's) vCPUs. You can
tell that from the fact that every cpu* folder
in /proc/sys/kernel/sched_domain/ has only one subdirectory
('domain0'), with all the tweaks and the tunables for our scheduling
domain.
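
The check described above can be scripted. This is only a sketch: the
directory layout (cpu*/domainN) is the one mentioned in this message,
and the helper names domains_per_cpu() and is_flat() are made up here.
The root is a parameter in case the files live elsewhere on a given
kernel.

```python
import glob
import os


def domains_per_cpu(root="/proc/sys/kernel/sched_domain"):
    """Map each cpu* directory under root to its sorted domainN entries."""
    result = {}
    for cpu_dir in sorted(glob.glob(os.path.join(root, "cpu*"))):
        domains = sorted(
            entry for entry in os.listdir(cpu_dir)
            if entry.startswith("domain")
            and os.path.isdir(os.path.join(cpu_dir, entry))
        )
        result[os.path.basename(cpu_dir)] = domains
    return result


def is_flat(per_cpu):
    """True if every CPU has exactly one scheduling domain, 'domain0'."""
    return bool(per_cpu) and all(d == ["domain0"] for d in per_cpu.values())


if __name__ == "__main__":
    per_cpu = domains_per_cpu()
    for cpu, domains in per_cpu.items():
        print(cpu, domains)
    print("flat hierarchy:", is_flat(per_cpu))
```

If the directory does not exist (for instance, the kernel was built
without scheduler debugging), the mapping is empty and is_flat()
reports False.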

EVALUATION
==========
I've tested this with UnixBench, and by looking at Xen build time, on
16, 24 and 48 pCPU hosts. I've run the benchmarks in Dom0 only, for
now, but I plan to re-run them in DomUs soon (Juergen may be doing
something similar to this in DomU already, AFAIU).

I've run the benchmarks with and without the patch applied ('patched'
and 'vanilla', respectively, in the tables below), and with different
numbers of build jobs (in the case of the Xen build) or of parallel
copies of the benchmarks (in the case of UnixBench).

What I get from the numbers is that the patch almost always brings
benefits, in some cases even huge ones. There are a couple of cases
where we regress, but always only slightly so, especially if comparing
that to the magnitude of some of the improvements that we get.

Bear also in mind that these results are gathered from Dom0, and without
any overcommitment at the vCPU level (i.e., nr. vCPUs == nr pCPUs). If
we move things in DomU and do overcommit at the Xen scheduler level, I
am expecting even better results.

...
REQUEST FOR COMMENTS
====================
Basically, the kind of feedback I'd be really glad to hear is:
- what you guys think of the approach,

Yesterday at the end of the developer meeting we (Andrew, Elena and
myself) discussed this topic again.

Hey,

Sorry for replying so late, I've been on vacation from right after
XenSummit up until yesterday. :-)

Regarding a possible future scenario with credit2 eventually supporting
gang scheduling on hyperthreads (which is desirable due to security
reasons [side channel attack] and fairness) my patch seems to be more
suited for that direction than yours.

Ok. Just let me mention that 'Credit2 + gang scheduling' might not be
exactly around the corner (although, we can prioritize working on it if
we want).

In principle, I think it's a really nice idea. I still don't have clear
in mind how we would handle a couple of situations, but let's leave this
aside for now, and stay on-topic.

Correct me if I'm wrong, but I
think scheduling domains won't enable the guest kernel's scheduler to
migrate threads more easily between hyperthreads, as opposed to other
vcpus, while my approach can easily be extended to do so.

I'm not sure I understand what you mean here. As far as the (Linux)
scheduler is concerned, your patch and mine do the exact same thing:
they arrange for the scheduling domains, when they're built, during
boot, not to consider hyperthreads or multi-cores.

Mine does it by removing the SMT (and the MC) level from the data
structure in the scheduler that is used as a base for configuring the
scheduling domains. Yours does it by making the topology bitmaps that
are used at each one of those levels all look the same. In fact, with
your patch applied, I get the exact same situation as with mine, as far
as scheduling domains are concerned: there is only one scheduling
domain, with a different scheduling group for each vCPU inside it.
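
A toy model of why the two approaches end up with the same scheduling
domains. This is not kernel code: build_domains() is a made-up helper,
and the rule for dropping redundant levels is a simplification of what
the scheduler does when a level adds nothing over the one below it.

```python
ALL_VCPUS = frozenset(range(4))  # hypothetical 4-vCPU guest


def build_domains(cpu, level_masks):
    """Return the domain spans for cpu, lowest level first.

    level_masks is a list of functions mapping a CPU to the set of CPUs
    it shares that topology level with. Levels spanning a single CPU, or
    spanning the same CPUs as the level below, are dropped.
    """
    domains = []
    for mask_of in level_masks:
        span = mask_of(cpu)
        if len(span) <= 1 or (domains and span == domains[-1]):
            continue
        domains.append(span)
    return domains


# Native-looking topology: SMT pairs, then two levels spanning everything.
native = [
    lambda cpu: frozenset({cpu, cpu ^ 1}),
    lambda cpu: ALL_VCPUS,
    lambda cpu: ALL_VCPUS,
]

# Approach 1: only one level is left in the table.
single_level = [lambda cpu: ALL_VCPUS]

# Approach 2: all levels are kept, but every mask looks the same.
flattened_masks = [lambda cpu: ALL_VCPUS] * 3

print(len(build_domains(0, native)))           # 2
print(len(build_domains(0, single_level)))     # 1
print(len(build_domains(0, flattened_masks)))  # 1
```

In this model both approaches leave each vCPU with a single domain
spanning the whole guest; they only differ in where the flattening is
done.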

Uuh, nearly.

Your case won't deal correctly with NUMA, as the generic NUMA code is
using set_sched_topology() as well. One of NUMA and Xen will win and
overwrite the other's settings.

To do things correctly you will have to handle NUMA as well.
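
A toy illustration of the overwrite. This only models "one table, last
writer wins"; it is not the kernel's implementation, and the level
names other than 'VCPU' are placeholders chosen for the example.

```python
# One global table shared by everybody, as in the situation described above.
_sched_domain_topology = ["SMT", "MC", "DIE"]


def set_sched_topology(table):
    """Replace the global table; whoever calls last wins."""
    global _sched_domain_topology
    _sched_domain_topology = list(table)


def current_topology():
    return list(_sched_domain_topology)


def xen_flatten():
    # Hypothetical stand-in for the Xen code installing a single level.
    set_sched_topology(["VCPU"])


def numa_init():
    # Hypothetical stand-in for the NUMA code: it extends the default
    # levels and knows nothing about what Xen installed.
    set_sched_topology(["SMT", "MC", "DIE", "NUMA"])


xen_flatten()
numa_init()
print(current_topology())  # ['SMT', 'MC', 'DIE', 'NUMA']

numa_init()
xen_flatten()
print(current_topology())  # ['VCPU']
```

With either ordering one side's settings are lost, which is why the NUMA
levels have to be handled together with the flattening.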


Juergen
