Re: [Xen-devel] [PATCH RFC] xen: if on Xen, "flatten" the scheduling domain hierarchy
From: Dario Faggioli
Date: Wed Sep 23 2015 - 03:25:06 EST
On Mon, 2015-09-21 at 07:49 +0200, Juergen Gross wrote:
> On 09/15/2015 06:50 PM, Dario Faggioli wrote:
> > On Thu, 2015-08-20 at 20:16 +0200, Juergen Groß wrote:
> > > On 08/18/2015 05:55 PM, Dario Faggioli wrote:
> > > > Hey everyone,
> > > >
> > > > So, as a followup of what we were discussing in this thread:
> > > >
> > > > [Xen-devel] PV-vNUMA issue: topology is misinterpreted by
> > > > the guest
> > > > http://lists.xenproject.org/archives/html/xen-devel/2015-07/msg03241.html
> > > >
> > > > I started looking in more detail at scheduling domains in the
> > > > Linux
> > > > kernel. Now, that thread was about CPUID and vNUMA, and their
> > > > weird way
> > > > of interacting, while this thing I'm proposing here is
> > > > completely
> > > > independent from them both.
> > > >
> > > > In fact, no matter whether vNUMA is supported and enabled, and
> > > > no matter
> > > > whether CPUID is reporting accurate, random, meaningful or
> > > > completely
> > > > misleading information, I think that we should do something
> > > > about how
> > > > scheduling domains are built.
> > > >
> > > > Fact is, unless we use 1:1, and immutable (across all the guest
> > > > lifetime) pinning, scheduling domains should not be
> > > > constructed, in
> > > > Linux, by looking at *any* topology information, because that
> > > > just does
> > > > not make any sense, when vcpus move around.
> > > >
> > > > Let me state this again (hoping to make myself as clear as
> > > > possible): no
> > > > matter in how much good shape we put CPUID support, no matter
> > > > how
> > > > beautifully and consistently that will interact with
> > > > vNUMA,
> > > > licensing requirements and whatever else, it will always be
> > > > possible for
> > > > vCPU #0 and vCPU #3 to be scheduled on two SMT threads at time
> > > > t1, and
> > > > on two different NUMA nodes at time t2. Hence, the Linux
> > > > scheduler
> > > > should really not skew its load balancing logic toward any of
> > > > those two
> > > > situations, as neither of them could be considered correct
> > > > (since
> > > > nothing is!).
> > > >
> > > > For now, this only covers the PV case. HVM case shouldn't be
> > > > any
> > > > different, but I haven't looked at how to make the same thing
> > > > happen in
> > > > there as well.
> > > >
> > > > OVERALL DESCRIPTION
> > > > ===================
> > > > What this RFC patch does is, in the Xen PV case, configure
> > > > scheduling
> > > > domains in such a way that there is only one of them, spanning
> > > > all the
> > > > vCPUs of the guest.
> > > >
> > > > Note that the patch deals directly with scheduling domains, and
> > > > there is
> > > > no need to alter the masks that will then be used for building
> > > > and
> > > > reporting the topology (via CPUID, /proc/cpuinfo, /sysfs,
> > > > etc.). That is
> > > > the main difference between it and the patch proposed by
> > > > Juergen here:
> > > > http://lists.xenproject.org/archives/html/xen-devel/2015-07/msg05088.html
> > > >
> > > > This means that when, in future, we will fix CPUID handling and
> > > > make it
> > > > comply with whatever logic or requirements we want, that won't
> > > > have any
> > > > unexpected side effects on scheduling domains.
> > > >
> > > > Information about how the scheduling domains are being
> > > > constructed
> > > > during boot is available in `dmesg', if the kernel is booted
> > > > with the
> > > > 'sched_debug' parameter. It is also possible to look
> > > > at /proc/sys/kernel/sched_domain/cpu*, and at /proc/schedstat.
> > > >
> > > > With the patch applied, only one scheduling domain is created,
> > > > called
> > > > the 'VCPU' domain, spanning all the guest's (or Dom0's) vCPUs.
> > > > You can
> > > > tell that from the fact that every cpu* folder
> > > > in /proc/sys/kernel/sched_domain/ has only one subdirectory
> > > > ('domain0'), with all the tweaks and the tunables for our
> > > > scheduling
> > > > domain.
> > > >
> > > > EVALUATION
> > > > ==========
> > > > I've tested this with UnixBench, and by looking at Xen build
> > > > time, on
> > > > 16, 24 and 48 pCPU hosts. I've run the benchmarks in Dom0
> > > > only, for
> > > > now, but I plan to re-run them in DomUs soon (Juergen may be
> > > > doing
> > > > something similar to this in DomU already, AFAUI).
> > > >
> > > > I've run the benchmarks with and without the patch applied
> > > > ('patched'
> > > > and 'vanilla', respectively, in the tables below), and with
> > > > different
> > > > numbers of build jobs (in case of the Xen build) or of parallel
> > > > copies of
> > > > the benchmarks (in the case of UnixBench).
> > > >
> > > > What I get from the numbers is that the patch almost always
> > > > brings
> > > > benefits, in some cases even huge ones. There are a couple of
> > > > cases
> > > > where we regress, but always only slightly so, especially if
> > > > comparing
> > > > that to the magnitude of some of the improvements that we get.
> > > >
> > > > Bear also in mind that these results are gathered from Dom0,
> > > > and without
> > > > any overcommitment at the vCPU level (i.e., nr. vCPUs == nr
> > > > pCPUs). If
> > > > we move things in DomU and do overcommit at the Xen scheduler
> > > > level, I
> > > > am expecting even better results.
> > > >
> > > ...
> > > > REQUEST FOR COMMENTS
> > > > ====================
> > > > Basically, the kind of feedback I'd be really glad to hear is:
> > > > - what you guys think of the approach,
> > >
> > > Yesterday at the end of the developer meeting we (Andrew, Elena
> > > and
> > > myself) discussed this topic again.
> > >
> > Hey,
> >
> > Sorry for replying so late, I've been on vacation from right after
> > XenSummit up until yesterday. :-)
> >
> > > Regarding a possible future scenario with credit2 eventually
> > > supporting
> > > gang scheduling on hyperthreads (which is desirable due to
> > > security
> > > reasons [side channel attack] and fairness) my patch seems to be
> > > more
> > > suited for that direction than yours.
> > >
> > Ok. Just let me mention that 'Credit2 + gang scheduling' might not
> > be
> > exactly around the corner (although, we can prioritize working on
> > it if
> > we want).
> >
> > In principle, I think it's a really nice idea. I still don't have
> > clear
> > in mind how we would handle a couple of situations, but let's leave
> > this
> > aside for now, and stay on-topic.
> >
> > > Correct me if I'm wrong, but I
> > > think scheduling domains won't enable the guest kernel's
> > > scheduler to
> > > migrate threads more easily between hyperthreads as opposed to other
> > > vcpus,
> > > while my approach can easily be extended to do so.
> > >
> > I'm not sure I understand what you mean here. As far as the (Linux)
> > scheduler is concerned, your patch and mine do the exact same
> > thing:
> > they arrange for the scheduling domains, when they're built, during
> > boot, not to consider hyperthreads or multi-cores.
> >
> > Mine does it by removing the SMT (and the MC) level from the data
> > structure in the scheduler that is used as a base for configuring
> > the
> > scheduling domains. Yours does it by making the topology bitmaps
> > that
> > are used at each one of those levels all look the same. In fact,
> > with
> > your patch applied, I get the exact same situation as with mine, as
> > far
> > as scheduling domains are concerned: there is only one scheduling
> > domain, with a different scheduling group for each vCPU inside it.
>
> Uuh, nearly.
>
> Your case won't deal correctly with NUMA, as the generic NUMA code is
> using set_sched_topology() as well.
>
Mmm... have you tried and seen something like this? AFAICT, the NUMA
related setup steps of scheduling domains happen after the basic (as
in "without taking NUMAness into account") topology has been set
already, and build on top of it.
The NUMA code uses set_sched_topology() only in a special case, which
I'm not sure we'd be hitting.
I'm asking because trying this out, right now, is not straightforward,
as PV vNUMA, even with Wei's Linux patches and with either your patch or
mine, still runs into the CPUID issue... I'll try that ASAP, but
there are a couple of things I've got to finish in the next few days.
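For whenever this can actually be tried, here is a minimal sketch of how
one could dump which scheduling domains ended up being built for each
vCPU, using the /proc/sys/kernel/sched_domain/ directory mentioned
earlier in the thread. The function name list_sched_domains() is made up
for illustration, and the presence of a 'name' file inside each domainN
directory is an assumption about the running kernel (a '?' is reported
when it is missing).

```python
import os
import re


def _numbered(entries, prefix):
    """Keep entries shaped like '<prefix><N>' and sort them by N."""
    pattern = re.compile(re.escape(prefix) + r"(\d+)")
    matches = []
    for entry in entries:
        m = pattern.fullmatch(entry)
        if m:
            matches.append((int(m.group(1)), entry))
    return [entry for _, entry in sorted(matches)]


def list_sched_domains(root="/proc/sys/kernel/sched_domain"):
    """Map each cpuN directory under `root` to [(domainM, name), ...].

    `root` is a parameter because the location is an assumption and
    may not exist on a given kernel; an empty dict is returned then.
    """
    result = {}
    if not os.path.isdir(root):
        return result
    for cpu in _numbered(os.listdir(root), "cpu"):
        cpu_dir = os.path.join(root, cpu)
        domains = []
        for dom in _numbered(os.listdir(cpu_dir), "domain"):
            try:
                with open(os.path.join(cpu_dir, dom, "name")) as f:
                    name = f.read().strip()
            except OSError:
                name = "?"  # no readable 'name' file for this domain
            domains.append((dom, name))
        result[cpu] = domains
    return result


if __name__ == "__main__":
    for cpu, domains in list_sched_domains().items():
        print(cpu, domains)
```

If the flattening works as described, every cpuN entry should list a
single domain0. More than one entry per CPU after enabling vNUMA would
suggest that NUMA levels have been stacked on top.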
> One of NUMA and Xen will win and
> overwrite the other's settings.
>
Not sure what this means, but as I said, I'll try.
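For reference, the comparison made earlier in the thread (removing the
SMT and MC levels versus making their masks uninformative) can be
illustrated with a toy model. This is not kernel code: the rule used for
dropping degenerate levels is a simplification, the mask contents are
assumptions picked for illustration, and all the names (domains_for(),
smt_pair(), and so on) are hypothetical.

```python
from typing import Callable, FrozenSet, List, Tuple

Mask = Callable[[int], FrozenSet[int]]  # cpu -> CPUs sharing that level with it
Level = Tuple[str, Mask]                # (level name, mask function)

NR_VCPUS = 4
ALL = frozenset(range(NR_VCPUS))


def domains_for(cpu: int, levels: List[Level]) -> List[Tuple[str, FrozenSet[int]]]:
    """Walk the levels bottom-up and keep only the non-degenerate ones.

    Simplified rule (an assumption of this model): a level is dropped
    when it spans a single CPU, or when it spans the same CPUs as the
    level kept just below it.
    """
    kept: List[Tuple[str, FrozenSet[int]]] = []
    for name, mask in levels:
        span = mask(cpu)
        if len(span) <= 1:
            continue
        if kept and kept[-1][1] == span:
            continue
        kept.append((name, span))
    return kept


def smt_pair(cpu: int) -> FrozenSet[int]:
    """Two sibling threads per core: {0,1}, {2,3}."""
    return frozenset({cpu & ~1, cpu | 1})


def only_self(cpu: int) -> FrozenSet[int]:
    """A mask that carries no topology information."""
    return frozenset({cpu})


def everyone(cpu: int) -> FrozenSet[int]:
    """A mask spanning all the vCPUs."""
    return ALL


# Hypothetical bare-metal-like layout: 2 cores with 2 threads each.
native = [("SMT", smt_pair), ("MC", everyone), ("DIE", everyone)]
# Approach 1: drop SMT and MC, leave one level spanning all vCPUs.
removed_levels = [("VCPU", everyone)]
# Approach 2: keep the levels, but with SMT and MC masks that say nothing.
flat_masks = [("SMT", only_self), ("MC", only_self), ("DIE", everyone)]

if __name__ == "__main__":
    for label, levels in [("native", native),
                          ("removed levels", removed_levels),
                          ("flat masks", flat_masks)]:
        print(label, [(n, sorted(s)) for n, s in domains_for(0, levels)])
```

In this model, vCPU 0 keeps two levels with the native-like layout,
while both flattened variants keep exactly one, spanning all four vCPUs.
Only the name of the surviving level differs. Whether NUMA levels added
on top behave the same with both approaches is not covered here.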
Regards,
Dario
--
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)