Re: [PATCH] sched, fair: Allow a small degree of load imbalance between SD_NUMA domains v2

From: Mel Gorman
Date: Wed Jan 08 2020 - 03:49:43 EST


On Wed, Jan 08, 2020 at 09:25:38AM +0100, Vincent Guittot wrote:
> On Tue, 7 Jan 2020 at 21:24, Mel Gorman <mgorman@xxxxxxxxxxxxxxxxxxx> wrote:
> >
> > On Tue, Jan 07, 2020 at 05:00:29PM +0100, Vincent Guittot wrote:
> > > > > Taking into account the child domain makes sense to me, but shouldn't we
> > > > > take into account the number of child groups instead? This should
> > > > > reflect the number of different LLC caches.
> > > >
> > > > I guess it would but why is it inherently better? The number of domains
> > > > would yield a similar result if we assume that all the lower domains
> > > > have equal weight so it simply becomes the weight of the SD_NUMA domain
> > > > divided by the number of child domains.
> > >
> > > but that's not what you are doing in your proposal. You are using
> > > child->span_weight directly, which reflects the number of CPUs in the
> > > child and not the number of groups
> > >
> > > you should do something like sds->busiest->span_weight /
> > > sds->busiest->child->span_weight which gives you an approximation of
> > > the number of independent groups inside the busiest numa node from a
> > > shared resource pov
> > >
> >
> > Now I get you, but unfortunately it also would not work out. The number
> > of groups is not related to the LLC except in some specific cases.
> > It's possible to use the first CPU to find the size of an LLC but now I
> > worry that it would lead to unpredictable behaviour. AMD has different
> > numbers of LLCs per node depending on the CPU family and while Intel
> > generally has one LLC per node, I imagine there are counter examples.
> > This means that load balancing on different machines with similar core
> > counts will behave differently due to the LLC size. It might be possible
>
> But the degree of allowed imbalance is related to this topology so
> using the same value for those different machines will generate
> different behavior because they don't have the same HW topology but we
> use the same threshold
>

The differences in behaviour would be marginal given that the original
fixed value for the v3 patch would generally be smaller than an LLC. For
the moment, I'm assuming that v4 will be based on the number of CPUs in
the node.

> > to infer it if the intermediate domain was DIE instead of MC but I doubt
> > that's guaranteed and it would still be unpredictable. It may be the type
> > of complexity that should only be introduced with a separate patch with
> > clear rationale as to why it's necessary and we are not at that threshold
> > so I withdraw the suggestion.
>
> The problem is that your proposal is not aligned to what you would like
> to do: You want to take into account the number of groups but you use
> the number of CPUs per group instead
>

I'm dropping the check of the child domain entirely. The lookups to get
the LLC size are relatively expensive without any data indicating it's
worthwhile.
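
For reference, the distinction discussed above (CPUs per child versus number
of child groups) can be illustrated with a small standalone sketch. This is
not kernel code and not part of any version of the patch: struct toy_domain
and approx_nr_groups() are hypothetical names made up for illustration, and
only the span_weight and child fields mirror what was quoted in the thread.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical stand-in for a scheduler domain. Only the two fields
 * mentioned in the thread are modelled; this is not the kernel structure.
 */
struct toy_domain {
	unsigned int span_weight;	/* number of CPUs spanned by this domain */
	struct toy_domain *child;	/* next lower domain, or NULL */
};

/*
 * Approximate the number of child groups as the parent's span weight
 * divided by the child's span weight. This assumes that all children
 * have equal weight, which is the assumption stated in the thread.
 * A domain without a usable child is treated as a single group.
 */
static unsigned int approx_nr_groups(const struct toy_domain *sd)
{
	assert(sd != NULL);

	if (!sd->child || sd->child->span_weight == 0)
		return 1;

	return sd->span_weight / sd->child->span_weight;
}
```

With made-up numbers, a node spanning 48 CPUs whose child spans 8 CPUs has a
child->span_weight of 8 (CPUs per child) while the ratio gives 6 (groups).
As noted above, that ratio only corresponds to the number of LLCs when the
child domain happens to match the LLC, which is not guaranteed.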

--
Mel Gorman
SUSE Labs