Re: [PATCH] sched, fair: Allow a small degree of load imbalance between SD_NUMA domains v2

From: Mel Gorman
Date: Tue Jan 07 2020 - 06:56:51 EST


On Tue, Jan 07, 2020 at 12:17:12PM +0100, Vincent Guittot wrote:
> On Tue, 7 Jan 2020 at 10:56, Mel Gorman <mgorman@xxxxxxxxxxxxxxxxxxx> wrote:
> >
> > On Tue, Jan 07, 2020 at 09:38:26AM +0100, Vincent Guittot wrote:
> > > > > This looks weird to me because you use imbalance_pct, which is
> > > > > meaningful only when comparing ratios, to define a number that will
> > > > > then be compared to a number of tasks without taking into account
> > > > > the weight of the node. So whatever the node size, 32 or 128 CPUs,
> > > > > the imbalance_adj will be the same: 3 with the default NUMA-level
> > > > > imbalance_pct of 125 AFAICT
> > > > >
> > > >
> > > > The intent in this version was to only cover the low utilisation case
> > > > regardless of the NUMA node size. There were too many corner cases
> > > > where the tradeoff of local memory latency versus local memory bandwidth
> > > > cannot be quantified. See Srikar's report as an example but it's a general
> > > > problem. The use of imbalance_pct was simply to find the smallest number of
> > > > running tasks where (imbalance_pct - 100) would be 1 running task and limit
> > >
> > > But using imbalance_pct alone doesn't mean anything.
> >
> > Other than figuring out "how many running tasks are required before
> > imbalance_pct is roughly equivalent to one fully active CPU?". Even then,
>
> sorry, I don't see how you deduce this from only using imbalance_pct,
> which is mainly there to say what percentage of difference is
> significant
>

Because if the difference is 25%, then 1 CPU out of 4 being active is
enough for imbalance_pct to potentially be a factor. Anyway, the approach
seems almost universally disliked, so even if I had reasons for not
scaling this by the group_weight, no one appears to agree with them :P
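
Roughly speaking, and purely as a sketch rather than the exact hunk from
the patch, the v2 cut-off amounts to the following (env and sd as in
kernel/sched/fair.c):

	/*
	 * Sketch only: derive the "small imbalance" cut-off purely from
	 * imbalance_pct, ignoring the node size.  With the default NUMA
	 * imbalance_pct of 125 this is 100 / (125 - 100) - 1 = 3, the
	 * value referenced above.
	 */
	unsigned int imbalance_adj = (100 / (env->sd->imbalance_pct - 100)) - 1;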

> > it's a bit weak as imbalance_pct makes hard-coded assumptions on what
> > the tradeoff of cross-domain migration is without considering the hardware.
> >
> > > Using similar to the below
> > >
> > > busiest->group_weight * (env->sd->imbalance_pct - 100) / 100
> > >
> > > would be more meaningful
> > >
> >
> > It's meaningful in some sense from a conceptual point of view but
> > setting the low utilisation cutoff depending on the number of CPUs on
> > the node does not account for any local memory latency vs bandwidth.
> > i.e. on a small or mid-range machine the cutoff will make sense. On
> > larger machines, the cutoff could be at the point where memory bandwidth
> > is saturated leading to a scenario whereby upgrading to a larger
> > machine performs worse than the smaller machine.
> >
> > Much more importantly, doing what you suggest allows an imbalance
> > of more CPUs than are backed by a single LLC. On high-end AMD EPYC 2
> > machines, busiest->group_weight scaled by imbalance_pct spans multiple L3
> > caches. That is going to have side-effects. While I also do not account
> > for the LLC group_weight, it's unlikely the cut-off I used would be
> > larger than an LLC cache on a large machine.
> >
> > These two points are why I didn't take the group weight into account.
> >
> > Now if you want, I can do what you suggest anyway as long as you are happy
> > that the child domain weight is also taken into account and to bound the
>
> Taking the child domain into account makes sense to me, but shouldn't we
> take into account the number of child groups instead? This should
> reflect the number of different LLC caches.

I guess it would, but why is it inherently better? The number of child
groups would yield a similar result if we assume that all the lower
domains have equal weight, because it is then simply the weight of the
SD_NUMA domain divided by the number of child domains.
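
As a purely illustrative example with equal-sized children: a node with
128 CPUs split over 16 identical LLCs of 8 CPUs each gives

	weight of the SD_NUMA domain / number of child groups = 128 / 16 = 8
	weight of one child (LLC) domain                       =            8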

Now, I could be missing something with asymmetric setups. I don't know
if it's possible for child domains of a NUMA domain to have different
sizes. I would be somewhat surprised if they did, but I also do not work
on such machines, nor have I ever accessed one (to my knowledge).

> IIUC your reasoning, we want to make sure that tasks will not start to
> fight over the same resources, which in this case is the connection
> between the LLC cache and memory
>

Yep. I don't want a case where the allowed imbalance causes the load
balancer to have to balance between the lower domains. *Maybe* that is
actually better in some cases, but it's far from intuitive, so I would
prefer that such a change be a patch on its own with a big fat comment
explaining the reasoning behind the additional complexity.
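
Purely as an illustrative sketch of the kind of bound being discussed
below (not the patch under test; busiest, env and the field names are
assumed from kernel/sched/fair.c):

	/*
	 * Sketch only: allow an imbalance derived from imbalance_pct but
	 * never one larger than the first child (LLC) domain, so the
	 * imbalance cannot force balancing between the lower domains.
	 * Halved like the other imbalance calculations.
	 */
	unsigned int allowed, llc_weight;

	allowed = busiest->group_weight * (env->sd->imbalance_pct - 100) / 100;
	llc_weight = env->sd->child ? env->sd->child->span_weight : allowed;
	allowed = min(allowed, llc_weight) >> 1;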

> > largest possible allowed imbalance to deal with the case of a node having
> > multiple small LLC caches. That means that some machines will be using the
> > size of the node and some machines will use the size of an LLC. It's less
> > predictable overall as some machines will be "special" relative to others
> > making it harder to reproduce certain problems locally but it would take
> > imbalance_pct into account in a way that you're happy with.
> >
> > Also bear in mind that whether LLC is accounted for or not, the final
> > result should be halved, similarly to the other imbalance calculations,
> > to avoid over or under load balancing.
> >
> > > Or you could use the util_avg so you would take into account whether
> > > the tasks are short-running or long-running ones
> > >
> >
> > util_avg can be skewed if there are big outliers. Even then, it's not
> > a great metric for the low utilisation cutoff. Large numbers of mostly
> > idle but running tasks would be treated similarly to small numbers of
> > fully active tasks. It's less predictable and harder to reason about how
>
> Yes, but this also has the advantage of reflecting more accurately how
> the system is used.
> With nr_running, we consider that mostly idle and fully active tasks
> have exactly the same impact on the memory
>

Maybe, maybe not. When there is spare capacity in the domain overall and
we are only interested in the low utilisation case, it seems to me that
we should use the most obvious and understandable metric. Now, if we
were talking about a nearly fully loaded domain or an overloaded domain,
then I would fully agree with you, as balancing utilisation in that case
becomes critical.
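
For illustration only (again not the patch under test, and the field
names are assumed), the nr_running based low utilisation check amounts
to something like this, with imbalance_adj being the small cut-off from
the earlier sketch:

	/*
	 * Sketch only: when the busiest group has just a handful of
	 * running tasks, ignore the small imbalance instead of migrating
	 * tasks to a remote node.  nr_running is used rather than
	 * util_avg so the cut-off does not depend on how active each
	 * individual task happens to be.
	 */
	if (busiest->sum_nr_running <= imbalance_adj) {
		env->imbalance = 0;
		return;
	}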

> > load balancing behaves across a variety of workloads.
> >
> > Based on what you suggest, the result looks like this (build tested
> > only)
>
> I'm going to give this patch a try
>

Thanks. I've queued the same patch on one machine to see what falls out.
I don't want to tie up half my test grid until we get some sort of
consensus.

--
Mel Gorman
SUSE Labs