Re: [PATCH] sched, fair: Allow a small load imbalance between low utilisation SD_NUMA domains v3

From: Mel Gorman
Date: Tue Jan 07 2020 - 05:16:54 EST

On Tue, Jan 07, 2020 at 10:43:08AM +0100, Vincent Guittot wrote:
> > > > > It's not directly related to the number of CPUs in the node. Are you
> > > > > thinking of busiest->group_weight?
> > > >
> > > > I am, because as it is right now that if condition
> > > > looks like it might never be true for imbalance_pct 115.
> > > >
> > > > Presumably you put that check there for a reason, and
> > > > would like it to trigger when the amount by which a node
> > > > is busy is less than 2 * (imbalance_pct - 100).
> > >
> > >
> > > If three per cent can make any sense in helping determine utilisation
> > > low then the busy load has to meet
> > >
> > > busiest->sum_nr_running < max(3, cpus in the node / 32);
> > >
> >
> > Why 3% and why would the low utilisation cut-off depend on the number of
> But in the same way, why only 6 tasks ? which is the value with
> default imbalance_pct ?

I laid this out in another mail sent based on timing, so I would only be
repeating myself here other than to say it's predictable across machines.

> I expect a machine with 128 CPUs to have more bandwidth than a machine
> with only 32 CPUs and as a result to allow more imbalance

I would expect so too, with the caveat that there can be more memory
channels within a node, so positioning does matter, but we can't take
everything into account without creating a convoluted mess. Worse, we have
no decent method for estimating bandwidth as it depends on the reference
pattern and scheduler domains do not currently take memory channels into
account. Maybe they should but that's a whole different discussion that
we do not want to get into right now.
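
To make the scaled cut-off quoted above concrete, here is a minimal
userspace sketch of max(3, cpus in the node / 32). It is not kernel code
and not the patch under review; the helper names scaled_cutoff() and
low_utilisation() are hypothetical and exist only for illustration.

```c
#include <stdbool.h>

/*
 * Hypothetical helper, for illustration only: the scaled cut-off
 * quoted earlier in the thread, max(3, cpus in the node / 32).
 */
static unsigned int scaled_cutoff(unsigned int cpus_in_node)
{
	unsigned int scaled = cpus_in_node / 32;	/* integer division */

	return scaled > 3 ? scaled : 3;
}

/*
 * Hypothetical helper: true if the busiest group would count as
 * "low utilisation" under that rule, i.e.
 * sum_nr_running < max(3, cpus in the node / 32).
 */
static bool low_utilisation(unsigned int sum_nr_running,
			    unsigned int cpus_in_node)
{
	return sum_nr_running < scaled_cutoff(cpus_in_node);
}
```

With this rule a 32-CPU node gets a cut-off of 3 running tasks and a
128-CPU node gets 4, so the threshold only starts to grow once a node
has more than 96 CPUs.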

> Maybe the number of running tasks (or idle cpus) is not the right
> metric to choose if we can allow a small degree of imbalance because
> this doesn't take into account whether the tasks are long running or
> short running ones

I think running tasks is at least the least bad metric. Idle CPUs get
caught up in corner cases with bindings and util_avg can be skewed by
outliers. Running tasks is a sensible starting point until there is a
concrete use case that shows it is unworkable. Let's see what you think of
the other untested patch I posted that takes the group weight and child
domain weight into account.

Mel Gorman