Re: [PATCH] sched, fair: Allow a small degree of load imbalance between SD_NUMA domains v2

From: Mel Gorman
Date: Mon Jan 06 2020 - 09:52:33 EST


Sorry I sent out v3 before seeing this email as my mail only synchronises
periodically.

On Mon, Jan 06, 2020 at 02:55:00PM +0100, Vincent Guittot wrote:
> > - return;
> > - }
> > + } else {
> >
> > - /*
> > - * If there is no overload, we just want to even the number of
> > - * idle cpus.
> > - */
> > - env->migration_type = migrate_task;
> > - env->imbalance = max_t(long, 0, (local->idle_cpus -
> > + /*
> > + * If there is no overload, we just want to even the number of
> > + * idle cpus.
> > + */
> > + env->migration_type = migrate_task;
> > + env->imbalance = max_t(long, 0, (local->idle_cpus -
> > busiest->idle_cpus) >> 1);
> > + }
> > +
> > + /* Consider allowing a small imbalance between NUMA groups */
> > + if (env->sd->flags & SD_NUMA) {
> > + long imbalance_adj, imbalance_max;
> > +
> > + /*
> > + * imbalance_adj is the allowable degree of imbalance
> > + * to exist between two NUMA domains. imbalance_pct
> > + * is used to estimate the number of active tasks
> > + * needed before memory bandwidth may be as important
> > + * as memory locality.
> > + */
> > + imbalance_adj = (100 / (env->sd->imbalance_pct - 100)) - 1;
>
> This looks weird to me because you use imbalance_pct, which is
> meaningful only when comparing ratios, to define a number that is
> then compared to a number of tasks without taking into account the
> weight of the node. So whatever the node size, 32 or 128 CPUs,
> imbalance_adj will be the same: 3 with the default imbalance_pct of
> 125 at the NUMA level, AFAICT.
>

The intent in this version was to only cover the low utilisation case
regardless of the NUMA node size. There were too many corner cases where
the tradeoff between local memory latency and local memory bandwidth
cannot be quantified. See Srikar's report as an example, but it's a general
problem. The use of imbalance_pct was simply to find the smallest number of
running tasks where (imbalance_pct - 100) percent corresponds to one running
task, and so limit the patch to the low utilisation case only. It could
simply be hard-coded, but that would ignore cases where an architecture
overrides imbalance_pct. I'm open to suggestions on how we could identify
the point where imbalances can be ignored without hitting other corner cases.
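
As a concrete sketch of that arithmetic (a standalone userspace program,
not the patch itself; it assumes the default NUMA imbalance_pct of 125 you
mention):

#include <stdio.h>

int main(void)
{
	long imbalance_pct = 125;	/* default for SD_NUMA levels */
	long imbalance_adj;

	/*
	 * 125 - 100 = 25, and 25% of 4 running tasks is one whole task,
	 * so 100 / 25 = 4 is the smallest task count where the usual
	 * imbalance_pct margin amounts to a full task. Allow an
	 * imbalance below that point, i.e. up to 3 tasks.
	 */
	imbalance_adj = (100 / (imbalance_pct - 100)) - 1;

	printf("imbalance_adj = %ld\n", imbalance_adj);	/* prints 3 */
	return 0;
}

So with the default 125 the cutoff works out to 3 tasks regardless of node
size, which is deliberate here.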

> > +
> > + /*
> > + * Allow small imbalances when the busiest group has
> > + * low utilisation.
> > + */
> > + imbalance_max = imbalance_adj << 1;
>
> Why do you add this shift?
>

For very low utilisation, there is no balancing between nodes. Slightly
above that, there is limited balancing. Beyond that, the load balancing
behaviour is unchanged because I believe we cannot determine whether memory
latency or memory bandwidth is more important for arbitrary workloads.
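
As a rough sketch of those three regimes (illustrative only, not the patch
code, and the exact comparisons in the patch may differ; it assumes the
default imbalance_pct of 125, so imbalance_adj is 3 and imbalance_max is 6):

#include <stdio.h>

int main(void)
{
	long imbalance_adj = 3;				/* from the calculation above */
	long imbalance_max = imbalance_adj << 1;	/* 6 */
	long nr;

	for (nr = 1; nr <= 8; nr++) {
		const char *mode;

		if (nr <= imbalance_adj)
			mode = "imbalance ignored";	/* very low utilisation */
		else if (nr < imbalance_max)
			mode = "limited balancing";	/* slightly above that */
		else
			mode = "behaviour unchanged";

		printf("%ld busy tasks: %s\n", nr, mode);
	}
	return 0;
}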

--
Mel Gorman
SUSE Labs