Re: [PATCH] sched, fair: Allow a small degree of load imbalance between SD_NUMA domains
From: Mel Gorman
Date: Thu Dec 19 2019 - 10:58:23 EST
On Thu, Dec 19, 2019 at 04:41:17PM +0100, Vincent Guittot wrote:
> > I don't understand. If I move SD_NUMA checks above the imbalance
> > calculation, how do I know whether the imbalance should be ignored?
>
> You are only clearing env->imbalance before returning if the condition
> between sum_nr_running with weight doesn't match so you don't care about
> what will be the value of env->imbalance in the other case so you can have
>
> if ((env->sd->flags & SD_NUMA) &&
> ( allow some degrees of imbalance )) {
> env->imbalance = 0
> return;
> }
>
> if (busiest->group_weight == 1 || sds->prefer_sibling) {
> unsigned int nr_diff = busiest->sum_nr_running;
> /*
> * When prefer sibling, evenly spread running tasks on
> * groups.
> */
> env->migration_type = migrate_task;
> lsub_positive(&nr_diff, local->sum_nr_running);
> env->imbalance = nr_diff >> 1;
> return;
> }
>
> /*
> * If there is no overload, we just want to even the number of
> * idle cpus.
> */
> env->migration_type = migrate_task;
> env->imbalance = max_t(long, 0, (local->idle_cpus -
> busiest->idle_cpus) >> 1);
> return;
> }
>
Ok, that's clear. In an earlier version, I was not just resetting it, I
was adjusting the estimated imbalance but it was fragile. I can move the
check as you suggest.
> > > > @@ -8690,6 +8686,38 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
> > > > env->migration_type = migrate_task;
> > > > env->imbalance = max_t(long, 0, (local->idle_cpus -
> > > > busiest->idle_cpus) >> 1);
> > > > +
> > > > +out_spare:
> > > > + /*
> > > > + * Whether balancing the number of running tasks or the number
> > > > + * of idle CPUs, consider allowing some degree of imbalance if
> > > > + * migrating between NUMA domains.
> > > > + */
> > > > + if (env->sd->flags & SD_NUMA) {
> > > > + unsigned int imbalance_adj, imbalance_max;
> > > > +
> > > > + /*
> > > > + * imbalance_adj is the allowable degree of imbalance
> > > > + * to exist between two NUMA domains. It's calculated
> > > > + * relative to imbalance_pct with a minimum of two
> > > > + * tasks or idle CPUs.
> > > > + */
> > > > + imbalance_adj = (busiest->group_weight *
> > > > + (env->sd->imbalance_pct - 100) / 100) >> 1;
> > > > + imbalance_adj = max(imbalance_adj, 2U);
> > > > +
> > > > + /*
> > > > + * Ignore imbalance unless busiest sd is close to 50%
> > > > + * utilisation. At that point balancing for memory
> > > > + * bandwidth and potentially avoiding unnecessary use
> > > > + * of HT siblings is as relevant as memory locality.
> > > > + */
> > > > + imbalance_max = (busiest->group_weight >> 1) - imbalance_adj;
> > > > + if (env->imbalance <= imbalance_adj &&
> > > > + busiest->sum_nr_running < imbalance_max) {i
> > >
> > > Shouldn't you consider the number of busiest->idle_cpus instead of the busiest->sum_nr_running ?
> > >
> >
> > Why? CPU affinity could have stacked multiple tasks on one CPU where
> > as I'm looking for a proxy hint on the amount of bandwidth required.
> > sum_nr_running does not give me an accurate estimate but it's better than
> > idle cpus.
>
> Because even if you have multiple tasks on one CPU, only one will run at a
> time on the CPU and others will wait so the bandwidth is effectively link to
> the number of running CPUs more than number of runnable tasks
>
Ok, I can try that. There is the corner case where tasks can be stacked on
the one CPU without affinities being involved but it's rare. That might
change again in the future if unbound workqueues get special cased to
allow a wakee to stack on top of the workqueues CPU as it's essentially
sync but it's not likely to make that much of a difference.
--
Mel Gorman
SUSE Labs