Re: [PATCH 2/2 v2] sched: use load_avg for selecting idlest group

From: Morten Rasmussen
Date: Wed Nov 30 2016 - 09:23:38 EST


On Wed, Nov 30, 2016 at 02:49:11PM +0100, Vincent Guittot wrote:
> On 30 November 2016 at 13:49, Morten Rasmussen <morten.rasmussen@xxxxxxx> wrote:
> > On Fri, Nov 25, 2016 at 04:34:33PM +0100, Vincent Guittot wrote:
> >> find_idlest_group() only compares the runnable_load_avg when looking for
> >> the least loaded group. But on fork intensive use case like hackbench
>
> [snip]
>
> >> + min_avg_load = avg_load;
> >> + idlest = group;
> >> + } else if ((runnable_load < (min_runnable_load + imbalance)) &&
> >> + (100*min_avg_load > imbalance_scale*avg_load)) {
> >> + /*
> >> + * The runnable loads are close so we take
> >> + * into account blocked load through avg_load
> >> + * which is blocked + runnable load
> >> + */
> >> + min_avg_load = avg_load;
> >> idlest = group;
> >> }
> >>
> >> @@ -5470,13 +5495,16 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,
> >> goto no_spare;
> >>
> >> if (this_spare > task_util(p) / 2 &&
> >> - imbalance*this_spare > 100*most_spare)
> >> + imbalance_scale*this_spare > 100*most_spare)
> >> return NULL;
> >> else if (most_spare > task_util(p) / 2)
> >> return most_spare_sg;
> >>
> >> no_spare:
> >> - if (!idlest || 100*this_load < imbalance*min_load)
> >> + if (!idlest ||
> >> + (min_runnable_load > (this_runnable_load + imbalance)) ||
> >> + ((this_runnable_load < (min_runnable_load + imbalance)) &&
> >> + (100*min_avg_load > imbalance_scale*this_avg_load)))
> >
> > I don't get why you have imbalance_scale applied to this_avg_load and
> > not min_avg_load. IIUC, you end up preferring non-local groups?
>
> In fact, I have keep the same condition that is used when looping the group.

The logic is inverted compared to the group loop as there you after
picking a better group. There it makes sense that you accept groups with
a slightly higher runnable_load if they have a much better avg_load.

Here you are after rejecting 'idlest' group if it isn't significantly
better than the local group, so the conditions have to be inverted.

> You're right that we should prefer local rq if avg_load are close and
> test the condition
> (100*this_avg_load > imbalance_scale*min_avg_load) instead

I would argue you should switch the inequality operator around:

(100*this_avg_load < imbalance_scale*min_avg_load)

So it becomes true unless min_avg_load is significantly less than
this_avg_load meaning that we will ignore 'idlest' and return NULL. This
is also in line with old condition (100*this_load < imbalance*min_load).

> >
> > If we take the example where this_runnable_load == min_runnable_load and
> > this_avg_load == min_avg_load. In this case, and in cases where
> > min_avg_load is slightly bigger than this_avg_load, we end up picking
> > the 'idlest' group even if the local group is equally good or even
> > slightly better?
> >
> >> return NULL;
> >> return idlest;
> >> }
> >
> > Overall, I like that load_avg is being brought in to make better
> > decisions. The variable naming is a bit confusing. For example,
> > runnable_load is a capacity-average just like avg_load. 'imbalance' is
> > now an absolute capacity-average margin, but it is hard to come up with
> > better short alternatives.
> >
> > Although 'imbalance' is based on the existing imbalance_pct, I find
> > somewhat arbitrary. Why is (imbalance_pct-100)*1024/100 a good absolute
> > margin to define the interval where we want to consider load_avg? I
> > guess it is case of 'we had to pick some value', which we have done in
> > many other places. Though, IMHO, it is a bit strange that imbalance_pct
> > is used in two different ways to bias comparison in the same function.
>
> I see imbalance_pct like the definition of the acceptable imbalance %
> for a sched_domain. This % is then used against the current load or to
> define an absolute value.
>
> > It used to be only used as a scaling factor (now imbalance_scale), while
> > this patch proposes to use it for computing an absolute margin
> > (imbalance) as well. It is not major issue, but it is not clear why it
> > is used differently to compare two metrics that are relatively closely
> > related.
>
> In fact, scaling factor (imbalance) doesn't work well with small
> value. As an example, the use of a scaling factor fails as soon as
> this_runnable_load = 0 because we always selected local rq even if
> min_runnable_load is only 1 which doesn't really make sense because
> they are just the same.

Agreed, using the operator for scaling is not ideal for low load
scenarios. The spare-capacity checking is fixing that issue for some
scenarios but not all.

Could we mention these points somewhere so people can be reminded later?
In the commit message, if not as a comment in the code?