Re: [PATCH] fix scheduler regression from "sched/fair: Rework load_balance()"

From: Vincent Guittot
Date: Mon Oct 26 2020 - 10:57:18 EST


On Mon, 26 Oct 2020 at 15:38, Rik van Riel <riel@xxxxxxxxxxx> wrote:
>
> On Mon, 2020-10-26 at 15:24 +0100, Vincent Guittot wrote:
> > On Monday 26 Oct 2020 at 08:45:27 (-0400), Chris Mason wrote:
> > > On 26 Oct 2020, at 4:39, Vincent Guittot wrote:
> > >
> > > > Hi Chris
> > > >
> > > > On Sat, 24 Oct 2020 at 01:49, Chris Mason <clm@xxxxxx> wrote:
> > > > > Hi everyone,
> > > > >
> > > > > We’re validating a new kernel in the fleet, and compared with
> > > > > v5.2,
> > > >
> > > > Which version are you using?
> > > > Several improvements have been added since v5.5 and the rework of
> > > > load_balance.
> > >
> > > We’re validating v5.6, but all of the numbers referenced in this
> > > patch are
> > > against v5.9. I usually try to back port my way to victory on this
> > > kind of
> > > thing, but mainline seems to behave exactly the same as
> > > 0b0695f2b34a wrt
> > > this benchmark.
> >
> > ok. Thanks for the confirmation
> >
> > I have been able to reproduce the problem on my setup.
> >
> > Could you try the fix below ?
> >
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -9049,7 +9049,8 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
> > * emptying busiest.
> > */
> > if (local->group_type == group_has_spare) {
> > - if (busiest->group_type > group_fully_busy) {
> > + if ((busiest->group_type > group_fully_busy) &&
> > + (busiest->group_weight > 1)) {
> > /*
> > * If busiest is overloaded, try to fill spare
> > * capacity. This might end up creating spare capacity
> >
> >
> > When we calculate an imbalance at the smallest level, i.e. between CPUs
> > (group_weight == 1), we should try to spread tasks across CPUs instead of
> > trying to fill spare capacity.
>
> Should we also spread tasks when balancing between
> multi-threaded CPU cores on the same socket?

My explanation is probably misleading. In fact, we already try to
spread tasks. We just use spare capacity instead of nr_running when
there is more than 1 CPU in the group and the group is overloaded.
Using spare capacity is a bit more conservative because it tries not
to pull more utilization than the spare capacity available.
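
To make the two cases concrete, here is a rough user-space sketch of
that decision. It is a simplified model, not the actual fair.c code:
the struct, enum and function names are made up for illustration, and
it only covers the case where the local group has spare capacity.

```c
#include <assert.h>

/* Hypothetical, simplified stand-ins for the scheduler's group classes. */
enum group_type { GROUP_HAS_SPARE, GROUP_FULLY_BUSY, GROUP_OVERLOADED };
enum migration_type { MIGRATE_UTIL, MIGRATE_TASK };

struct group_stats {
	enum group_type type;
	unsigned int weight;      /* number of CPUs in the group */
	unsigned int nr_running;  /* runnable tasks in the group */
	unsigned long capacity;   /* total compute capacity of the group */
	unsigned long util;       /* utilization currently consumed */
};

struct imbalance {
	enum migration_type type;
	unsigned long amount;     /* utilization or number of tasks to pull */
};

/* Assumes the local group has spare capacity, as in the patched branch. */
static struct imbalance pick_imbalance(const struct group_stats *local,
				       const struct group_stats *busiest)
{
	struct imbalance imb;

	assert(local->type == GROUP_HAS_SPARE);

	if (busiest->type > GROUP_FULLY_BUSY && busiest->weight > 1) {
		/*
		 * Overloaded group with several CPUs: be conservative and
		 * pull no more utilization than local can absorb.
		 */
		imb.type = MIGRATE_UTIL;
		imb.amount = local->capacity > local->util ?
			     local->capacity - local->util : 0;
		return imb;
	}

	/*
	 * Otherwise (including single-CPU groups): even out the number of
	 * running tasks by moving half of the difference.
	 */
	imb.type = MIGRATE_TASK;
	imb.amount = busiest->nr_running > local->nr_running ?
		     (busiest->nr_running - local->nr_running) / 2 : 0;
	return imb;
}
```

In this model, an idle local CPU balancing against a single overloaded
CPU with 3 tasks asks for (3 - 0) / 2 = 1 task, whereas with 2-CPU
groups it asks for utilization, capped at local's spare capacity.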

>
> Say we have groups of CPUs
> (0, 2) and CPUs (1, 3),
> with CPU 2 idle, and 3 tasks spread between CPUs
> 1 & 3.
>
> Since they are all on the same LLC, and the task
> wakeup code has absolutely no hesitation in moving
> them around, should the load balancer also try to
> keep tasks within a socket spread across all CPUs?
>
> --
> All Rights Reversed.