Re: [PATCH V2] sched: Improve load balancing in the presence of idle CPUs

From: Preeti U Murthy
Date: Mon Mar 30 2015 - 03:26:45 EST


Hi Morten,

On 03/27/2015 11:26 PM, Morten Rasmussen wrote:
>
> I agree that the current behaviour is undesirable and should be fixed,
> but IMHO waking up all idle cpus can not be justified. It is only one
> additional cpu though with your patch so it isn't quite that bad.
>
> I agree that it is hard to predict how many additional cpus you need,
> but I don't think you necessarily need that information as long as you
> start by filling up the cpu that was kicked to do the
> nohz_idle_balance() first.
>
> You would also solve your problem if you removed the ability for the cpu
> to bail out after balancing itself and force it to finish the job. It
> would mean harming tasks that were pulled to the balancing cpu as they
> would have to wait to be scheduled until the nohz_idle_balance() has
> completed. It could be a price worth paying.

But how would this prevent waking up idle CPUs? You would still end up
waking up all idle CPUs, wouldn't you?

>
> An alternative could be to let the balancing cpu balance itself first
> and bail out as it currently does, but let it kick the next nohz_idle
> cpu to continue the job if it thinks there is more work to be done. So
> you would get a chain of kicks that would stop when there is nothing
> more to be done. It isn't quite as fast as your solution as it would

I am afraid there is more to this. If a given CPU is unable to pull
tasks, it could mean that it is an unworthy destination CPU. But it does
not mean that the other idle CPUs are unworthy of balancing too.

So if the ILB CPU stops waking up idle CPUs when it has nothing to pull,
we will end up hurting load balancing. Take for example the scenario
described in the changelog. The idle CPUs within a NUMA node may find
the load balanced among themselves and hence refrain from pulling any load.
If these ILB CPUs stop nohz idle load balancing at this point, the load
will never get spread across nodes.

If, on the other hand, we keep kicking idle CPUs to carry on idle load
balancing, the wakeup scenario will be no better than it is with this patch.
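
To make the trade-off concrete, here is a toy Python model of the two
policies. It is only a sketch, not kernel code: the names Cpu, try_pull,
ilb, make_scenario and node_loads, the two-node layout and the "difference
of at least 2 tasks" pull rule are all assumptions made up for illustration.
The idle CPUs are visited in CPU order, so the idle CPUs of the loaded node
are tried first.

```python
from dataclasses import dataclass


@dataclass
class Cpu:
    node: int  # NUMA node this CPU belongs to
    load: int  # number of runnable tasks (toy metric)


def try_pull(cpus, dst):
    """Toy idle balance on behalf of cpus[dst]; returns True if a task moved."""
    me = cpus[dst]
    # Node-level pass: pull only from a sibling that has at least 2 tasks.
    local = [c for c in cpus if c.node == me.node and c is not me]
    src = max(local, key=lambda c: c.load, default=None)
    if src is not None and src.load >= 2:
        src.load -= 1
        me.load += 1
        return True
    # Cross-node pass: pull only if a remote node carries >= 2 more tasks.
    my_total = sum(c.load for c in cpus if c.node == me.node)
    for node in sorted({c.node for c in cpus} - {me.node}):
        remote = [c for c in cpus if c.node == node]
        if sum(c.load for c in remote) - my_total >= 2:
            src = max(remote, key=lambda c: c.load)
            src.load -= 1
            me.load += 1
            return True
    return False


def ilb(cpus, stop_when_nothing_pulled):
    """Walk the initially idle CPUs in order; return how many were visited."""
    idle = [i for i, c in enumerate(cpus) if c.load == 0]
    visited = 0
    for i in idle:
        visited += 1
        pulled = try_pull(cpus, i)
        if not pulled and stop_when_nothing_pulled:
            break  # the chain of kicks ends at the first CPU that pulls nothing
    return visited


def make_scenario():
    """Node 0: four busy CPUs and two idle ones. Node 1: six idle CPUs."""
    node0 = [Cpu(0, 1) for _ in range(4)] + [Cpu(0, 0) for _ in range(2)]
    node1 = [Cpu(1, 0) for _ in range(6)]
    return node0 + node1


def node_loads(cpus):
    """Total number of tasks per node, ordered by node id."""
    nodes = sorted({c.node for c in cpus})
    return [sum(c.load for c in cpus if c.node == n) for n in nodes]
```

In this model, ilb(make_scenario(), True) stops at the first idle CPU of
node 0, which finds nothing to pull, so node 1 never receives a task.
ilb(make_scenario(), False) does spread the load across both nodes, but only
by visiting every idle CPU, which is the same wakeup cost as with this patch.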

> require an IPI plus wakeup for each cpu to continue the work. But it
> should be much faster than the current code I think.
>
> IMHO it makes more sense to stay with the current scheme of ensuring
> that the kicked cpu is actually used before waking up more cpus and
> instead improve how additional cpus are kicked if they are needed.

It looks more sensible to do this in parallel. The scenario on POWER is
that tasks don't spread out across nodes until 10s after fork. This is
unforgivable and we cannot afford the code to be the way it is today.

Regards
Preeti U Murthy
