Re: [PATCH V2] sched: Improve load balancing in the presence of idle CPUs

From: Morten Rasmussen
Date: Fri Mar 27 2015 - 13:56:20 EST


On Fri, Mar 27, 2015 at 04:46:30PM +0000, Preeti U Murthy wrote:
> Hi Morten,
>
> On 03/27/2015 08:08 PM, Morten Rasmussen wrote:
> > Hi Preeti,
> >
> > On Thu, Mar 26, 2015 at 01:02:44PM +0000, Preeti U Murthy wrote:
> >> Fix this, by checking if a CPU was woken up to do nohz idle load
> >> balancing, before it does load balancing upon itself. This way we allow
> >> idle CPUs across the system to do load balancing which results in
> >> quicker spread of load, instead of performing load balancing within the
> >> local sched domain hierarchy of the ILB CPU alone under circumstances
> >> such as above.
> >>
> >> Signed-off-by: Preeti U Murthy <preeti@xxxxxxxxxxxxxxxxxx>
> >> ---
> >> Changes from V1:
> >> 1. Added relevant comments
> >> 2. Wrapped lines to a fixed width in the changelog
> >>
> >> kernel/sched/fair.c | 8 +++++---
> >> 1 file changed, 5 insertions(+), 3 deletions(-)
> >>
> >> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> >> index bcfe320..8b6d0d5 100644
> >> --- a/kernel/sched/fair.c
> >> +++ b/kernel/sched/fair.c
> >> @@ -7660,14 +7660,16 @@ static void run_rebalance_domains(struct softirq_action *h)
> >> enum cpu_idle_type idle = this_rq->idle_balance ?
> >> CPU_IDLE : CPU_NOT_IDLE;
> >>
> >> - rebalance_domains(this_rq, idle);
> >> -
> >> /*
> >> * If this cpu has a pending nohz_balance_kick, then do the
> >> * balancing on behalf of the other idle cpus whose ticks are
> >> - * stopped.
> >> + * stopped. Do nohz_idle_balance *before* rebalance_domains to
> >> + * give the idle cpus a chance to load balance. Else we may
> >> + * load balance only within the local sched_domain hierarchy
> >> + * and abort nohz_idle_balance altogether if we pull some load.
> >> */
> >> nohz_idle_balance(this_rq, idle);
> >> + rebalance_domains(this_rq, idle);
> >
> > IIUC, this change means that you will always wake up one more cpu than
> > necessary unless you have enough work for all cpus in the system. For
> > example, cpu0 is busy with two tasks and cpu1+2 are nohz_idle. cpu0
> > kicks cpu1 to do a nohz_idle_balance(). With the change it will balance
> > on behalf of cpu2 first and pull one of the tasks from cpu0. When done
> > with nohz_idle_balance() cpu1 has nothing left to pull when balancing
> > itself and goes back to sleep.
> >
> > My concern is that this will increase the number of cpu wakeups quite
> > significantly. Am I missing something?
>
> It's true that we wake up all idle CPUs. But I think we are justified in
> doing so, given that nohz_idle_balance() was deemed necessary. The logic
> behind nohz_idle_balance() as I see it is that all idle CPUs should be
> brought into action when some scheduling group is found to be busy. With
> the current code that does not happen if one of them happens to pull
> tasks. This logic does not make sense to me.
>
> With the current code, I think it is hard to estimate how many idle CPU
> wakeups would be sufficient to balance out the system load. But I
> certainly feel that waking up all of them to perform the load balancing
> that was asked for is far better than waking up none of them. This
> favors performance. I agree the extra wakeups would harm power
> savings, but I would still be fine with it, given the undesirable
> scenario that otherwise occurs, as described in the changelog.

I agree that the current behaviour is undesirable and should be fixed,
but IMHO waking up all idle cpus cannot be justified. With your patch it
is only one additional cpu, though, so it isn't quite that bad.

I agree that it is hard to predict how many additional cpus you need,
but I don't think you necessarily need that information as long as you
start by filling up the cpu that was kicked to do the
nohz_idle_balance().

You would also solve your problem if you removed the ability for the cpu
to bail out after balancing itself and forced it to finish the job. It
would mean harming tasks that were pulled to the balancing cpu, as they
would have to wait to be scheduled until nohz_idle_balance() has
completed. It could be a price worth paying.

An alternative could be to let the balancing cpu balance itself first
and bail out as it currently does, but let it kick the next nohz-idle
cpu to continue the job if it thinks there is more work to be done. You
would get a chain of kicks that stops when there is nothing more to be
done. It isn't quite as fast as your solution, as it would require an
IPI plus a wakeup for each cpu to continue the work, but it should
still be much faster than the current code, I think.

IMHO it makes more sense to stay with the current scheme of ensuring
that the kicked cpu is actually used before waking up more cpus and
instead improve how additional cpus are kicked if they are needed.

Reducing unnecessary wakeups is quite important for energy consumption
and something a lot of effort is being put into. You really don't want
to wake up another cluster/package unnecessarily just because there was
only one nohz-idle cpu left in the previous one, which could have
handled the additional load. It gets even worse if the other cluster is
less energy-efficient (big.LITTLE).

Morten