Re: [PATCH V2] sched: Improve load balancing in the presence of idle CPUs
From: Preeti U Murthy
Date: Tue Mar 31 2015 - 04:06:44 EST
On 03/30/2015 07:15 PM, Vincent Guittot wrote:
> On 26 March 2015 at 14:02, Preeti U Murthy <preeti@xxxxxxxxxxxxxxxxxx> wrote:
>> When a CPU is kicked to do nohz idle balancing, it wakes up to do load
>> balancing on itself, followed by load balancing on behalf of idle CPUs.
>> But it may end up with load after the load balancing attempt on itself.
>> This aborts nohz idle balancing. As a result several idle CPUs are left
>> without tasks till such a time that an ILB CPU finds it unfavorable to
>> pull tasks upon itself. This delays spreading of load across idle CPUs
>> and worse, clutters only a few CPUs with tasks.
>>
>> The effect of the above problem was observed on an SMT8 POWER server
>> with 2 levels of numa domains. As many busy loops as there are cores
>> were spawned. Since load balancing on fork/exec is discouraged across numa
>> domains, all busy loops would start on one of the numa domains. However
>> it was expected that eventually one busy loop would run per core across
>> all domains due to nohz idle load balancing. But it was observed that it
>> took as long as 10 seconds to spread the load across numa domains.
>
> 10sec is quite long. Have you checked how many load balances are needed
> to spread the load across the system? Are you using the default
The issue was that load balancing was not even being *attempted*, for
the reason described above. The ILB CPU would pull load onto itself and
abort nohz_idle_balance(), which was the only way load balancing could be
triggered on the idle CPUs. So it took a long time before load balancing
was invoked on the idle CPUs at all; once it was, the load spread
quickly. There was, after all, a stark imbalance in load across the
nodes.
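
The ordering problem can be illustrated with a small user-space toy model.
This is only a sketch under loud assumptions: it is not kernel code, it
ignores sched domains and SMT topology entirely, and every name in it
(toy_sys, toy_rebalance, toy_nohz_idle_balance, toy_idle_after_kick) is
hypothetical. It only models the rule described above: the ILB CPU stops
balancing on behalf of idle CPUs as soon as it has load of its own.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_NR_CPUS 8

/* Hypothetical toy state: tasks[i] is the number of runnable tasks on CPU i. */
struct toy_sys {
	int tasks[TOY_NR_CPUS];
};

/* Pull one task onto 'cpu' from the busiest CPU if it has at least 2 more. */
static bool toy_rebalance(struct toy_sys *s, int cpu)
{
	int busiest = cpu;

	for (int i = 0; i < TOY_NR_CPUS; i++)
		if (s->tasks[i] > s->tasks[busiest])
			busiest = i;
	if (s->tasks[busiest] - s->tasks[cpu] < 2)
		return false;
	s->tasks[busiest]--;
	s->tasks[cpu]++;
	return true;
}

/* Balance on behalf of the other idle CPUs; stop once the ILB CPU has load. */
static void toy_nohz_idle_balance(struct toy_sys *s, int ilb_cpu)
{
	for (int cpu = 0; cpu < TOY_NR_CPUS; cpu++) {
		if (cpu == ilb_cpu || s->tasks[cpu] > 0)
			continue;
		if (s->tasks[ilb_cpu] > 0)
			break;	/* the ILB CPU got work: abort (toy rule) */
		toy_rebalance(s, cpu);
	}
}

/* Run one ILB kick on CPU 2 and return how many CPUs are still idle. */
static int toy_idle_after_kick(bool nohz_first)
{
	/* Assumed starting point: all load piled on CPUs 0 and 1. */
	struct toy_sys s = { .tasks = { 4, 4, 0, 0, 0, 0, 0, 0 } };
	const int ilb_cpu = 2;
	int idle = 0;

	assert(s.tasks[ilb_cpu] == 0);	/* the kicked CPU starts idle */
	if (nohz_first) {
		/* Order proposed by the patch. */
		toy_nohz_idle_balance(&s, ilb_cpu);
		toy_rebalance(&s, ilb_cpu);
	} else {
		/* Original order: balance on self first. */
		toy_rebalance(&s, ilb_cpu);
		toy_nohz_idle_balance(&s, ilb_cpu);
	}
	for (int i = 0; i < TOY_NR_CPUS; i++)
		idle += (s.tasks[i] == 0);
	return idle;
}
```

In this toy setup, toy_idle_after_kick(false) leaves 5 CPUs idle after the
kick, because the ILB CPU pulls one task and then aborts, while
toy_idle_after_kick(true) leaves none idle. The numbers are specific to the
assumed starting state and say nothing about real hardware.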
> min_interval and max_interval?
> The default period range is [sd_weight : 2*sd_weight]. I don't know how
> many CPUs you have, but as an example, a system made of 128 CPUs will
> do a load balance across the whole system every 128 jiffies if the CPU
> is idle and every 4096 jiffies on a busy CPU. This could explain why
> you need so much time to spread tasks across the system.
Yes, I did suspect this in the beginning, but I could reproduce the
problem even on a machine with few cores.
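
For reference, the arithmetic behind the 128-CPU example quoted above can
be sketched as below. This is an assumption-laden illustration, not the
kernel's interval logic: the helper name toy_balance_interval is
hypothetical, and the factor of 32 between the idle and busy periods is
simply inferred from the quoted figures (4096 / 128).

```c
#include <assert.h>

/* Assumption: busy/idle ratio inferred from the quoted 4096 / 128 example. */
#define TOY_BUSY_FACTOR 32UL

/*
 * Return the minimum balance period, in the same units as the quoted
 * example (jiffies), for a domain spanning 'sd_weight' CPUs.
 */
static unsigned long toy_balance_interval(unsigned long sd_weight, int cpu_busy)
{
	assert(sd_weight > 0);
	return cpu_busy ? sd_weight * TOY_BUSY_FACTOR : sd_weight;
}
```

Under the same assumption, a domain of 8 CPUs would give 8 jiffies when
idle and 256 when busy, which is why a long balance period alone does not
account for the delay on a machine with few cores.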
Regards
Preeti U Murthy
>
> Vincent
>
>>
>> Further investigation showed that this was a consequence of the
>> following:
>>
>> 1. An ILB CPU was chosen from the first numa domain to trigger nohz idle
>> load balancing [Given the experiment, up to 6 CPUs per core could be
>> potentially idle in this domain.]
>>
>> 2. However the ILB CPU would call load_balance() on itself before
>> initiating nohz idle load balancing.
>>
>> 3. Given cores are SMT8, the ILB CPU had enough opportunities to pull
>> tasks from its sibling cores to even out load.
>>
>> 4. Now that the ILB CPU was no longer idle, it would abort nohz idle
>> load balancing.
>>
>> As a result the opportunities to spread load across numa domains were
>> lost until such a time that the cores within the first numa domain had
>> equal number of tasks among themselves. This is a pretty bad scenario,
>> since the cores within the first numa domain would have as many as 4
>> tasks each, while cores in the neighbouring numa domains would all
>> remain idle.
>>
>> Fix this by checking if a CPU was woken up to do nohz idle load
>> balancing, before it does load balancing upon itself. This way we allow
>> idle CPUs across the system to do load balancing which results in
>> quicker spread of load, instead of performing load balancing within the
>> local sched domain hierarchy of the ILB CPU alone under circumstances
>> such as above.
>>
>> Signed-off-by: Preeti U Murthy <preeti@xxxxxxxxxxxxxxxxxx>
>> ---
>> Changes from V1:
>> 1. Added relevant comments
>> 2. Wrapped lines to a fixed width in the changelog
>>
>> kernel/sched/fair.c | 8 +++++---
>> 1 file changed, 5 insertions(+), 3 deletions(-)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index bcfe320..8b6d0d5 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -7660,14 +7660,16 @@ static void run_rebalance_domains(struct softirq_action *h)
>>  	enum cpu_idle_type idle = this_rq->idle_balance ?
>>  						CPU_IDLE : CPU_NOT_IDLE;
>>
>> -	rebalance_domains(this_rq, idle);
>> -
>>  	/*
>>  	 * If this cpu has a pending nohz_balance_kick, then do the
>>  	 * balancing on behalf of the other idle cpus whose ticks are
>> -	 * stopped.
>> +	 * stopped. Do nohz_idle_balance *before* rebalance_domains to
>> +	 * give the idle cpus a chance to load balance. Else we may
>> +	 * load balance only within the local sched_domain hierarchy
>> +	 * and abort nohz_idle_balance altogether if we pull some load.
>>  	 */
>>  	nohz_idle_balance(this_rq, idle);
>> +	rebalance_domains(this_rq, idle);
>>  }
>>
>> /*
>>
>