Re: [QUERY] Confusing usage of rq->nr_running in load balancing

From: Preeti U Murthy
Date: Mon Sep 15 2014 - 00:17:01 EST


Hi Peter, Vincent,

On 09/03/2014 10:28 PM, Vincent Guittot wrote:
> On 3 September 2014 14:21, Preeti U Murthy <preeti@xxxxxxxxxxxxxxxxxx> wrote:
>> Hi,
>
> Hi Preeti,
>
>>
>> There are places in kernel/sched/fair.c in the load balancing part where
>> rq->nr_running is used as opposed to cfs_rq->nr_running. At least I
>> could not make out why the former was used in the following scenarios.
>> It looks to me that it can very well lead to incorrect load balancing.
>> Also I did not pay attention to the numa balancing part of the code
>> while skimming through this file to catch this scenario. There are a
>> couple of places there too which need to be scrutinized.
>>
>> 1. load_balance(): The check (busiest->nr_running > 1)
>> The load balancing would be futile if there are tasks of other
>> scheduling classes, wouldn't it?
>
> agree with you
>
>>
>> 2. active_load_balance_cpu_stop(): A similar check and a similar
>> consequence as 1 here.
>
> agree with you
>
>>
>> 3. nohz_kick_needed() : We check for more than one task on the runqueue
>> and hence trigger load balancing even if there are rt-tasks.
>
> I can see one potential reason why rq->nr_running is interesting: the
> group capacity might have changed because of non-cfs tasks since the
> last load balance. So we need to monitor the change of the groups'
> capacity to ensure that the average load of each group is still at the
> same level.

I tried a patch which changes nr_running to cfs.h_nr_running in the
above three scenarios and found that the performance of the workload
*drops significantly*. The workload that I ran was ebizzy, with a few
threads running at rt priority and a few at normal priority, all in
parallel. This was tried on a 16-core SMT-8 machine. The drop in
performance was around 18% with the patch, across different numbers of
threads.

I figured that it was because if we consider only cfs.h_nr_running in
the above cases, we reduce load balancing attempts even when the
capacity of the cpus to run fair tasks is significantly reduced. For
example if the cpu is running two rt tasks and one fair task, we skip
load balancing altogether with the patch. Besides this, we may end up
doing active load balancing too often.

So I think we are good with nr_running, although that may mean
unnecessary load balancing attempts when only rt tasks are running on
the cpus. But when there is a mix of rt and fair tasks, evaluating
nr_running in the above three scenarios is the better choice, since it
tells us whether the cpus have enough capacity to handle the one fair
task that they may be running (if there are more fair tasks, we load
balance anyway).

As for the usage of nr_running in find_busiest_queue(), we are good
there as Vincent pointed out as below.

"
>
> 8. find_busiest_queue(): This anomaly shows up when we filter against
> rq->nr_running == 1 and imbalance cannot be taken care of by the
> existing task on this rq.

agree with you, even if the test on wl should prevent a wrong decision,
as wl will be zero if no cfs tasks are present

"

So the only changes we require around this are the change of nr_running
to cfs.h_nr_running in update_sg_lb_stats() and cpu_avg_load_per_task(),
which Vincent is already doing in the consolidation of the cpu_capacity
patches; I did not see regressions there during my tests.

Thanks

Regards
Preeti U Murthy
