Re: [RFC PATCH 0/5] enable runnable load avg in load balance

From: Alex Shi
Date: Tue Nov 27 2012 - 03:08:12 EST


On 11/27/2012 02:45 PM, Preeti U Murthy wrote:
> Hi,
> On 11/27/2012 11:44 AM, Alex Shi wrote:
>> On 11/27/2012 11:08 AM, Preeti U Murthy wrote:
>>> Hi everyone,
>>>
>>> On 11/27/2012 12:33 AM, Benjamin Segall wrote:
>>>> So, I've been trying out using the runnable averages for load balance in
>>>> a few ways, but haven't actually gotten any improvement on the
>>>> benchmarks I've run. I'll post my patches once I have the numbers down,
>>>> but it's generally been about half a percent to 1% worse on the tests
>>>> I've tried.
>>>>
>>>> The basic idea is to use (cfs_rq->runnable_load_avg +
>>>> cfs_rq->blocked_load_avg) (which should be equivalent to doing
>>>> load_avg_contrib on the rq) for cfs_rqs and possibly the rq, and
>>>> p->se.load.weight * p->se.avg.runnable_avg_sum / period for tasks.
>>>
>>> Why should cfs_rq->blocked_load_avg be included when calculating the load
>>> on the rq? It does not contribute to the active load of the cpu, right?
>>>
>>> When a task goes to sleep, its load is removed from cfs_rq->load.weight
>>> as well, in account_entity_dequeue(). This means the load balancer
>>> considers a sleeping entity as *not* contributing to the active runqueue
>>> load. So shouldn't the new metric consider cfs_rq->runnable_load_avg alone?
>>>>
>>>> I have not yet tried including wake_affine, so this has just involved
>>>> h_load (task_load_down and task_h_load), as that makes everything
>>>> (besides wake_affine) be based on either the new averages or the
>>>> rq->cpu_load averages.
>>>>
>>>
>>> Yeah, I have been trying to view the performance as well, but with
>>> cfs_rq->runnable_load_avg as the rq load contribution and the task load
>>> the same as mentioned above. I have not completed my experiments, but I
>>> would expect some significant performance difference due to the scenario
>>> below:
>>>
>>>                        Task3(10% task)
>>> Task1(100% task)       Task4(10% task)
>>> Task2(100% task)       Task5(10% task)
>>> ---------------        ----------------       ----------
>>> CPU1                   CPU2                   CPU3
>>>
>>> When cpu3 triggers load balancing:
>>>
>>> CASE1:
>>> without PJT's metric the following loads will be perceived
>>> CPU1->2048
>>> CPU2->3042
>>> Therefore CPU2 might be relieved of one task to result in:
>>>
>>>
>>> Task1(100% task)       Task4(10% task)
>>> Task2(100% task)       Task5(10% task)        Task3(10% task)
>>> ---------------        ----------------       ----------
>>> CPU1                   CPU2                   CPU3
>>>
>>> CASE2:
>>> with PJT's metric the following loads will be perceived
>>> CPU1->2048
>>> CPU2->1022
>>> Therefore CPU1 might be relieved of one task to result in:
>>>
>>>                        Task3(10% task)
>>>                        Task4(10% task)
>>> Task2(100% task)       Task5(10% task)        Task1(100% task)
>>> ---------------        ----------------       ----------
>>> CPU1                   CPU2                   CPU3
>>>
>>>
>>> The differences between the above two scenarios include:
>>>
>>> 1. Reduced latency for Task1 in CASE2, which is the right task to be moved
>>> in the above scenario.
>>>
>>> 2. Even though in the former case CPU2 is relieved of one task, it's of no
>>> use if Task3 is going to sleep most of the time. This might result in
>>> more load balancing on behalf of cpu3.
>>>
>>> What do you guys think?
>>
>> It looks fine. Just one question about CASE 1.
>> Usually cpu2, with three 10% load tasks, will show nr_running == 0 about
>> 70% of the time. So how do you make rq->nr_running = 3 always?
>>
>> I guess in most cases load balance will pull task1 or task2 to cpu2 or
>> cpu3, rather than produce the result of CASE 1.
>
> That's right, Alex. Most of the time the nr_running on CPU2 will be shown
> to be 0, or perhaps 1 or 2. But whether you use PJT's metric or not, the
> load balancer in such circumstances will behave the same, as you have
> rightly pointed out: pull task1/2 to cpu2/3.
>
> But the issue usually arises when all three wake up at the same time on
> cpu2, portraying wrongly that the load is 3042 if PJT's metric is not
> used. This could lead to load balancing one of these short-running tasks,
> as shown by CASE1. This is the situation where, in my opinion, PJT's metric
> could make a difference.

Sure. And it would be perfect if you could find an appropriate benchmark to
support it.
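
For reference, the arithmetic behind the two cases can be sketched as below.
This is only a back-of-envelope model, not kernel code: it assumes every task
is nice-0 with a load weight of 1024, and it approximates the runnable
average of a task as weight * runnable_percent / 100. The names
NICE_0_WEIGHT, instantaneous_load, runnable_load and busiest are made up for
the illustration. Because the model is simplified, the absolute values need
not match the figures quoted in the thread; what matters is which CPU looks
busiest under each metric.

```python
NICE_0_WEIGHT = 1024  # assumption: load weight of a nice-0 task


def instantaneous_load(tasks):
    """Sum the weights of all tasks queued at the same moment (CASE1 view)."""
    return sum(weight for weight, _ in tasks)


def runnable_load(tasks, period=100):
    """Scale each weight by the share of the period the task is runnable
    (CASE2 view). Integer math, so small rounding losses are expected."""
    return sum(weight * runnable // period for weight, runnable in tasks)


def busiest(loads):
    """Return the name of the CPU with the highest perceived load."""
    return max(loads, key=loads.get)


# Each task is (weight, percent of time runnable), as in the scenario:
# two 100% tasks on CPU1, three 10% tasks woken together on CPU2, CPU3 idle.
cpus = {
    "CPU1": [(NICE_0_WEIGHT, 100), (NICE_0_WEIGHT, 100)],
    "CPU2": [(NICE_0_WEIGHT, 10)] * 3,
    "CPU3": [],
}

case1 = {cpu: instantaneous_load(tasks) for cpu, tasks in cpus.items()}
case2 = {cpu: runnable_load(tasks) for cpu, tasks in cpus.items()}

print("CASE1 loads:", case1, "busiest:", busiest(case1))
print("CASE2 loads:", case2, "busiest:", busiest(case2))
```

With the plain weight sum, CPU2 looks busiest when its three short tasks are
queued together; with the runnable-scaled sum, CPU1 does, which is the
difference a benchmark would need to expose.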
>
> Regards
> Preeti U Murthy
>
