Re: [RFCv3 PATCH 12/48] sched: Make usage tracking cpu scale-invariant
From: Dietmar Eggemann
Date: Wed May 06 2015 - 05:53:19 EST
On 03/05/15 07:27, pang.xunlei@xxxxxxxxxx wrote:
> Hi Dietmar,
>
> Dietmar Eggemann <dietmar.eggemann@xxxxxxx> wrote 2015-03-24 AM 03:19:41:
>>
>> Re: [RFCv3 PATCH 12/48] sched: Make usage tracking cpu scale-invariant
[...]
>> In the previous patch-set https://lkml.org/lkml/2014/12/2/332 we
>> cpu-scaled both (sched_avg::runnable_avg_sum (load) and
>> sched_avg::running_avg_sum (utilization)) but during the review Vincent
>> pointed out that a cpu-scaled invariant load signal messes up
>> load-balancing based on s[dg]_lb_stats::avg_load in overload scenarios.
>>
>> avg_load = load/capacity and load can't be simply replaced here by
>> 'cpu-scale invariant load' (which is load*capacity).
>
> I can't see why it shouldn't.
>
> For "avg_load = load/capacity", "avg_load" stands for how busy the cpu
> is; it is actually a value relative to its capacity. The system is seen
> as balanced when one task runs on a 512-capacity cpu contributing 50%
> usage and two identical tasks run on the 1024-capacity cpu contributing
> 50% usage.
> "capacity" in this formula contains uarch capacity, "load" in this formula
> must be an absolute real load, not relative.
>
> But with current kernel implementation, "load" computed without this patch
> is a relative value. For example, when one task (1024 weight) runs on a
> 1024 capacity CPU, it gets a 256 load contribution (25% on this CPU). When
> it runs on a 512 capacity CPU, it gets a 512 load contribution (50% on this
> CPU).
> See, currently runnable "load" is relative, so "avg_load" is actually wrong
> and its value equals that of "load". So I think the runnable load should be
> made cpu scale-invariant as well.
>
> Please correct me if I am wrong.

Cpu-scaled load leads to wrong lb decisions in overload scenarios:
(1) Overload example taken from email thread between Vincent and Morten:
https://lkml.org/lkml/2014/12/30/114
7 always running tasks, 4 on cluster 0, 3 on cluster 1:

                cluster 0        cluster 1
  capacity      1024 (2*512)     1024 (1*1024)
  load          4096             3072
  scale_load    2048             3072

Simply using cpu-scaled load in the existing lb code would declare
cluster 1 busier than cluster 0, although the compute capacity budget
for one task is higher on cluster 1 (1024/3 = 341) than on cluster 0
(2*512/4 = 256).
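
The arithmetic of example (1) can be checked with a small sketch. It assumes
avg_load = load * 1024 / capacity (the "avg_load = load/capacity" from above,
scaled by 1024); the dictionary layout and the helper names are illustrative
only and are not taken from the patch-set.

```python
# Sketch of example (1): which cluster looks busiest under each load signal.
# Assumption: avg_load = load * SCHED_CAPACITY_SCALE / capacity.
SCHED_CAPACITY_SCALE = 1024

def avg_load(load, capacity):
    return load * SCHED_CAPACITY_SCALE // capacity

def busiest(clusters, signal):
    """Name of the cluster with the highest avg_load for the given signal."""
    return max(
        clusters,
        key=lambda name: avg_load(clusters[name][signal],
                                  clusters[name]["capacity"]),
    )

def budget_per_task(cluster):
    """Compute capacity available to each task in the cluster."""
    return cluster["capacity"] // cluster["nr_tasks"]

# Numbers copied from the table of example (1).
overload = {
    "cluster 0": {"capacity": 2 * 512, "load": 4096,
                  "scale_load": 2048, "nr_tasks": 4},
    "cluster 1": {"capacity": 1 * 1024, "load": 3072,
                  "scale_load": 3072, "nr_tasks": 3},
}

print(busiest(overload, "load"))        # cluster 0
print(busiest(overload, "scale_load"))  # cluster 1
for name, cluster in overload.items():
    print(name, budget_per_task(cluster))  # 256 for cluster 0, 341 for cluster 1
```

With load, cluster 0 is picked as busiest, which matches its smaller per-task
budget; with scale_load the decision flips to cluster 1.
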
(2) A non-overload example does not show this problem:
7 tasks of 12.5% each (scaled to 1024), 4 on cluster 0, 3 on cluster 1:

                cluster 0        cluster 1
  capacity      1024 (2*512)     1024 (1*1024)
  load          1024             384
  scale_load    512              384

Here cluster 0 is busier whether we take load or cpu-scaled load.
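
The load and scale_load rows of both tables can be derived from the per-task
demand. The sketch below is a simplification under stated assumptions: every
task has weight 1024, a task's unscaled load is the fraction of time it is
runnable on its cpu (capped at always running), and scale_load is
load * capacity / 1024 as described above. The function names are made up for
illustration.

```python
# Sketch: derive cluster load and cpu-scaled load from per-task demand.
# Assumptions: task weight 1024; demand is given relative to a 1024-capacity
# cpu; a task cannot be runnable more than 100% of the time.
SCALE = 1024

def task_load(demand, cpu_capacity):
    """Unscaled load: the same work keeps a smaller cpu busy for longer."""
    return min(SCALE, demand * SCALE // cpu_capacity)

def task_scale_load(demand, cpu_capacity):
    """Cpu-scaled load: unscaled load multiplied by capacity / 1024."""
    return task_load(demand, cpu_capacity) * cpu_capacity // SCALE

def cluster_signals(demand, cpu_capacity, nr_tasks):
    """Return (load, scale_load) summed over identical tasks."""
    return (nr_tasks * task_load(demand, cpu_capacity),
            nr_tasks * task_scale_load(demand, cpu_capacity))

demand = 128  # 12.5% of 1024

print(cluster_signals(demand, 512, 4))   # (1024, 512)  cluster 0
print(cluster_signals(demand, 1024, 3))  # (384, 384)   cluster 1
```

Calling cluster_signals(1024, 512, 4) and cluster_signals(1024, 1024, 3)
gives the numbers of the overload example, (4096, 2048) and (3072, 3072).
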
We should continue to use avg_load based on load (maybe derived from
scaled load once that is introduced?) for overload scenarios and use
scale_load for non-overload scenarios. Since this hasn't been
implemented yet, we got rid of cpu-scaled load in this RFC.
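
A rough sketch of what such a split could look like follows. Nothing here is
implemented in the RFC: the overload test (cpu-scaled load above capacity) and
all names are hypothetical and only serve to illustrate the idea of switching
signals.

```python
# Hypothetical sketch: choose the comparison signal depending on overload.
# Not part of the RFC; the overload criterion below is an assumption.
SCHED_CAPACITY_SCALE = 1024

def is_overloaded(cluster):
    # Assumed criterion: more cpu-scaled load than the cluster has capacity.
    return cluster["scale_load"] > cluster["capacity"]

def pick_busiest(clusters):
    if any(is_overloaded(c) for c in clusters.values()):
        # Overload: compare avg_load computed from the unscaled load.
        def key(name):
            c = clusters[name]
            return c["load"] * SCHED_CAPACITY_SCALE // c["capacity"]
    else:
        # No overload: compare cpu-scaled load directly.
        def key(name):
            return clusters[name]["scale_load"]
    return max(clusters, key=key)

# Numbers copied from examples (1) and (2).
example_1 = {
    "cluster 0": {"capacity": 1024, "load": 4096, "scale_load": 2048},
    "cluster 1": {"capacity": 1024, "load": 3072, "scale_load": 3072},
}
example_2 = {
    "cluster 0": {"capacity": 1024, "load": 1024, "scale_load": 512},
    "cluster 1": {"capacity": 1024, "load": 384, "scale_load": 384},
}

print(pick_busiest(example_1))  # cluster 0
print(pick_busiest(example_2))  # cluster 0
```
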
[...]