Re: [RFC v4 0/6] CPU reclaiming for SCHED_DEADLINE

From: Daniel Bristot de Oliveira
Date: Wed Jan 04 2017 - 13:02:18 EST


On 01/04/2017 05:42 PM, Luca Abeni wrote:
> Hi Daniel,
>
> 2017-01-04 16:14 GMT+01:00, Daniel Bristot de Oliveira <bristot@xxxxxxxxxx>:
>> On 01/04/2017 01:17 PM, luca abeni wrote:
>>> Hi Daniel,
>>>
>>> On Tue, 3 Jan 2017 19:58:38 +0100
>>> Daniel Bristot de Oliveira <bristot@xxxxxxxxxx> wrote:
>>>
>>> [...]
>>>> In a four core box, if I dispatch 11 tasks [1] with setup:
>>>>
>>>> period = 30 ms
>>>> runtime = 10 ms
>>>> flags = 0 (GRUB disabled)
>>>>
>>>> I see this:
>>>> ------------------------------- HTOP ------------------------------------
>>>> 1 [|||||||||||||||||||||92.5%] Tasks: 128, 259 thr; 14 running
>>>> 2 [|||||||||||||||||||||91.0%] Load average: 4.65 4.66 4.81
>>>> 3 [|||||||||||||||||||||92.5%] Uptime: 05:12:43
>>>> 4 [|||||||||||||||||||||92.5%] Mem[|||||||||||||||1.13G/3.78G]
>>>> Swp[ 0K/3.90G]
>>>>
>>>> PID USER PRI NI VIRT RES SHR S CPU% MEM% TIME+ Command
>>>> 16247 root -101 0 4204 632 564 R 32.4 0.0 2:10.35 d
>>>> 16249 root -101 0 4204 624 556 R 32.4 0.0 2:09.80 d
>>>> 16250 root -101 0 4204 728 660 R 32.4 0.0 2:09.58 d
>>>> 16252 root -101 0 4204 676 608 R 32.4 0.0 2:09.08 d
>>>> 16253 root -101 0 4204 636 568 R 32.4 0.0 2:08.85 d
>>>> 16254 root -101 0 4204 732 664 R 32.4 0.0 2:08.62 d
>>>> 16255 root -101 0 4204 620 556 R 32.4 0.0 2:08.40 d
>>>> 16257 root -101 0 4204 708 640 R 32.4 0.0 2:07.98 d
>>>> 16256 root -101 0 4204 624 560 R 32.4 0.0 2:08.18 d
>>>> 16248 root -101 0 4204 680 612 R 33.0 0.0 2:10.15 d
>>>> 16251 root -101 0 4204 676 608 R 33.0 0.0 2:09.34 d
>>>> 16259 root 20 0 124M 4692 3120 R 1.1 0.1 0:02.82 htop
>>>> 2191 bristot 20 0 649M 41312 32048 S 0.0 1.0 0:28.77 gnome-ter
>>>> ------------------------------- HTOP ------------------------------------
>>>>
>>>> All tasks are using +- the same amount of CPU time, a little bit more
>>>> than 30%, as expected.
>>>
>>> Notice that, if I understand well, each task should receive 33.33% (1/3)
>>> of CPU time. Anyway, I think this is ok...
>>
>> If we think on a partitioned system, yes for the CPUs in which 3 'd'
>> tasks are able to run. But as sched deadline is global by definition,
>> the load is:
>>
>> SUM(U_i) / M processors.
>>
>> 1/3 * 11 / 4 = 0.916666667
>>
>> So 10/30 (1/3) of this workload is:
>> 91.6 / 3 = 30.533333333
>>
>> Well, the rest is probably overheads, like scheduling, migration...
>
> I do not think this math is correct... Yes, the total utilization of
> the taskset is 0.91 (or 3.66, depending on how you define the
> utilization...), but I still think that the percentage of CPU time
> shown by "top" or "htop" should be 33.33 (or 8.33, depending on how
> the tool computes it).
> runtime=10 and period=30 means "schedule the task for 10ms every
> 30ms", so the task will consume 33% of the CPU time of a single core.
> In other words, 10/30 is a fraction of the CPU time, not a fraction of
> the time consumed by SCHED_DEADLINE tasks.

Ack! You are correct, I was so focused on the global utilization that I
ended up missing this point. For top/htop it should be 33.3%.

>
>>>> However, if I enable GRUB in the same task set I get this:
>>>>
>>>> ------------------------------- HTOP ------------------------------------
>>>> 1 [|||||||||||||||||||||93.8%] Tasks: 128, 260 thr; 15 running
>>>> 2 [|||||||||||||||||||||95.2%] Load average: 5.13 5.01 4.98
>>>> 3 [|||||||||||||||||||||93.3%] Uptime: 05:01:02
>>>> 4 [|||||||||||||||||||||96.4%] Mem[|||||||||||||||1.13G/3.78G]
>>>> Swp[ 0K/3.90G]
>>>>
>>>> PID USER PRI NI VIRT RES SHR S CPU% MEM% TIME+ Command
>>>> 14967 root -101 0 4204 628 564 R 45.8 0.0 1h07:49 g
>>>> 14962 root -101 0 4204 728 660 R 45.8 0.0 1h05:06 g
>>>> 14959 root -101 0 4204 680 612 R 45.2 0.0 1h07:29 g
>>>> 14927 root -101 0 4204 624 556 R 44.6 0.0 1h04:30 g
>>>> 14928 root -101 0 4204 656 588 R 31.1 0.0 47:37.21 g
>>>> 14961 root -101 0 4204 684 616 R 31.1 0.0 47:19.75 g
>>>> 14968 root -101 0 4204 636 568 R 31.1 0.0 46:27.36 g
>>>> 14960 root -101 0 4204 684 616 R 23.8 0.0 37:31.06 g
>>>> 14969 root -101 0 4204 684 616 R 23.8 0.0 38:11.50 g
>>>> 14925 root -101 0 4204 636 568 R 23.8 0.0 37:34.88 g
>>>> 14926 root -101 0 4204 684 616 R 23.8 0.0 38:27.37 g
>>>> 16182 root 20 0 124M 3972 3212 R 0.6 0.1 0:00.23 htop
>>>> 862 root 20 0 264M 5668 4832 S 0.6 0.1 0:03.30 iio-sensor
>>>> 2191 bristot 20 0 649M 41312 32048 S 0.0 1.0 0:27.62 gnome-term
>>>> 588 root 20 0 257M 121M 120M S 0.0 3.1 0:13.53 systemd-jo
>>>> ------------------------------- HTOP ------------------------------------
>>>>
>>>> Some tasks start to use more CPU time, while others seems to use less
>>>> CPU than it was reserved for them. See the task 14926, it is using
>>>> only 23.8 % of the CPU, which is less than its 10/30 reservation.
>>>
>>> What happened here is that some runqueues have an active utilisation
>>> larger than 0.95. So, GRUB is decreasing the amount of time received by
>>> the tasks on those runqueues to consume less than 95%... This is the
>>> reason for the effect you noticed below:
>>
>> I see. But, AFAIK, the Linux's sched deadline measures the load
>> globally, not locally. So, it is not a problem having a load > than 95%
>> in the local queue if the global queue is < 95%.
>>
>> Am I missing something?
>
> The version of GRUB reclaiming implemented in my patches tracks a
> per-runqueue "active utilization", and uses it for reclaiming.

I _think_ that this might be (one of) the source(s) of the problem...

Just exercising...

For example, with my taskset, assuming a hypothetically perfect balance
of the tasks across the runqueues, one possible scenario is:

CPU 0 1 2 3
# TASKS 3 3 3 2

In this case, CPUs 0, 1 and 2 are at 100% local utilization. Thus, the
current tasks on these CPUs will have their runtime decreased by GRUB.
Meanwhile, the lucky tasks on CPU 3 will use additional time that they
"globally" do not have - because the system, globally, has a load higher
than the 66.6...% of that local runqueue. In practice, part of the time
taken from the tasks on CPUs [0-2] is being used by the tasks on CPU 3,
until the next migration of any task changes which tasks are the lucky
ones... but without any guarantee that every task will be a lucky one on
every activation, causing the problem.

Does it make sense?

If it does, this leads me to think that only by tracking the utilization
globally will we achieve the correct result... but I may be missing
something... :-).

-- Daniel