Re: [RFC v4 0/6] CPU reclaiming for SCHED_DEADLINE

From: Daniel Bristot de Oliveira
Date: Wed Jan 04 2017 - 10:15:24 EST


On 01/04/2017 01:17 PM, luca abeni wrote:
> Hi Daniel,
>
> On Tue, 3 Jan 2017 19:58:38 +0100
> Daniel Bristot de Oliveira <bristot@xxxxxxxxxx> wrote:
>
> [...]
>> On a four-core box, if I dispatch 11 tasks [1] with this setup:
>>
>> period = 30 ms
>> runtime = 10 ms
>> flags = 0 (GRUB disabled)
>>
>> I see this:
>> ------------------------------- HTOP ------------------------------------
>> 1 [|||||||||||||||||||||92.5%] Tasks: 128, 259 thr; 14 running
>> 2 [|||||||||||||||||||||91.0%] Load average: 4.65 4.66 4.81
>> 3 [|||||||||||||||||||||92.5%] Uptime: 05:12:43
>> 4 [|||||||||||||||||||||92.5%] Mem[|||||||||||||||1.13G/3.78G]
>>   Swp[ 0K/3.90G]
>>
>> PID USER PRI NI VIRT RES SHR S CPU% MEM% TIME+ Command
>> 16247 root -101 0 4204 632 564 R 32.4 0.0 2:10.35 d
>> 16249 root -101 0 4204 624 556 R 32.4 0.0 2:09.80 d
>> 16250 root -101 0 4204 728 660 R 32.4 0.0 2:09.58 d
>> 16252 root -101 0 4204 676 608 R 32.4 0.0 2:09.08 d
>> 16253 root -101 0 4204 636 568 R 32.4 0.0 2:08.85 d
>> 16254 root -101 0 4204 732 664 R 32.4 0.0 2:08.62 d
>> 16255 root -101 0 4204 620 556 R 32.4 0.0 2:08.40 d
>> 16257 root -101 0 4204 708 640 R 32.4 0.0 2:07.98 d
>> 16256 root -101 0 4204 624 560 R 32.4 0.0 2:08.18 d
>> 16248 root -101 0 4204 680 612 R 33.0 0.0 2:10.15 d
>> 16251 root -101 0 4204 676 608 R 33.0 0.0 2:09.34 d
>> 16259 root 20 0 124M 4692 3120 R 1.1 0.1 0:02.82 htop
>> 2191 bristot 20 0 649M 41312 32048 S 0.0 1.0 0:28.77 gnome-ter
>> ------------------------------- HTOP ------------------------------------
>>
>> All tasks are using more or less the same amount of CPU time, a little
>> bit more than 30%, as expected.
>
> Notice that, if I understand correctly, each task should receive 33.33%
> (1/3) of the CPU time. Anyway, I think this is ok...

If we think of a partitioned system, yes, for the CPUs on which 3 'd'
tasks are able to run. But as SCHED_DEADLINE is global by definition,
the load is:

SUM(U_i) / M processors:

(11 * 1/3) / 4 = 0.9166... (the ~92% per CPU that htop shows)

So a 10/30 (1/3) share of this workload is:

91.67 / 3 = ~30.56%

Well, the difference from the ~32.4% htop reports is probably overheads,
like scheduling, migrations...
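
Spelling that arithmetic out (plain userspace math, using only the numbers
from the setup and the htop output above; nothing kernel-side):

/* toy check of the non-GRUB numbers: 11 tasks of 10ms/30ms on 4 CPUs */
#include <stdio.h>

int main(void)
{
        const double runtime = 10.0, period = 30.0;     /* ms, setup above */
        const int ntasks = 11, ncpus = 4;

        double u_task = runtime / period;               /* 1/3 per task    */
        double u_global = u_task * ntasks / ncpus;      /* SUM(U_i) / M    */

        printf("per-CPU load: %.2f%%\n", u_global * 100);     /* ~91.67%; htop: ~92%   */
        printf("1/3 of that:  %.2f%%\n", u_global * 100 / 3); /* ~30.56%; htop: ~32.4% */
        return 0;
}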

>> However, if I enable GRUB for the same task set, I get this:
>>
>> ------------------------------- HTOP ------------------------------------
>> 1 [|||||||||||||||||||||93.8%] Tasks: 128, 260 thr; 15 running
>> 2 [|||||||||||||||||||||95.2%] Load average: 5.13 5.01 4.98
>> 3 [|||||||||||||||||||||93.3%] Uptime: 05:01:02
>> 4 [|||||||||||||||||||||96.4%] Mem[|||||||||||||||1.13G/3.78G]
>>   Swp[ 0K/3.90G]
>>
>> PID USER PRI NI VIRT RES SHR S CPU% MEM% TIME+ Command
>> 14967 root -101 0 4204 628 564 R 45.8 0.0 1h07:49 g
>> 14962 root -101 0 4204 728 660 R 45.8 0.0 1h05:06 g
>> 14959 root -101 0 4204 680 612 R 45.2 0.0 1h07:29 g
>> 14927 root -101 0 4204 624 556 R 44.6 0.0 1h04:30 g
>> 14928 root -101 0 4204 656 588 R 31.1 0.0 47:37.21 g
>> 14961 root -101 0 4204 684 616 R 31.1 0.0 47:19.75 g
>> 14968 root -101 0 4204 636 568 R 31.1 0.0 46:27.36 g
>> 14960 root -101 0 4204 684 616 R 23.8 0.0 37:31.06 g
>> 14969 root -101 0 4204 684 616 R 23.8 0.0 38:11.50 g
>> 14925 root -101 0 4204 636 568 R 23.8 0.0 37:34.88 g
>> 14926 root -101 0 4204 684 616 R 23.8 0.0 38:27.37 g
>> 16182 root 20 0 124M 3972 3212 R 0.6 0.1 0:00.23 htop
>> 862 root 20 0 264M 5668 4832 S 0.6 0.1 0:03.30 iio-sensor
>> 2191 bristot 20 0 649M 41312 32048 S 0.0 1.0 0:27.62 gnome-term
>> 588 root 20 0 257M 121M 120M S 0.0 3.1 0:13.53 systemd-jo
>> ------------------------------- HTOP ------------------------------------
>>
>> Some tasks start to use more CPU time, while others seem to use less
>> CPU than was reserved for them. See task 14926: it is using only
>> 23.8% of the CPU, which is less than its 10/30 reservation.
>
> What happened here is that some runqueues have an active utilisation
> larger than 0.95. So, GRUB is decreasing the amount of time received by
> the tasks on those runqueues so that they consume less than 95%... This
> is the reason for the effect you noticed below:

I see. But, AFAIK, Linux's SCHED_DEADLINE accounts the load globally, not
per runqueue. So, it should not be a problem to have a load greater than
95% on a local runqueue as long as the global load stays below 95%.
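
Putting numbers on what I mean (userspace arithmetic only; u_act stands for
the per-runqueue active utilisation and 0.95 is the limit you mentioned --
this is just my understanding, not the code of the patches):

/* local vs. global utilisation for 11 tasks of 1/3 on 4 CPUs */
#include <stdio.h>

int main(void)
{
        const double u_task = 10.0 / 30.0;              /* utilisation per task */
        const double u_max = 0.95;                      /* reclaiming limit     */
        const double u_global = u_task * 11 / 4;        /* ~0.917 < 0.95        */
        int n;

        /*
         * With 11 such tasks on 4 CPUs, some runqueue always ends up
         * with at least 3 of them, so its local active utilisation
         * crosses 0.95 even though the global one does not.
         */
        for (n = 2; n <= 3; n++) {
                double u_act = n * u_task;

                printf("%d tasks on a rq: u_act = %.3f (%s 0.95), global = %.3f\n",
                       n, u_act, u_act > u_max ? ">" : "<=", u_global);
        }
        return 0;
}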

Am I missing something?

>
>> After some debugging, it seems that in this case GRUB is also
>> _reducing_ the runtime of the task, by making the accounted consumed
>> runtime greater than the runtime the task actually consumed.
> [...]
>
> Now, this is "kind of expected", because you have 11 tasks, each with
> utilisation 1/3, distributed on 4 CPUs... So, some CPU will have
> 3 tasks on it, resulting in a utilisation = 1 > 0.95. But this should
> not result in what you have seen in htop...

Well, SCHED_DEADLINE aims to run the M highest-priority (earliest-deadline)
tasks, and migrates tasks to achieve this goal. However, I am not sure
whether keeping the runqueues' utilisation balanced is a
goal/restriction/feature of the deadline scheduler.

Maybe this difference between the GRUB and SCHED_DEADLINE assumptions is
what is causing the problem. Just thinking aloud.

> The real issue seems to be that at some point some runqueues have an
> active utilisation = 1.33 (4 dl tasks in the runqueue), with other
> runqueues only having 2 tasks... And this results in the huge imbalance
> in utilisations you noticed. I am trying to understand why this
> happens... It seems to me that a "pull_dl_task()" might end up pulling
> more than 1 task... Is this possible?

Yeah, this explains the numbers.
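
Just to convince myself: assuming the reclaiming depletes the runtime at a
rate of roughly u_act / 0.95 on each runqueue (my reading of the idea; the
exact formula in the patches may differ), a 2/3/4-task split across the
runqueues reproduces the htop numbers (45.8%, 31.1%, 23.8%) pretty closely:

/* toy model: per-task CPU share under an assumed per-rq reclaiming rule */
#include <stdio.h>

int main(void)
{
        const double u_task = 10.0 / 30.0;              /* 1/3 per task        */
        const double u_max = 0.95;                      /* reclaiming limit    */
        const int tasks_per_rq[] = { 2, 3, 4 };         /* split you described */
        int i;

        for (i = 0; i < 3; i++) {
                int n = tasks_per_rq[i];
                double u_act = n * u_task;
                double rate = u_act / u_max;            /* assumed depletion rate */
                double share = u_task / rate;           /* CPU share per task     */

                printf("%d tasks on the rq: u_act = %.2f -> ~%.1f%% each\n",
                       n, u_act, share * 100);          /* ~47.5 / ~31.7 / ~23.8  */
        }
        return 0;
}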

Brainstorm time! (sorry if this sounds obviously unfeasible):
would it be possible for GRUB to track the global utilization instead?
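
Something like this (same toy arithmetic as above; a purely hypothetical
"global GRUB", not something that exists in the patches): if the depletion
rate were derived from the global utilisation, every task would be
reclaimed at the same rate, no matter which runqueue it sits on:

/* toy model: reclaiming driven by the global utilisation instead */
#include <stdio.h>

int main(void)
{
        const double u_task = 10.0 / 30.0;
        const double u_max = 0.95;
        const double u_global = u_task * 11 / 4;        /* ~0.917 */

        double rate = u_global / u_max;                 /* same on every rq */
        double share = u_task / rate;

        /* ~34.5% each; 11 * 34.5% ~= 3.8 CPUs, i.e. 95% of the 4 CPUs */
        printf("every task: ~%.1f%%\n", share * 100);
        return 0;
}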

-- Daniel