Re: [RFC v4 0/6] CPU reclaiming for SCHED_DEADLINE

From: luca abeni
Date: Wed Jan 04 2017 - 07:18:47 EST


Hi Daniel,

On Tue, 3 Jan 2017 19:58:38 +0100
Daniel Bristot de Oliveira <bristot@xxxxxxxxxx> wrote:

[...]
> In a four core box, if I dispatch 11 tasks [1] with setup:
>
> period = 30 ms
> runtime = 10 ms
> flags = 0 (GRUB disabled)
>
> I see this:
> ------------------------------- HTOP
> ------------------------------------ 1
> [|||||||||||||||||||||92.5%] Tasks: 128, 259 thr; 14 running 2
> [|||||||||||||||||||||91.0%] Load average: 4.65 4.66 4.81 3
> [|||||||||||||||||||||92.5%] Uptime: 05:12:43 4
> [|||||||||||||||||||||92.5%] Mem[|||||||||||||||1.13G/3.78G]
> Swp[ 0K/3.90G]
>
> PID USER PRI NI VIRT RES SHR S CPU% MEM% TIME+ Command
> 16247 root -101 0 4204 632 564 R 32.4 0.0 2:10.35 d
> 16249 root -101 0 4204 624 556 R 32.4 0.0 2:09.80 d
> 16250 root -101 0 4204 728 660 R 32.4 0.0 2:09.58 d
> 16252 root -101 0 4204 676 608 R 32.4 0.0 2:09.08 d
> 16253 root -101 0 4204 636 568 R 32.4 0.0 2:08.85 d
> 16254 root -101 0 4204 732 664 R 32.4 0.0 2:08.62 d
> 16255 root -101 0 4204 620 556 R 32.4 0.0 2:08.40 d
> 16257 root -101 0 4204 708 640 R 32.4 0.0 2:07.98 d
> 16256 root -101 0 4204 624 560 R 32.4 0.0 2:08.18 d
> 16248 root -101 0 4204 680 612 R 33.0 0.0 2:10.15 d
> 16251 root -101 0 4204 676 608 R 33.0 0.0 2:09.34 d
> 16259 root 20 0 124M 4692 3120 R 1.1 0.1 0:02.82 htop
> 2191 bristot 20 0 649M 41312 32048 S 0.0 1.0 0:28.77
> gnome-ter ------------------------------- HTOP
> ------------------------------------
>
> All tasks are using +- the same amount of CPU time, a little bit more
> than 30%, as expected.

Notice that, if I understand well, each task should receive 33.33% (1/3)
of CPU time. Anyway, I think this is ok...

> However, if I enable GRUB in the same task set I get this:
>
> ------------------------------- HTOP
> ------------------------------------ 1
> [|||||||||||||||||||||93.8%] Tasks: 128, 260 thr; 15 running 2
> [|||||||||||||||||||||95.2%] Load average: 5.13 5.01 4.98 3
> [|||||||||||||||||||||93.3%] Uptime: 05:01:02 4
> [|||||||||||||||||||||96.4%] Mem[|||||||||||||||1.13G/3.78G]
> Swp[ 0K/3.90G]
>
> PID USER PRI NI VIRT RES SHR S CPU% MEM% TIME+ Command
> 14967 root -101 0 4204 628 564 R 45.8 0.0 1h07:49 g
> 14962 root -101 0 4204 728 660 R 45.8 0.0 1h05:06 g
> 14959 root -101 0 4204 680 612 R 45.2 0.0 1h07:29 g
> 14927 root -101 0 4204 624 556 R 44.6 0.0 1h04:30 g
> 14928 root -101 0 4204 656 588 R 31.1 0.0 47:37.21 g
> 14961 root -101 0 4204 684 616 R 31.1 0.0 47:19.75 g
> 14968 root -101 0 4204 636 568 R 31.1 0.0 46:27.36 g
> 14960 root -101 0 4204 684 616 R 23.8 0.0 37:31.06 g
> 14969 root -101 0 4204 684 616 R 23.8 0.0 38:11.50 g
> 14925 root -101 0 4204 636 568 R 23.8 0.0 37:34.88 g
> 14926 root -101 0 4204 684 616 R 23.8 0.0 38:27.37 g
> 16182 root 20 0 124M 3972 3212 R 0.6 0.1 0:00.23 htop
> 862 root 20 0 264M 5668 4832 S 0.6 0.1 0:03.30
> iio-sensor 2191 bristot 20 0 649M 41312 32048 S 0.0 1.0
> 0:27.62 gnome-term 588 root 20 0 257M 121M 120M S 0.0
> 3.1 0:13.53 systemd-jo ------------------------------- HTOP
> ------------------------------------
>
> Some tasks start to use more CPU time, while others seems to use less
> CPU than it was reserved for them. See the task 14926, it is using
> only 23.8 % of the CPU, which is less than its 10/30 reservation.

What happened here is that some runqueues have an active utilisation
larger than 0.95. So, GRUB is decreasing the amount of time received by
the tasks on those runqueues to consume less than 95%... This is the
reason for the effect you noticed below:


> After some debugging, it seems that in this case GRUB is also
> _reducing_ the runtime of the task by making the notion of consumed
> runtime be greater than the actual consumed runtime.
[...]

Now, this is "kind of expected", because you have 11 tasks each one
having utilisation 1/3, distributed on 4 CPUs... So, some CPU will have
3 tasks on it, resulting in an utilisation = 1 > 0.95. But this should
not result in what you have seen in htop...
The real issue seems to be that at some point some runqueues have an
active utilisation = 1.33 (4 dl tasks in the runqueue), with other
runqueues only having 2 tasks... And this results in the huge imbalance
in utilisations you noticed. I am trying to understand why this
happens... It seems to me that a "pull_dl_task()" might end up pulling
more than 1 task... Is this possible?


Luca

>
> You can see this with this code snip:
>
> ------------------- %<-------------------
> diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
> index 93ff400..1abb594 100644
> --- a/kernel/sched/deadline.c
> +++ b/kernel/sched/deadline.c
> @@ -823,9 +823,21 @@ static void update_curr_dl(struct rq *rq)
>
> sched_rt_avg_update(rq, delta_exec);
>
> - if (unlikely(dl_se->flags & SCHED_FLAG_RECLAIM))
> - delta_exec = grub_reclaim(delta_exec, rq);
> - dl_se->runtime -= delta_exec;
> + if (unlikely(dl_se->flags & SCHED_FLAG_RECLAIM)) {
> + u64 new_delta_exec;
> + new_delta_exec = grub_reclaim(delta_exec, rq);
> + if (new_delta_exec > delta_exec)
> + trace_printk("new delta exec (%llu) is
> greater than delta exec (%llu) by %llu\n",
> + new_delta_exec,
> + delta_exec,
> + (new_delta_exec -
> delta_exec));
> + dl_se->runtime -= new_delta_exec;
> + }
> + else {
> + dl_se->runtime -= delta_exec;
> + }
> +
> +
>
> throttle:
> if (dl_runtime_exceeded(dl_se) || dl_se->dl_yielded) {
> --------------------------- >% -------------
>
> It seems to be related to the "sched/deadline: do not reclaim the
> whole CPU bandwidth", because the trace_printk message I put starts
> to appear when we start to touch this limit, and the (new_delta_exec
> - delta_exec) seems to be somehow limited to the non_deadline_bw.
>
> Output with sysctl -w kernel.sched_rt_runtime_us=950000
> g-1984 [001] d.h1 1108.783349: update_curr_dl: new
> delta exec (1050043) is greater than delta exec (1000042) by 50001
> g-1983 [002] d.h1 1108.783349: update_curr_dl: new delta exec
> (1049974) is greater than delta exec (999976) by 49998 g-1981 [003]
> d.h1 1108.783350: update_curr_dl: new delta exec (1050054) is
> greater than delta exec (1000053) by 50001
>
> Output with sysctl -w kernel.sched_rt_runtime_us=900000
> g-1748 [001] d.h1 418.879815: update_curr_dl: new
> delta exec (1099995) is greater than delta exec (999996) by 99999
> g-1749 [002] d.h1 418.880815: update_curr_dl: new delta exec
> (1099986) is greater than delta exec (999988) by 99998 g-1748 [001]
> d.h1 418.880815: update_curr_dl: new delta exec (1099962) is
> greater than delta exec (999966) by 99996
>
> In the case of fewer tasks, this error appears just in the
> dispatch of a new task, stabilizing after some ms. But it
> does not stabilize when we are closer to the limit of the rt
> runtime.
>
> That is all I could find today. Am I missing something?
>
> [1] http://bristot.me/lkml/d.c
>
> -- Daniel