Re: [RFC v4 0/6] CPU reclaiming for SCHED_DEADLINE

From: Daniel Bristot de Oliveira
Date: Tue Jan 03 2017 - 14:08:46 EST


On 12/30/2016 12:33 PM, Luca Abeni wrote:
> From: Luca Abeni <luca.abeni@xxxxxxxx>
>
> Hi all,
>
> here is a new version of the patchset implementing CPU reclaiming
> (using the GRUB algorithm[1]) for SCHED_DEADLINE.
> Basically, this feature allows SCHED_DEADLINE tasks to consume more
> than their reserved runtime, up to a maximum fraction of the CPU time
> (so that other tasks are left some spare CPU time to execute), if this
> does not break the guarantees of other SCHED_DEADLINE tasks.
> The patchset applies on top of tip/master.
>
>
> The implemented CPU reclaiming algorithm is based on tracking the
> utilization U_act of active tasks (first 2 patches), and modifying the
> runtime accounting rule (see patch 0004). The original GRUB algorithm is
> modified as described in [2] to support multiple CPUs (the original
> algorithm only considered one single CPU, this one tracks U_act per
> runqueue) and to leave an "unreclaimable" fraction of CPU time to non
> SCHED_DEADLINE tasks (see patch 0005: the original algorithm can consume
> 100% of the CPU time, starving all the other tasks).
> Patch 0003 uses the newly introduced "inactive timer" (introduced in
> patch 0002) to fix dl_overflow() and __setparam_dl().
> Patch 0006 allows to enable CPU reclaiming only on selected tasks.

Hi,

Today I ran some tests on this patch set. Unfortunately, it seems that
there is a problem :-(.

On a four-core box, if I dispatch 11 tasks [1] with this setup:

period = 30 ms
runtime = 10 ms
flags = 0 (GRUB disabled)
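
For reference, the gist of what each task does is sketched below. This is
only a minimal sketch of my understanding, not the real test program (which
is at [1]); struct sched_attr is redefined locally and SCHED_FLAG_RECLAIM is
the per-task flag added by patch 0006:

------------------- %<-------------------
/* One SCHED_DEADLINE task: 10 ms runtime every 30 ms period, then burn CPU. */
#define _GNU_SOURCE
#include <stdio.h>
#include <stdint.h>
#include <string.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/sched.h>

struct sched_attr {
	uint32_t size;
	uint32_t sched_policy;
	uint64_t sched_flags;
	int32_t  sched_nice;
	uint32_t sched_priority;
	/* SCHED_DEADLINE parameters, in nanoseconds */
	uint64_t sched_runtime;
	uint64_t sched_deadline;
	uint64_t sched_period;
};

int main(void)
{
	struct sched_attr attr;

	memset(&attr, 0, sizeof(attr));
	attr.size = sizeof(attr);
	attr.sched_policy   = SCHED_DEADLINE;
	attr.sched_runtime  = 10 * 1000 * 1000;		/* 10 ms */
	attr.sched_deadline = 30 * 1000 * 1000;		/* 30 ms */
	attr.sched_period   = 30 * 1000 * 1000;		/* 30 ms */
	attr.sched_flags    = 0;	/* 0: GRUB off; SCHED_FLAG_RECLAIM: GRUB on */

	if (syscall(SYS_sched_setattr, 0, &attr, 0)) {
		perror("sched_setattr");
		return 1;
	}

	for (;;)
		;	/* busy loop: the scheduler throttles us at the runtime */

	return 0;
}
--------------------------- >% -------------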

I see this:
------------------------------- HTOP ------------------------------------
1 [|||||||||||||||||||||92.5%] Tasks: 128, 259 thr; 14 running
2 [|||||||||||||||||||||91.0%] Load average: 4.65 4.66 4.81
3 [|||||||||||||||||||||92.5%] Uptime: 05:12:43
4 [|||||||||||||||||||||92.5%]
Mem[|||||||||||||||1.13G/3.78G]
Swp[ 0K/3.90G]

PID USER PRI NI VIRT RES SHR S CPU% MEM% TIME+ Command
16247 root -101 0 4204 632 564 R 32.4 0.0 2:10.35 d
16249 root -101 0 4204 624 556 R 32.4 0.0 2:09.80 d
16250 root -101 0 4204 728 660 R 32.4 0.0 2:09.58 d
16252 root -101 0 4204 676 608 R 32.4 0.0 2:09.08 d
16253 root -101 0 4204 636 568 R 32.4 0.0 2:08.85 d
16254 root -101 0 4204 732 664 R 32.4 0.0 2:08.62 d
16255 root -101 0 4204 620 556 R 32.4 0.0 2:08.40 d
16257 root -101 0 4204 708 640 R 32.4 0.0 2:07.98 d
16256 root -101 0 4204 624 560 R 32.4 0.0 2:08.18 d
16248 root -101 0 4204 680 612 R 33.0 0.0 2:10.15 d
16251 root -101 0 4204 676 608 R 33.0 0.0 2:09.34 d
16259 root 20 0 124M 4692 3120 R 1.1 0.1 0:02.82 htop
2191 bristot 20 0 649M 41312 32048 S 0.0 1.0 0:28.77 gnome-ter
------------------------------- HTOP ------------------------------------

All tasks are using roughly the same amount of CPU time, a little more
than 30% (10/30 ~= 33.3% each), as expected. However, if I enable GRUB
on the same task set I get this:

------------------------------- HTOP ------------------------------------
1 [|||||||||||||||||||||93.8%] Tasks: 128, 260 thr; 15 running
2 [|||||||||||||||||||||95.2%] Load average: 5.13 5.01 4.98
3 [|||||||||||||||||||||93.3%] Uptime: 05:01:02
4 [|||||||||||||||||||||96.4%]
Mem[|||||||||||||||1.13G/3.78G]
Swp[ 0K/3.90G]

PID USER PRI NI VIRT RES SHR S CPU% MEM% TIME+ Command
14967 root -101 0 4204 628 564 R 45.8 0.0 1h07:49 g
14962 root -101 0 4204 728 660 R 45.8 0.0 1h05:06 g
14959 root -101 0 4204 680 612 R 45.2 0.0 1h07:29 g
14927 root -101 0 4204 624 556 R 44.6 0.0 1h04:30 g
14928 root -101 0 4204 656 588 R 31.1 0.0 47:37.21 g
14961 root -101 0 4204 684 616 R 31.1 0.0 47:19.75 g
14968 root -101 0 4204 636 568 R 31.1 0.0 46:27.36 g
14960 root -101 0 4204 684 616 R 23.8 0.0 37:31.06 g
14969 root -101 0 4204 684 616 R 23.8 0.0 38:11.50 g
14925 root -101 0 4204 636 568 R 23.8 0.0 37:34.88 g
14926 root -101 0 4204 684 616 R 23.8 0.0 38:27.37 g
16182 root 20 0 124M 3972 3212 R 0.6 0.1 0:00.23 htop
862 root 20 0 264M 5668 4832 S 0.6 0.1 0:03.30 iio-sensor
2191 bristot 20 0 649M 41312 32048 S 0.0 1.0 0:27.62 gnome-term
588 root 20 0 257M 121M 120M S 0.0 3.1 0:13.53 systemd-jo
------------------------------- HTOP ------------------------------------

Some tasks start to use more CPU time, while others seem to use less
CPU than was reserved for them. See task 14926: it is using only
23.8% of the CPU, which is less than its 10/30 (~33.3%) reservation.

I traced this task's activations and noticed this:

swapper 0 [003] 14968.332244: sched:sched_switch: swapper/3:0 [120] R ==> g:14926 [-1]
g 14926 [003] 14968.339294: sched:sched_switch: g:14926 [-1] R ==> g:14960 [-1]
runtime: 7050 us (14968.339294 - 14968.332244)

period: 29997 us (14968.362241 - 14968.332244)
swapper 0 [003] 14968.362241: sched:sched_switch: swapper/3:0 [120] R ==> g:14926 [-1]
g 14926 [003] 14968.369294: sched:sched_switch: g:14926 [-1] R ==> g:14960 [-1]
runtime: 7053 us (14968.369294 - 14968.362241)

period: 29994 us (14968.392235 - 14968.362241)
swapper 0 [003] 14968.392235: sched:sched_switch: swapper/3:0 [120] R ==> g:14926 [-1]
g 14926 [003] 14968.399301: sched:sched_switch: g:14926 [-1] R ==> g:14960 [-1]
runtime: 7066 us (14968.399301 - 14968.392235)

period: 30008 us (14968.422243 - 14968.392235)
swapper 0 [003] 14968.422243: sched:sched_switch: swapper/3:0 [120] R ==> g:14926 [-1]
g 14926 [003] 14968.429294: sched:sched_switch: g:14926 [-1] R ==> g:14960 [-1]
runtime: 7051 us (14968.429294 - 14968.422243)

period: 29995 us (14968.452238 - 14968.422243)
swapper 0 [003] 14968.452238: sched:sched_switch: swapper/3:0 [120] R ==> g:14926 [-1]
g 14926 [003] 14968.459293: sched:sched_switch: g:14926 [-1] R ==> g:14960 [-1]
runtime: 7055 us (14968.459293 - 14968.452238)

period: 30055 us (14968.482293 - 14968.452238)
g 14925 [003] 14968.482293: sched:sched_switch: g:14925 [-1] R ==> g:14926 [-1]
g 14926 [003] 14968.490293: sched:sched_switch: g:14926 [-1] R ==> g:14960 [-1]
runtime: 8000 us (14968.490293 - 14968.482293)
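
Doing the arithmetic on the samples above, a typical activation of task
14926 gets about:

	7055 us / 30000 us ~= 23.5% of a CPU

which matches the 23.8% htop reports, and is well below the reserved
10/30 ~= 33.3%.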

The task is using less CPU than was reserved/guaranteed for it.

After some debugging, it seems that in this case GRUB is also _reducing_
the runtime of the task, by accounting more consumed runtime than the
task actually consumed.
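
If I got the plain GRUB rule from the cover letter right (patch 0004:
dq = -U_act * dt), this should never happen, because U_act <= 1 means the
scaled delta can only be smaller than or equal to the real one. A rough
model of my understanding (just a sketch, not the patchset code; the
fixed-point shift is illustrative):

------------------- %<-------------------
/*
 * Model of the plain GRUB accounting rule dq = -U_act * dt: the runtime
 * is depleted by delta_exec scaled by the active utilization of the
 * runqueue.  With U_act <= 1, the result is always <= delta_exec, so a
 * task is never charged for more time than it actually ran.
 */
static u64 grub_reclaim_model(u64 delta_exec, u64 running_bw)
{
	/* running_bw is U_act in fixed point, 1.0 == 1 << 20 */
	return (delta_exec * running_bw) >> 20;
}
--------------------------- >% -------------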

You can see the over-accounting with this debug snippet:

------------------- %<-------------------
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 93ff400..1abb594 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -823,9 +823,21 @@ static void update_curr_dl(struct rq *rq)
 
 	sched_rt_avg_update(rq, delta_exec);
 
-	if (unlikely(dl_se->flags & SCHED_FLAG_RECLAIM))
-		delta_exec = grub_reclaim(delta_exec, rq);
-	dl_se->runtime -= delta_exec;
+	if (unlikely(dl_se->flags & SCHED_FLAG_RECLAIM)) {
+		u64 new_delta_exec;
+		new_delta_exec = grub_reclaim(delta_exec, rq);
+		if (new_delta_exec > delta_exec)
+			trace_printk("new delta exec (%llu) is greater than delta exec (%llu) by %llu\n",
+				     new_delta_exec,
+				     delta_exec,
+				     (new_delta_exec - delta_exec));
+		dl_se->runtime -= new_delta_exec;
+	}
+	else {
+		dl_se->runtime -= delta_exec;
+	}
+
+
 
 throttle:
 	if (dl_runtime_exceeded(dl_se) || dl_se->dl_yielded) {
--------------------------- >% -------------

It seems to be related to patch 0005, "sched/deadline: do not reclaim
the whole CPU bandwidth", because the trace_printk message I added starts
to appear when we start to touch this limit, and (new_delta_exec -
delta_exec) seems to be bounded by the non_deadline_bw.

Output with sysctl -w kernel.sched_rt_runtime_us=950000
g-1984 [001] d.h1 1108.783349: update_curr_dl: new delta exec (1050043) is greater than delta exec (1000042) by 50001
g-1983 [002] d.h1 1108.783349: update_curr_dl: new delta exec (1049974) is greater than delta exec (999976) by 49998
g-1981 [003] d.h1 1108.783350: update_curr_dl: new delta exec (1050054) is greater than delta exec (1000053) by 50001

Output with sysctl -w kernel.sched_rt_runtime_us=900000
g-1748 [001] d.h1 418.879815: update_curr_dl: new delta exec (1099995) is greater than delta exec (999996) by 99999
g-1749 [002] d.h1 418.880815: update_curr_dl: new delta exec (1099986) is greater than delta exec (999988) by 99998
g-1748 [001] d.h1 418.880815: update_curr_dl: new delta exec (1099962) is greater than delta exec (999966) by 99996
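
Doing the arithmetic on these samples (with the default
sched_rt_period_us = 1000000):

	rt_runtime_us=950000:  50001 / 1000042 ~=  5.0%  (non-deadline bw =  5%)
	rt_runtime_us=900000:  99999 /  999996 ~= 10.0%  (non-deadline bw = 10%)

so the excess seems to be exactly the non-deadline bandwidth, i.e.
new_delta_exec ~= delta_exec * (1 + non_deadline_bw) once we hit the limit.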

With fewer tasks, this error appears only when a new task is dispatched,
and it stabilizes after a few ms. But it does not stabilize when we are
close to the limit of the RT runtime.

That is all I could find today. Am I missing something?

[1] http://bristot.me/lkml/d.c

-- Daniel