Re: [RFC v4 0/6] CPU reclaiming for SCHED_DEADLINE

From: luca abeni
Date: Tue Jan 03 2017 - 16:33:57 EST

Next message: Michal Hocko: "Re: [PATCH] drivers/virt: use get_user_pages_unlocked()"
Previous message: Jason Gunthorpe: "Re: [tpmdd-devel] [PATCH RFC 0/4] RFC: in-kernel resource manager"
In reply to: Daniel Bristot de Oliveira: "Re: [RFC v4 0/6] CPU reclaiming for SCHED_DEADLINE"
Next in thread: luca abeni: "Re: [RFC v4 0/6] CPU reclaiming for SCHED_DEADLINE"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

Hi Daniel,
(sorry for the previous html email; I replied from my phone and I did
not realise how the email client was configured)

On Tue, 3 Jan 2017 19:58:38 +0100
Daniel Bristot de Oliveira <bristot@xxxxxxxxxx> wrote:

[...]
> > The implemented CPU reclaiming algorithm is based on tracking the
> > utilization U_act of active tasks (first 2 patches), and modifying
> > the runtime accounting rule (see patch 0004). The original GRUB
> > algorithm is modified as described in [2] to support multiple CPUs
> > (the original algorithm only considered one single CPU, this one
> > tracks U_act per runqueue) and to leave an "unreclaimable" fraction
> > of CPU time to non SCHED_DEADLINE tasks (see patch 0005: the
> > original algorithm can consume 100% of the CPU time, starving all
> > the other tasks). Patch 0003 uses the newly introduced "inactive
> > timer" (introduced in patch 0002) to fix dl_overflow() and
> > __setparam_dl(). Patch 0006 allows to enable CPU reclaiming only on
> > selected tasks.
>
> Hi,
>
> Today I did some tests in this patch set. Unfortunately, it seems that
> there is a problem :-(.
[...]
I reproduced this issue; thanks for the report. It seems to be due to
the fact that the reclaiming tasks are more than the CPU cores and the
load is very high (near to the utilisation limit).

I am investigating it, and will hopefully post an update in the next
days.

Thanks,
Luca

>
> In a four core box, if I dispatch 11 tasks [1] with setup:
>
> period = 30 ms
> runtime = 10 ms
> flags = 0 (GRUB disabled)
>
> I see this:
> ------------------------------- HTOP
> ------------------------------------ 1
> [|||||||||||||||||||||92.5%] Tasks: 128, 259 thr; 14 running 2
> [|||||||||||||||||||||91.0%] Load average: 4.65 4.66 4.81 3
> [|||||||||||||||||||||92.5%] Uptime: 05:12:43 4
> [|||||||||||||||||||||92.5%] Mem[|||||||||||||||1.13G/3.78G]
> Swp[ 0K/3.90G]
>
> PID USER PRI NI VIRT RES SHR S CPU% MEM% TIME+ Command
> 16247 root -101 0 4204 632 564 R 32.4 0.0 2:10.35 d
> 16249 root -101 0 4204 624 556 R 32.4 0.0 2:09.80 d
> 16250 root -101 0 4204 728 660 R 32.4 0.0 2:09.58 d
> 16252 root -101 0 4204 676 608 R 32.4 0.0 2:09.08 d
> 16253 root -101 0 4204 636 568 R 32.4 0.0 2:08.85 d
> 16254 root -101 0 4204 732 664 R 32.4 0.0 2:08.62 d
> 16255 root -101 0 4204 620 556 R 32.4 0.0 2:08.40 d
> 16257 root -101 0 4204 708 640 R 32.4 0.0 2:07.98 d
> 16256 root -101 0 4204 624 560 R 32.4 0.0 2:08.18 d
> 16248 root -101 0 4204 680 612 R 33.0 0.0 2:10.15 d
> 16251 root -101 0 4204 676 608 R 33.0 0.0 2:09.34 d
> 16259 root 20 0 124M 4692 3120 R 1.1 0.1 0:02.82 htop
> 2191 bristot 20 0 649M 41312 32048 S 0.0 1.0 0:28.77
> gnome-ter ------------------------------- HTOP
> ------------------------------------
>
> All tasks are using +- the same amount of CPU time, a little bit more
> than 30%, as expected. However, if I enable GRUB in the same task set
> I get this:
>
> ------------------------------- HTOP
> ------------------------------------ 1
> [|||||||||||||||||||||93.8%] Tasks: 128, 260 thr; 15 running 2
> [|||||||||||||||||||||95.2%] Load average: 5.13 5.01 4.98 3
> [|||||||||||||||||||||93.3%] Uptime: 05:01:02 4
> [|||||||||||||||||||||96.4%] Mem[|||||||||||||||1.13G/3.78G]
> Swp[ 0K/3.90G]
>
> PID USER PRI NI VIRT RES SHR S CPU% MEM% TIME+ Command
> 14967 root -101 0 4204 628 564 R 45.8 0.0 1h07:49 g
> 14962 root -101 0 4204 728 660 R 45.8 0.0 1h05:06 g
> 14959 root -101 0 4204 680 612 R 45.2 0.0 1h07:29 g
> 14927 root -101 0 4204 624 556 R 44.6 0.0 1h04:30 g
> 14928 root -101 0 4204 656 588 R 31.1 0.0 47:37.21 g
> 14961 root -101 0 4204 684 616 R 31.1 0.0 47:19.75 g
> 14968 root -101 0 4204 636 568 R 31.1 0.0 46:27.36 g
> 14960 root -101 0 4204 684 616 R 23.8 0.0 37:31.06 g
> 14969 root -101 0 4204 684 616 R 23.8 0.0 38:11.50 g
> 14925 root -101 0 4204 636 568 R 23.8 0.0 37:34.88 g
> 14926 root -101 0 4204 684 616 R 23.8 0.0 38:27.37 g
> 16182 root 20 0 124M 3972 3212 R 0.6 0.1 0:00.23 htop
> 862 root 20 0 264M 5668 4832 S 0.6 0.1 0:03.30
> iio-sensor 2191 bristot 20 0 649M 41312 32048 S 0.0 1.0
> 0:27.62 gnome-term 588 root 20 0 257M 121M 120M S 0.0
> 3.1 0:13.53 systemd-jo ------------------------------- HTOP
> ------------------------------------
>
> Some tasks start to use more CPU time, while others seems to use less
> CPU than it was reserved for them. See the task 14926, it is using
> only 23.8 % of the CPU, which is less than its 10/30 reservation.
>
> I traced this task activation and noticed this:
>
> swapper 0 [003] 14968.332244: sched:sched_switch:
> swapper/3:0 [120] R ==> g:14926 [-1] g 14926 [003] 14968.339294:
> sched:sched_switch: g:14926 [-1] R ==> g:14960 [-1] runtime: 7050 us
> (14968.339294 - 14968.332244)
>
> period: 29997 us (14968.362241 - 14968.332244)
> swapper 0 [003] 14968.362241: sched:sched_switch:
> swapper/3:0 [120] R ==> g:14926 [-1] g 14926 [003] 14968.369294:
> sched:sched_switch: g:14926 [-1] R ==> g:14960 [-1] runtime: 0.007053
> us (14968.369294 = 14968.362241)
>
> period: 29994 us (14968.392235 - 14968.362241)
> swapper 0 [003] 14968.392235: sched:sched_switch:
> swapper/3:0 [120] R ==> g:14926 [-1] g 14926 [003] 14968.399301:
> sched:sched_switch: g:14926 [-1] R ==> g:14960 [-1] runtime: 7066 us
> (14968.399301 - 14968.392235)
>
> period: 30008 us (14968.422243 - 14968.392235)
> swapper 0 [003] 14968.422243: sched:sched_switch:
> swapper/3:0 [120] R ==> g:14926 [-1] g 14926 [003] 14968.429294:
> sched:sched_switch: g:14926 [-1] R ==> g:14960 [-1] runtime: 7051 us
> (14968.429294 - 14968.422243)
>
> period: 29995 us (14968.452238 - 14968.422243)
> swapper 0 [003] 14968.452238: sched:sched_switch:
> swapper/3:0 [120] R ==> g:14926 [-1] g 14926 [003] 14968.459293:
> sched:sched_switch: g:14926 [-1] R ==> g:14960 [-1] runtime: 7055 us
> (14968.459293 - 14968.452238)
>
> period: 30055 us (14968.482293 - 14968.452238)
> g 14925 [003] 14968.482293: sched:sched_switch:
> g:14925 [-1] R ==> g:14926 [-1] g 14926 [003] 14968.490293:
> sched:sched_switch: g:14926 [-1] R ==> g:14960 [-1] runtime: 8000 us
> (14968.490293 - 14968.482293)
>
> The task is using less CPU than it was reserved/guaranteed.
>
> After some debugging, it seems that in this case GRUB is also
> _reducing_ the runtime of the task by making the notion of consumed
> runtime be greater than the actual consumed runtime.
>
> You can see this with this code snip:
>
> ------------------- %<-------------------
> diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
> index 93ff400..1abb594 100644
> --- a/kernel/sched/deadline.c
> +++ b/kernel/sched/deadline.c
> @@ -823,9 +823,21 @@ static void update_curr_dl(struct rq *rq)
>
> sched_rt_avg_update(rq, delta_exec);
>
> - if (unlikely(dl_se->flags & SCHED_FLAG_RECLAIM))
> - delta_exec = grub_reclaim(delta_exec, rq);
> - dl_se->runtime -= delta_exec;
> + if (unlikely(dl_se->flags & SCHED_FLAG_RECLAIM)) {
> + u64 new_delta_exec;
> + new_delta_exec = grub_reclaim(delta_exec, rq);
> + if (new_delta_exec > delta_exec)
> + trace_printk("new delta exec (%llu) is
> greater than delta exec (%llu) by %llu\n",
> + new_delta_exec,
> + delta_exec,
> + (new_delta_exec -
> delta_exec));
> + dl_se->runtime -= new_delta_exec;
> + }
> + else {
> + dl_se->runtime -= delta_exec;
> + }
> +
> +
>
> throttle:
> if (dl_runtime_exceeded(dl_se) || dl_se->dl_yielded) {
> --------------------------- >% -------------
>
> It seems to be related to the "sched/deadline: do not reclaim the
> whole CPU bandwidth", because the trace_printk message I put starts
> to appear when we start to touch this limit, and the (new_delta_exec
> - delta_exec) seems to be somehow limited to the non_deadline_bw.
>
> Output with sysctl -w kernel.sched_rt_runtime_us=950000
> g-1984 [001] d.h1 1108.783349: update_curr_dl: new
> delta exec (1050043) is greater than delta exec (1000042) by 50001
> g-1983 [002] d.h1 1108.783349: update_curr_dl: new delta exec
> (1049974) is greater than delta exec (999976) by 49998 g-1981 [003]
> d.h1 1108.783350: update_curr_dl: new delta exec (1050054) is
> greater than delta exec (1000053) by 50001
>
> Output with sysctl -w kernel.sched_rt_runtime_us=900000
> g-1748 [001] d.h1 418.879815: update_curr_dl: new
> delta exec (1099995) is greater than delta exec (999996) by 99999
> g-1749 [002] d.h1 418.880815: update_curr_dl: new delta exec
> (1099986) is greater than delta exec (999988) by 99998 g-1748 [001]
> d.h1 418.880815: update_curr_dl: new delta exec (1099962) is
> greater than delta exec (999966) by 99996
>
> In the case of fewer tasks, this error appears just in the
> dispatch of a new task, stabilizing after some ms. But it
> does not stabilize when we are closer to the limit of the rt
> runtime.
>
> That is all I could find today. Am I missing something?
>
> [1] http://bristot.me/lkml/d.c
>
> -- Daniel

Next message: Michal Hocko: "Re: [PATCH] drivers/virt: use get_user_pages_unlocked()"
Previous message: Jason Gunthorpe: "Re: [tpmdd-devel] [PATCH RFC 0/4] RFC: in-kernel resource manager"
In reply to: Daniel Bristot de Oliveira: "Re: [RFC v4 0/6] CPU reclaiming for SCHED_DEADLINE"
Next in thread: luca abeni: "Re: [RFC v4 0/6] CPU reclaiming for SCHED_DEADLINE"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]