Re: [PATCH 0/7] sched/deadline: fix cpusets bandwidth accounting

From: luca abeni
Date: Fri Aug 25 2017 - 02:04:04 EST


Hi Mathieu,

On Thu, 24 Aug 2017 14:32:20 -0600
Mathieu Poirier <mathieu.poirier@xxxxxxxxxx> wrote:
[...]
> >> > if we want to create some "second level cpusets" inside a "parent
> >> > cpuset", allowing deadline tasks to be placed inside both the
> >> > "parent cpuset" and the "second level cpusets", then we have to
> >> > subtract the "second level cpusets" maximum utilizations from
> >> > the "parent cpuset" utilization.
> >> >
> >> > I am not sure how difficult it can be to implement this...
> >>
> >> Humm... I am missing some context here.
> >
> > Or maybe I misunderstood the issue you were seeing (I am no expert
> > on cpusets). Is it related to hierarchies of cpusets (with one
> > cpuset contained inside another one)?
>
> Having spent a lot of time in the CPUset code, I can understand the
> confusion.
>
> CPUsets allow creating a hierarchy of sets, _seemingly_ creating
> overlapping root domains. Fortunately that isn't the case -
> overlapping CPUsets are morphed together to create non-overlapping
> root domains. The magic happens in rebuild_sched_domains_locked() [1]
> where generate_sched_domains() [2] transforms any CPUset topology into
> disjoint domains.

Ok; thanks for explaining
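
Just to double check that I am reading the code correctly, the flow
seems to be roughly the following (a simplified sketch of what happens
in kernel/cgroup/cpuset.c, not the literal code, so please correct me
if I got it wrong):

	static void rebuild_sched_domains_locked(void)
	{
		struct sched_domain_attr *attr;
		cpumask_var_t *doms;
		int ndoms;

		/* Collapse the cpuset hierarchy into disjoint CPU masks */
		ndoms = generate_sched_domains(&doms, &attr);

		/* Rebuild the sched/root domains from those masks */
		partition_sched_domains(ndoms, doms, attr);
	}

So two non-overlapping cpusets with load balancing disabled at the top
level end up with one root domain each, as you describe.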

[...]
> root@linaro-developer:/sys/fs/cgroup/cpuset# mkdir set1 set2
> root@linaro-developer:/sys/fs/cgroup/cpuset# echo 0 > set1/cpuset.mems
> root@linaro-developer:/sys/fs/cgroup/cpuset# echo 0 > set2/cpuset.mems
> root@linaro-developer:/sys/fs/cgroup/cpuset# echo 0,1 > set1/cpuset.cpus
> root@linaro-developer:/sys/fs/cgroup/cpuset# echo 2,3 > set2/cpuset.cpus
> root@linaro-developer:/sys/fs/cgroup/cpuset# echo 0 > cpuset.sched_load_balance
> root@linaro-developer:/sys/fs/cgroup/cpuset#
>
> At this time runqueue0 and runqueue1 point to root domain A while
> runqueue2 and runqueue3 point to root domain B (something that can't
> be seen without adding more instrumentation).

Ok; up to here, everything is clear to me ;-)

> Newly created tasks can roam on all the CPUs available:
>
>
> root@linaro-developer:/home/linaro# ./burn &
> [1] 3973
> root@linaro-developer:/home/linaro# grep Cpus_allowed: /proc/3973/status
> Cpus_allowed:   f
> root@linaro-developer:/home/linaro#

This happens because the task is neither in set1 nor in set2, right? I
_think_ (but I am not sure; I did not design this part of
SCHED_DEADLINE) that the original idea was that in this situation
SCHED_DEADLINE tasks can be only in set1 or in set2 (SCHED_DEADLINE
tasks are not allowed to be in the "default" CPUset, in this setup).
Is this what one of your later patches enforces?
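
(For example, I would expect that moving the task into set1, with
something like

	root@linaro-developer:/sys/fs/cgroup/cpuset# echo 3973 > set1/tasks

would shrink its affinity mask to CPUs 0-1, so "Cpus_allowed" should
become "3" instead of "f", but I did not try this myself, so this is
just my expectation.)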


> The above demonstrates that even if we have two CPUsets, new tasks
> belong to the "default" CPUset and as such can use all the available
> CPUs.

I still have a doubt (probably showing all my ignorance about
CPUsets :)... In this situation, we have 3 CPUsets: "default",
set1, and set2... Is every one of these CPUsets associated with a
root domain (so that we have 3 root domains)? Or are only set1 and set2
associated with a root domain?


> Now let's make task 3973 a DL task:
>
> root@linaro-developer:/home/linaro# ./schedtool -E -t 900000:1000000 3973
> root@linaro-developer:/home/linaro# grep dl /proc/sched_debug
> dl_rq[0]:
> .dl_nr_running : 0
> .dl_nr_migratory : 0
> .dl_bw->bw : 996147
> .dl_bw->total_bw : 0 <------ Problem

Ok; I think I understand the problem, now...
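
Just to double check the numbers: if I read them correctly,

	0.95 * 2^20 = 996147.2  ->  .dl_bw->bw       = 996147
	0.90 * 2^20 = 943718.4  ->  .dl_bw->total_bw = 943718

so .dl_bw->bw is the default 95% limit, and the 943718 reported for
dl_rq[3] below is the task's 900000/1000000 utilization, which has been
charged to root domain B only, and not to root domain A.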


> dl_rq[3]:
> .dl_nr_running : 0
> .dl_nr_migratory : 0
> .dl_bw->bw : 996147
> .dl_bw->total_bw : 943718 <------ As expected
> root@linaro-developer:/home/linaro/jlelli#
>
> When task 3973 was promoted to a DL task it was running on either CPU2
> or CPU3. The acceptance test was done on root domain B and the task
> utilisation was added there as expected. But as pointed out above,
> task 3973 can still be scheduled on CPU0 and CPU1, and that is a
> problem since the utilisation hasn't been added there as well. The
> task is now spread over two root domains rather than being confined to
> a single one, as the DL code currently expects (note that there are
> many ways to reproduce this situation).

I think this is a bug, and the only reasonable solution is to allow the
task to become SCHED_DEADLINE only if it is in set1 or set2 (that is,
only if its affinity mask coincides exactly with the CPUs of the root
domain where the task utilization is added).
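
If I remember correctly, __sched_setscheduler() already tries to
enforce something like this; a minimal sketch of the kind of check I
mean (simplified and written from memory, so do not take it as the
exact mainline code):

	if (dl_policy(policy)) {
		cpumask_t *span = rq->rd->span;

		/*
		 * Refuse SCHED_DEADLINE if the task's affinity mask does
		 * not cover the whole root domain it is admitted into,
		 * or if that root domain has no deadline bandwidth.
		 */
		if (!cpumask_subset(span, &p->cpus_allowed) ||
		    rq->rd->dl_bw.bw == 0)
			return -EPERM;
	}

The problem in your example is that this check passes (the task's mask
"f" is a superset of root domain B's span), but nothing guarantees that
the mask is not also a superset of some other root domain's span.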


> In its current form the patchset prevents specific operations from
> being carried out if we recognise that a task could end up spanning
> more than a single root domain.

Good. I think this is the right way to go.


> But that will break as soon as we
> find a new way to create a DL task that spans multiple domains (and I
> may not have caught them all either).

So, we need to fix that too ;-)


> Another way to fix this is to do an acceptance test on all the root
> domains of a task.

I think we need to understand what the intended behaviour of
SCHED_DEADLINE is in this situation... My understanding is that
SCHED_DEADLINE is designed to do global EDF scheduling inside an
"isolated" CPUset; a SCHED_DEADLINE task spanning multiple domains would
break some SCHED_DEADLINE properties (from the scheduling theory
point of view) in some interesting ways...

I am not saying we should not do this, but I believe that allowing
tasks to span multiple domains requires some redesign of the admission
test and migration mechanisms in SCHED_DEADLINE.
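
Just to make the concern more concrete, the admission test alone would
have to become something like the following (purely hypothetical
pseudo-code: for_each_root_domain() does not exist today, and I am
writing the __dl_overflow() call from memory):

	/* new_bw = to_ratio(attr->sched_period, attr->sched_runtime) */
	for_each_root_domain(rd) {
		if (!cpumask_intersects(rd->span, &p->cpus_allowed))
			continue;

		/* Reject if any overlapping root domain has no room */
		if (__dl_overflow(&rd->dl_bw, cpumask_weight(rd->span),
				  0, new_bw))
			return -EBUSY;
	}
	/* ...and on success, charge new_bw to all of those domains */

and then the migration code would have to keep all of those dl_bw
accounts consistent when the task moves, which is where I expect most
of the complexity to be.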

I think this is related to the "generic affinities" issue that Peter
mentioned some time ago.


> So above we'd run the acceptance test on root
> domains A and B before promoting the task. Of course we'd also have to
> add the utilisation of that task to both root domains. Although simple
> in principle, this goes to the core of the DL scheduler and touches
> pretty much every aspect of it, something I'm reluctant to embark on.

I see... So, the "default" CPUset does not have any root domain
associated with it? If it had, we could just subtract the maximum
utilizations of set1 and set2 from it when creating the root domains of
set1 and set2.
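
As a made-up example of what I mean (the numbers are arbitrary): with
the default 95% limit, if set1 were guaranteed a maximum utilization of
0.40 and set2 a maximum of 0.30, then

	left for the "default" set: 0.95 - 0.40 - 0.30 = 0.25

would be the deadline bandwidth still admissible in the "parent"
CPUset.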



Thanks,
Luca

>
> [1]. http://elixir.free-electrons.com/linux/latest/source/kernel/cgroup/cpuset.c#L814
> [2]. http://elixir.free-electrons.com/linux/latest/source/kernel/cgroup/cpuset.c#L634
> [3]. https://github.com/jlelli/tests.git
> [4]. https://github.com/jlelli/schedtool-dl.git
> [5]. https://lkml.org/lkml/2016/2/3/966
>
> >
> >> Nonetheless the approach I
> >> was contemplating was to repeat the current mathematics on all the
> >> root domains accessible through the task's p->cpus_allowed mask.
> >
> > I think in the original SCHED_DEADLINE design there should be only
> > one root domain compatible with the task's affinity... If this does
> > not happen, I suspect it is a bug (Juri, can you confirm?).
> >
> > My understanding is that with SCHED_DEADLINE cpusets should be used
> > to partition the system's CPUs into disjoint sets (and I think there
> > is one root domain for each one of those disjoint sets). And the
> > task affinity mask should correspond with the CPUs composing the
> > set in which the task is executing.
> >
> >
> >> As such we'd
> >> have the same acceptance test but repeated for more than one root
> >> domain. The time to do that can be an issue, but the real problem I
> >> see is related to the current DL code. It is geared around a
> >> single root domain and changing that means meddling in a lot of
> >> places. I had a prototype that was beginning to address that but
> >> decided to gather people's opinion before getting in too deep.
> >
> > I still do not fully understand this (I got the impression that
> > this is related to hierarchies of cpusets, but I am not sure if this
> > understanding is correct). Maybe an example would help me to
> > understand.
>
> The above should say it all - please get back to me if I haven't
> expressed myself clearly.
>
> >
> >
> >
> > Thanks,
> > Luca