Re: [PATCH 0/7] sched/deadline: fix cpusets bandwidth accounting

From: Luca Abeni
Date: Thu Aug 24 2017 - 03:53:39 EST


On Wed, 23 Aug 2017 13:47:13 -0600
Mathieu Poirier <mathieu.poirier@xxxxxxxxxx> wrote:
> >> This is a renewed attempt at fixing a problem reported by Steve Rostedt [1]
> >> where DL bandwidth accounting is not recomputed after CPUset and CPUhotplug
> >> operations. When CPUhotplug and some CPUset manipulations take place, root
> >> domains are destroyed and new ones created, losing at the same time the DL
> >> accounting pertaining to utilisation.
> >
> > Thanks for looking at this longstanding issue! I am just back from
> > vacation; in the next few days I'll try your patches.
> > Do you have some kind of script for reproducing the issue
> > automatically? (I see that in the original email Steven described how
> > to reproduce it manually; I just wonder if anyone already scripted the
> > test).
>
> I didn't bother scripting it since it is so easy to do. I'm eager to
> see how things work out on your end.

OK, so I'll try to reproduce the issue manually as described in Steven's
original email; I'll run some tests as soon as I finish with the backlog
that accumulated during my vacation.

[...]
> >> OPEN ISSUE:
> >>
> >> Regardless of how we proceed (using existing CPUset list or new ones) we
> >> need to deal with DL tasks that span more than one root domain, something
> >> that will typically happen after a CPUset operation. For example, if we
> >> split the number of available CPUs on a system in two CPUsets and then turn
> >> off the 'sched_load_balance' flag on the parent CPUset, DL tasks in the
> >> parent CPUset will end up spanning two root domains.
> >>
> >> One way to deal with this is to prevent CPUset operations from happening
> >> when such a condition is detected, as enacted in this set.
> >
> > I think this is the simplest (if not only?) solution if we want to use
> > gEDF in each root domain.
>
> Global Earliest Deadline First? Is my interpretation correct?

Right. As far as I understand, the original SCHED_DEADLINE design is to
partition the CPUs into disjoint sets, and then use global EDF scheduling
on each of those sets (this guarantees bounded tardiness, and if
you run some additional admission tests in user space you can also
guarantee that every deadline is strictly respected).
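
To make the two guarantees concrete, here is a minimal user-space sketch
(Python, not kernel code). It assumes implicit-deadline tasks (deadline
equal to period), models each task as a hypothetical (runtime, period)
pair, and all function names are made up for illustration. The first
check is the utilization condition usually associated with bounded
tardiness under global EDF; the second is the classical
Goossens-Funk-Baruah utilization bound, used here only as one example of
an additional admission test that could be run in user space.

```python
from fractions import Fraction


def utilization(runtime, period):
    """Utilization of one task as an exact fraction."""
    return Fraction(runtime, period)


def bounded_tardiness_ok(tasks, n_cpus):
    """Total utilization must not exceed the CPUs in the set,
    and no single task may need more than one CPU."""
    utils = [utilization(r, p) for r, p in tasks]
    return all(u <= 1 for u in utils) and sum(utils) <= n_cpus


def gfb_hard_ok(tasks, n_cpus):
    """Goossens-Funk-Baruah bound: U_sum <= m - (m - 1) * U_max."""
    utils = [utilization(r, p) for r, p in tasks]
    if not utils:
        return True
    return sum(utils) <= n_cpus - (n_cpus - 1) * max(utils)


# Example: three tasks (runtime, period) on a 2-CPU set.
tasks = [(6, 10), (6, 10), (6, 10)]
print(bounded_tardiness_ok(tasks, 2), gfb_hard_ok(tasks, 2))
```

With these three tasks the total utilization is 1.8, which fits on two
CPUs (so tardiness is bounded), but it exceeds the hard bound of
2 - 1 * 0.6 = 1.4, so this test alone cannot guarantee every deadline.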


> >> Although simple
> >> this approach feels brittle and akin to a "whack-a-mole" game. A better
> >> and more reliable approach would be to teach the DL scheduler to deal with
> >> tasks that span multiple root domains, a serious and substantial
> >> undertaking.
> >>
> >> I am sending this as a starting point for discussion. I would be grateful
> >> if you could take the time to comment on the approach and most importantly
> >> provide input on how to deal with the open issue underlined above.
> >
> > I suspect that if we want to guarantee bounded tardiness then we have to
> > go for a solution similar to the one suggested by Tommaso some time ago
> > (if I remember correctly):
> >
> > if we want to create some "second level cpusets" inside a "parent
> > cpuset", allowing deadline tasks to be placed inside both the "parent
> > cpuset" and the "second level cpusets", then we have to subtract the
> > "second level cpusets" maximum utilizations from the "parent cpuset"
> > utilization.
> >
> > I am not sure how difficult it can be to implement this...
>
> Humm... I am missing some context here.

Or maybe I misunderstood the issue you were seeing (I am no expert on
cpusets). Is it related to hierarchies of cpusets (with one cpuset
contained inside another one)?
Can you describe how to reproduce the problematic situation?
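
In case it helps the discussion, the subtraction quoted above can be
sketched as follows. This is only a user-space model in Python under my
own assumptions: the names are hypothetical, capacities are expressed in
"CPUs' worth" of bandwidth, and nothing here reflects existing kernel
code.

```python
from fractions import Fraction


def parent_available(parent_capacity, child_max_utils):
    """Bandwidth left for deadline tasks placed directly in the parent
    cpuset, after reserving the maximum utilization of each
    second-level cpuset."""
    remaining = parent_capacity - sum(child_max_utils)
    if remaining < 0:
        raise ValueError("second-level cpusets exceed the parent's capacity")
    return remaining


def admit_in_parent(new_util, parent_used, parent_capacity, child_max_utils):
    """Accept a new parent-level task only if it fits in what is left."""
    available = parent_available(parent_capacity, child_max_utils)
    return parent_used + new_util <= available


# 4-CPU parent; two second-level cpusets capped at 1 and 3/2 CPUs' worth.
capacity = Fraction(4)
children = [Fraction(1), Fraction(3, 2)]
print(parent_available(capacity, children))
```

Here the parent is left with 3/2 of bandwidth for its own tasks, no
matter how much of their maximum the second-level cpusets actually use.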

> Nonetheless the approach I
> was contemplating was to repeat the current mathematics for all the
> root domains reachable from a task's p->cpus_allowed mask.

I think in the original SCHED_DEADLINE design there should be only one
root domain compatible with the task's affinity... If this does not
happen, I suspect it is a bug (Juri, can you confirm?).

My understanding is that with SCHED_DEADLINE cpusets should be used to
partition the system's CPUs into disjoint sets (and I think there is one
root domain for each of those disjoint sets). The task affinity
mask should then correspond to the CPUs composing the set in which the
task is executing.
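
A small sketch of the invariant I have in mind, again as a user-space
Python model rather than kernel code. Root domains are represented as
hypothetical sets of CPU numbers, and I read "correspond" as "be equal
to"; both are assumptions of mine.

```python
def matching_root_domains(affinity, root_domains):
    """Root-domain spans sharing at least one CPU with the affinity mask."""
    return [span for span in root_domains if affinity & span]


def affinity_is_consistent(affinity, root_domains):
    """True if the mask touches exactly one root domain and equals its span."""
    hits = matching_root_domains(affinity, root_domains)
    return len(hits) == 1 and affinity == hits[0]


# Four CPUs partitioned into two disjoint root domains.
root_domains = [frozenset({0, 1}), frozenset({2, 3})]
print(affinity_is_consistent(frozenset({0, 1}), root_domains))
print(affinity_is_consistent(frozenset({0, 1, 2, 3}), root_domains))
```

A mask such as {0, 1, 2, 3} intersects both spans, which is the "task
spanning more than one root domain" situation as I understand it.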


> As such we'd
> have the same acceptance test but repeated for more than one root
> domain. The time this takes can be an issue, but the real problem I see
> is related to the current DL code. It is geared around a single root
> domain and changing that means meddling in a lot of places. I had a
> prototype that was beginning to address that but decided to gather
> people's opinion before getting in too deep.

I still do not fully understand this (I got the impression that this is
related to hierarchies of cpusets, but I am not sure if this
understanding is correct). Maybe an example would help me to understand.



Thanks,
Luca