Re: [PATCH 0/7] sched/deadline: fix cpusets bandwidth accounting

From: Mathieu Poirier
Date: Wed Aug 23 2017 - 15:47:22 EST


On 22 August 2017 at 06:21, Luca Abeni <luca.abeni@xxxxxxxxxxxxxxx> wrote:
> Hi Mathieu,

Good day to you,

>
> On Wed, 16 Aug 2017 15:20:36 -0600
> Mathieu Poirier <mathieu.poirier@xxxxxxxxxx> wrote:
>
>> This is a renewed attempt at fixing a problem reported by Steve Rostedt [1]
>> where DL bandwidth accounting is not recomputed after CPUset and CPUhotplug
>> operations. When CPUhotplug and some CPUset manipulations take place, root
>> domains are destroyed and new ones created, losing at the same time the DL
>> accounting pertaining to utilisation.
>
> Thanks for looking at this longstanding issue! I am just back from
> vacation; in the next few days I'll try your patches.
> Do you have some kind of script for reproducing the issue
> automatically? (I see that in the original email Steven described how
> to reproduce it manually; I just wonder if anyone has already scripted
> the test).

I didn't bother scripting it since it is so easy to do. I'm eager to
see how things work out on your end.

>
>> An earlier attempt by Juri [2] used the scheduling classes' rq_online() and
>> rq_offline() methods, something that highlighted a problem with sleeping
>> DL tasks. The email thread that followed envisioned creating a list of
>> sleeping tasks to circle through when recomputing DL accounting.
>>
>> In this set the problem is addressed by relying on the existing list of
>> tasks (sleeping or not) already maintained by CPUsets. When CPUset or
>> CPUhotplug operations have completed, we circle through the list of tasks
>> maintained by each CPUset looking for DL tasks. When a DL task is found,
>> its utilisation is added to the root domain it pertains to by way of its
>> runqueue.
>>
>> The advantage of proceeding this way is that recomputing the DL accounting
>> is done the same way for both active and inactive tasks, along with
>> guaranteeing that DL accounting for tasks ends up in the correct root
>> domain regardless of the CPUset topology. The disadvantage is that
>> circling through all the tasks in a CPUset can be time consuming. The
>> counterargument is that both CPUset and CPUhotplug operations are time
>> consuming in the first place.
>
> I do not know the cpuset code too much, but I agree that your approach
> looks better than creating an additional list for blocked deadline
> tasks.
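
To make the discussion concrete, the crux of the set boils down to
something like the following. This is a simplified sketch rather than
the literal patch (the helper name is made up, and locking against
task migration is elided), but it shows the walk over a CPUset's
tasks and the re-addition of DL bandwidth to the root domain:

static void cpuset_update_dl_accounting(struct cpuset *cs)
{
        struct css_task_iter it;
        struct task_struct *task;

        css_task_iter_start(&cs->css, &it);
        while ((task = css_task_iter_next(&it))) {
                struct rq *rq;

                if (!dl_task(task))
                        continue;

                rq = task_rq(task);
                /* credit the task's bandwidth to the rq's root domain */
                raw_spin_lock(&rq->rd->dl_bw.lock);
                __dl_add(&rq->rd->dl_bw, task->dl.dl_bw);
                raw_spin_unlock(&rq->rd->dl_bw.lock);
        }
        css_task_iter_end(&it);
}
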
>
>
>> OPEN ISSUE:
>>
>> Regardless of how we proceed (using the existing CPUset lists or new ones)
>> we need to deal with DL tasks that span more than one root domain,
>> something that will typically happen after a CPUset operation. For
>> example, if we split the available CPUs on a system into two CPUsets and
>> then turn off the 'sched_load_balance' flag on the parent CPUset, DL
>> tasks in the parent CPUset will end up spanning two root domains.
>>
>> One way to deal with this is to prevent CPUset operations from happening
>> when such condition is detected, as enacted in this set.
>
> I think this is the simplest (if not only?) solution if we want to use
> gEDF in each root domain.

Global Earliest Deadline First? Is my interpretation correct?

>
>> Although simple,
>> this approach feels brittle and akin to a "whack-a-mole" game. A better
>> and more reliable approach would be to teach the DL scheduler to deal with
>> tasks that span multiple root domains, a serious and substantial
>> undertaking.
>>
>> I am sending this as a starting point for discussion. I would be grateful
>> if you could take the time to comment on the approach and most importantly
>> provide input on how to deal with the open issue underlined above.
>
> I suspect that if we want to guarantee bounded tardiness then we have to
> go for a solution similar to the one suggested by Tommaso some time ago
> (if I remember correctly):
>
> if we want to create some "second level cpusets" inside a "parent
> cpuset", allowing deadline tasks to be placed inside both the "parent
> cpuset" and the "second level cpusets", then we have to subtract the
> "second level cpusets" maximum utilizations from the "parent cpuset"
> utilization.
>
> I am not sure how difficult it can be to implement this...

Humm... I am missing some context here. Nonetheless, the approach I
was contemplating was to apply the current mathematics to all the
root domains reachable from a task's p->cpus_allowed mask. As such
we'd have the same acceptance test, simply repeated on more than one
root domain. Doing that could be time consuming, but the real problem
I see is the current DL code itself: it is geared around a single
root domain, and changing that means meddling in a lot of places. I
had a prototype that was beginning to address that but decided to
gather people's opinions before getting in too deep.
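
For illustration, what I had in mind is conceptually something like
this (purely a sketch: the function name is invented, and a real
version would need to avoid testing the same root domain twice):

/*
 * Conceptual sketch: repeat the existing admission test on every
 * root domain that intersects the task's allowed CPUs.
 */
static int dl_task_admit_all_rds(struct task_struct *p, u64 new_bw)
{
        int cpu;

        for_each_cpu(cpu, &p->cpus_allowed) {
                struct root_domain *rd = cpu_rq(cpu)->rd;
                struct dl_bw *dl_b = &rd->dl_bw;
                int cpus = cpumask_weight(rd->span);

                raw_spin_lock(&dl_b->lock);
                if (__dl_overflow(dl_b, cpus, 0, new_bw)) {
                        raw_spin_unlock(&dl_b->lock);
                        return -EBUSY;
                }
                raw_spin_unlock(&dl_b->lock);
        }
        return 0;
}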

>
>
> If, instead, we want to guarantee that all deadlines are respected, then
> we need to have a look at Brandenburg's paper on arbitrary affinities:
> https://people.mpi-sws.org/~bbb/papers/pdf/rtsj14.pdf
>

Ouch, that's an extended read...

>
> Thanks,
> Luca