Re: [PATCH] sched/deadline: Fix priority inheritance with multiple scheduling classes

From: Daniel Bristot de Oliveira
Date: Thu Nov 05 2020 - 11:34:00 EST


On 11/5/20 5:12 PM, Juri Lelli wrote:
> Hi Valentin,
>
> On 05/11/20 15:49, Valentin Schneider wrote:
>>
>> Hi Juri,
>>
>> On 05/11/20 07:50, Juri Lelli wrote:
>>> He also provided a simple reproducer creating the situation below:
>>>
>>> The execution order of the locking steps is the following
>>> (N1 and N2 are non-deadline tasks; D1 is a deadline task; M1 and M2
>>> are mutexes enabled with priority inheritance.)
>>>
>>> Time moves forward as this timeline goes down:
>>>
>>> N1               N2               D1
>>> |                |                |
>>> |                |                |
>>> Lock(M1)         |                |
>>> |                |                |
>>> |                Lock(M2)         |
>>> |                |                |
>>> |                |                Lock(M2)
>>> |                |                |
>>> |                Lock(M1)         |
>>> |                (!!bug triggered!)
>>>
>>> Daniel reported a similar situation as well, by just letting ksoftirqd
>>> run as DEADLINE (and eventually block on a mutex).
>>>
>>> The problem is that boosted entities (Priority Inheritance) use the
>>> static DEADLINE parameters of the top priority waiter. However, there
>>> might be cases where the top waiter is a non-DEADLINE entity that is
>>> currently boosted by a DEADLINE entity from a different lock chain
>>> (i.e., nested priority chains involving entities of non-DEADLINE
>>> classes). In this case, the top waiter's static DEADLINE parameters
>>> could be null (initialized to 0 at fork()) and replenish_dl_entity()
>>> would hit a BUG().
>>>
>>
>> IIUC, rt_mutex_get_top_task(N1) == N2, and N2->dl.deadline == 0, which
>> makes enqueue_task_dl() unhappy. And that happens because, unlike
>> p->prio, DL parameters aren't properly propagated through the chain(s).
>
> Yep. Replenishment as well, as it's going to find dl_runtime == 0.
>
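To make the failure mode concrete: the check that fires sits at the top
of replenish_dl_entity(). A rough sketch from kernel/sched/deadline.c as
I remember it, so take the exact shape with a grain of salt:

	static void replenish_dl_entity(struct sched_dl_entity *dl_se,
					struct sched_dl_entity *pi_se)
	{
		struct dl_rq *dl_rq = dl_rq_of_se(dl_se);
		struct rq *rq = rq_of_dl_rq(dl_rq);

		/*
		 * A boosted non-DEADLINE top waiter still carries the
		 * parameters zeroed at fork(), so dl_runtime == 0 here
		 * and the assertion trips.
		 */
		BUG_ON(pi_se->dl_runtime <= 0);
		...
	}
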
>>> Fix this by keeping track of the original donor and using its parameters
>>> when a task is boosted.
>>>
>>> Reported-by: Glenn Elliott <glenn@xxxxxxxxxxx>
>>> Reported-by: Daniel Bristot de Oliveira <bristot@xxxxxxxxxx>
>>> Signed-off-by: Juri Lelli <juri.lelli@xxxxxxxxxx>
>>>
>>> ---
>>>
>>> This is actually a v2 attempt (didn't keep $SUBJECT since it's quite
>>> different than v1 [1]) to fix this problem.
>>>
>>> v1 was admittedly pretty ugly. Hope this looks slightly better (even
>>> though there is of course overhead associated with the additional
>>> pointer).
>>>
>>> Again, the proper way to fix this is by proxy execution. But I don't
>>> think we are there yet, and this problem needs a quick band-aid.
>>>
>>> One could probably also complicate the present approach and actually
>>> perform accounting using the donor's dynamic parameters, but I fear it
>>> would be of little benefit, since it would still bring all the
>>> problems associated with affinities. So, I'd propose we try to fix all
>>> this properly with proxy and just avoid crashes in the meantime.
>>>
>>
>> For my own sake, what affinity problems are you thinking of?
>>
>> With proxy exec we have this "funny" dance of shoving the entire blocked-on
>> chain on a single runqueue to get the right selection out of
>> pick_next_task(), and that needs to deal with affinity (i.e. move the task
>> back to a sensible rq once it becomes runnable).
>>
>> With the current PI, the waiting tasks are blocked and enqueued in the
>> pi_waiters tree, so as I see it affinity shouldn't matter; what am I
>> missing / not seeing? Is that related to bandwidth handling?
>
> Think we might break admission control checks if the donor and the
> boosted task are, for example, on different exclusive sets of CPUs.
> Guess that is a problem with proxy as well, though. As said in the
> comment above, this is unfortunately not much more than a band-aid.
> Hoping we can buy us some time and fix it properly with proxy.

I agree with Juri that the current approach is known to be broken,
and that proxy execution seems to be the mechanism to address these
problems. However, that will take some time.

Meanwhile, the patch Juri proposes fixes the problem in the current
mechanism - using the same approach (and breaking in a known way :D).

Properly handling priority inversion across disjoint sets of CPUs
(e.g., the donor admitted on one exclusive cpuset while the boosted
task runs on another, so the inherited bandwidth is consumed on CPUs
where it was never admitted) will also be an issue with proxy
execution. But that is an even more complex topic :-(.

So, IMHO, Juri's patch works well to avoid the crash, making the
system behave as we expect (even if we know that we cannot expect
too much).
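
For readers following along, the subtle point in the patch (as I read
it) is that the donor pointer is copied from the donor's own pi_se,
rather than taken as &pi_task->dl, so nested chains collapse to the
original DEADLINE entity:

	/* In rt_mutex_setprio(), when p is boosted to DEADLINE: */
	p->dl.dl_boosted = 1;
	/*
	 * pi_task may itself be a boosted non-DEADLINE task (N2 in the
	 * reproducer above) with zeroed static DL parameters; following
	 * pi_task->dl.pi_se reaches the original donor (D1), whose
	 * parameters are valid.
	 */
	p->dl.pi_se = pi_task->dl.pi_se;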

>
>>> [1] 20191112075056.19971-1-juri.lelli@xxxxxxxxxx
>>> ---
>>>  include/linux/sched.h   |  9 +++++
>>>  kernel/sched/core.c     | 13 ++++++--
>>>  kernel/sched/deadline.c | 74 ++++++++++++++++++++---------------------
>>>  3 files changed, 56 insertions(+), 40 deletions(-)
>>>
>>> diff --git a/include/linux/sched.h b/include/linux/sched.h
>>> index 063cd120b459..db29ab492181 100644
>>> --- a/include/linux/sched.h
>>> +++ b/include/linux/sched.h
>>> @@ -571,6 +571,15 @@ struct sched_dl_entity {
>>>  	 * time.
>>>  	 */
>>>  	struct hrtimer inactive_timer;
>>> +
>>> +#ifdef CONFIG_RT_MUTEXES
>>> +	/*
>>> +	 * Priority Inheritance. When a DEADLINE scheduling entity is
>>> +	 * boosted, pi_se points to the donor; otherwise it points to the
>>> +	 * dl_se it belongs to (the original one/itself).
>>> +	 */
>>> +	struct sched_dl_entity *pi_se;
>>> +#endif
>>>  };
>>>
>>> #ifdef CONFIG_UCLAMP_TASK
>>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>>> index 6f533bb7d3b9..e10aba6c363d 100644
>>> --- a/kernel/sched/core.c
>>> +++ b/kernel/sched/core.c
>>> @@ -4908,19 +4908,26 @@ void rt_mutex_setprio(struct task_struct *p, struct task_struct *pi_task)
>>>  		    (pi_task && dl_prio(pi_task->prio) &&
>>>  		     dl_entity_preempt(&pi_task->dl, &p->dl))) {
>>>  			p->dl.dl_boosted = 1;
>>> +			p->dl.pi_se = pi_task->dl.pi_se;
>>>  			queue_flag |= ENQUEUE_REPLENISH;
>>> -		} else
>>> +		} else {
>>>  			p->dl.dl_boosted = 0;
>>> +			p->dl.pi_se = &p->dl;
>>> +		}
>>>  		p->sched_class = &dl_sched_class;
>>>  	} else if (rt_prio(prio)) {
>>> -		if (dl_prio(oldprio))
>>> +		if (dl_prio(oldprio)) {
>>>  			p->dl.dl_boosted = 0;
>>> +			p->dl.pi_se = &p->dl;
>>> +		}
>>>  		if (oldprio < prio)
>>>  			queue_flag |= ENQUEUE_HEAD;
>>>  		p->sched_class = &rt_sched_class;
>>>  	} else {
>>> -		if (dl_prio(oldprio))
>>> +		if (dl_prio(oldprio)) {
>>>  			p->dl.dl_boosted = 0;
>>> +			p->dl.pi_se = &p->dl;
>>> +		}
>>
>> With this change, do we still need sched_dl_entity.dl_boosted? AIUI this
>> could become
>>
>> bool is_dl_boosted(struct sched_dl_entity *dl_se)
>> {
>> 	return pi_of(dl_se) != dl_se;
>> }
>
> Makes sense to me. I'll add this change as a separate patch if the rest
> makes sense to people as well. :-)

+1
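
Something along these lines, I'd imagine (untested sketch; the
!CONFIG_RT_MUTEXES stubs are my assumption, since pi_se only exists
under that config):

	#ifdef CONFIG_RT_MUTEXES
	static inline struct sched_dl_entity *pi_of(struct sched_dl_entity *dl_se)
	{
		return dl_se->pi_se;
	}

	static inline bool is_dl_boosted(struct sched_dl_entity *dl_se)
	{
		return pi_of(dl_se) != dl_se;
	}
	#else
	static inline struct sched_dl_entity *pi_of(struct sched_dl_entity *dl_se)
	{
		return dl_se;
	}

	static inline bool is_dl_boosted(struct sched_dl_entity *dl_se)
	{
		return false;
	}
	#endif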

-- Daniel

>
> Thanks for the quick review!
>
> Best,
> Juri
>