Re: [PATCH 10/24] sched/uclamg: Handle delayed dequeue

From: Luis Machado
Date: Thu Sep 12 2024 - 08:59:18 EST


On 9/11/24 10:10, Mike Galbraith wrote:
> On Wed, 2024-09-11 at 10:45 +0200, Peter Zijlstra wrote:
>> On Wed, Sep 11, 2024 at 09:35:16AM +0100, Luis Machado wrote:
>>>>
>>>> I'm assuming that removing the usage sites restores function?
>>>
>>> It does restore function if we remove the usage.
>>>
>>> From an initial look:
>>>
>>> cat /sys/kernel/debug/sched/debug | grep -i delay                                                                                                                                                                                                                            
>>>   .h_nr_delayed                  : -4
>>>   .h_nr_delayed                  : -6
>>>   .h_nr_delayed                  : -1
>>>   .h_nr_delayed                  : -6
>>>   .h_nr_delayed                  : -1
>>>   .h_nr_delayed                  : -1
>>>   .h_nr_delayed                  : -5
>>>   .h_nr_delayed                  : -6
>>>
>>> So probably an unexpected decrement or lack of an increment somewhere.
>>
>> Yeah, that's buggered. Ok, I'll go rebase sched/core and take this patch
>> out. I'll see if I can reproduce that.
>
> Hm, would be interesting to know how the heck he's triggering that.
>
> My x86_64 box refuses to produce any such artifacts with anything I've
> tossed at it, including full LTP with enterprise RT and !RT configs,
> both in master and my local SLE15-SP7 branch. Hohum.
>
> -Mike

Ok, I seem to have narrowed this down to scheduler class switching. In particular
switched_from_fair.

Valentin's patch (75b6499024a6c1a4ef0288f280534a5c54269076
sched/fair: Properly deactivate sched_delayed task upon class change) introduced
finish_delayed_dequeue_entity, which takes care of cleaning up the state of delayed-dequeue
tasks during class change. Things work fine (minus delayed task accounting) up to this point.

When Peter introduced his patch to do h_nr_delayed accounting, we modified
finish_delayed_dequeue_entity to also call clear_delayed, instead of simply
zeroing se->sched_delayed.

The call to clear_delayed decrements the rq's h_nr_delayed, and it gets used elsewhere
to cleanup the state of delayed-dequeue tasks, in order to share some common code.

With that said, my testing on Android shows that when we hit switched_from_fair during
switching sched classes (due to some RT task), we're in a state where...

1 - We already called into dequeue_entities for this delayed task.
2 - We tested true for the !task_sleep && !task_delayed condition.
3 - se->sched_delayed is true, so h_nr_delayed == 1.
4 - We carry on executing the rest of dequeue_entities and decrement the rq's h_nr_running by 1.

In switched_from_fair, after the above events, we call into finish_delayed_dequeue_entity -> clear_delayed
and do yet another decrement to the rq's h_nr_delayed, now potentially making it negative. As
a consequence, we probably misuse the negative value and adjust the frequencies incorrectly. I
think this is the issue I'm seeing.

It is worth pointing out that even with the Android setup, things only go bad when there is enough
competition and switching of classes (lots of screen events etc).

My suggestion of a fix (below), still under testing, is to inline the delayed-dequeue and the lag zeroing
cleanup within switched_from_fair instead of calling finish_delayed_dequeue_entity. Or maybe
drop finish_delayed_dequeue_entity and inline its contents into its callers.

The rest of Peter's patch introducing h_nr_delayed seems OK as far as I could test.


diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index f993ac282a83..f8df2f8d2e11 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -13168,7 +13168,9 @@ static void switched_from_fair(struct rq *rq, struct task_struct *p)
* related to sched_delayed being true and that wasn't done
* due to the generic dequeue not using DEQUEUE_DELAYED.
*/
- finish_delayed_dequeue_entity(&p->se);
+ p->se.sched_delayed = 0;
+ if (sched_feat(DELAY_ZERO) && p->se.vlag > 0)
+ p->se.vlag = 0;
p->se.rel_deadline = 0;
__block_task(rq, p);
}