Re: [RFC][PATCH 08/10] sched/fair: Implement delayed dequeue

From: Peter Zijlstra
Date: Wed May 29 2024 - 18:51:09 EST


On Thu, May 02, 2024 at 11:26:14AM +0100, Luis Machado wrote:

> Going through the code, my understanding is that the util_est functions seem to be getting
> called correctly, and in the right order. That is, we first util_est_enqueue, then util_est_dequeue
> and finally util_est_update. So the stats *should* be correct.
>
> On dequeuing (dequeue_task_fair), we immediately call util_est_dequeue, even for the case of
> a DEQUEUE_DELAYED task, since we're no longer going to run the dequeue_delayed task for now, even
> though it is still in the rq.
>
> We delay the util_est_update of dequeue_delayed tasks until a later time in dequeue_entities.
>
> Eventually the dequeue_delayed task will have its lag zeroed when it becomes eligible again,
> (requeue_delayed_entity) while still being in the rq. It will then get dequeued/enqueued (requeued),
> and marked as a non-dequeue-delayed task.
>
> Next time we attempt to enqueue such a task (enqueue_task_fair), it will skip the ENQUEUE_DELAYED
> block and call util_est_enqueue.
>
> Still, something seems to be signalling that util/load is high, and causing migration to the big cores.
>
> Maybe we're not decaying the util/load properly at some point, and inflated numbers start to happen.
>
> I'll continue investigating.

So I re-read all this util_est stuff again this evening and I am
confused :-) Please bear with me.

So the old code does:

  dequeue_task_fair()
    util_est_dequeue();
    // actual dequeue -- updates pelt
    util_est_update();

While the new code does:

  dequeue_task_fair()
    util_est_dequeue();
    if (!dequeue())
      return;
    util_est_update();

  delayed_dequeue:
    util_est_update();


Specifically, we only call util_est_update() if/when we do the actual
dequeue -- potentially at a later point in time. Because, as I argued in
the comments, ttwu()'s ENQUEUE_DELAYED will not actually enqueue the
task (since it is still enqueued) and therefore the update would be
spurious.

However!!, if we do dequeue, then we'll end up updating the EWMA with a
potentially different task_util(p).

And this is where the confusion comes... this extra time on the runqueue
is not spent running and thus decreases util_avg; as such, task_util_est()
should be lower than before and tasks should tend towards smaller cores,
rather than larger cores as you seem to be seeing.

[ there is the 'dequeued + UTIL_EST_MARGIN < task_runnable()' escape
clause, which we might be hitting... dunno ]

In any case, if you haven't tried it already, could I ask you to test
the below (once you're back in the office)?

Also, this doesn't really explain why things go sideways once you enable
DELAY_DEQUEUE and then don't re-align if you disable it again. I mean,
it should eventually age out the dodgy EWMA contributions and start
working as expected.

--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7008,7 +7008,6 @@ static int dequeue_entities(struct rq *r
 		SCHED_WARN_ON(p->on_rq != 1);
 
 		/* Fix-up what dequeue_task_fair() skipped */
-		util_est_update(&rq->cfs, p, task_sleep);
 		hrtick_update(rq);
 
 		/* Fix-up what block_task() skipped. */
@@ -7028,13 +7027,11 @@ static bool dequeue_task_fair(struct rq
 {
 	util_est_dequeue(&rq->cfs, p);
 
-	if (dequeue_entities(rq, &p->se, flags) < 0)
+	if (dequeue_entities(rq, &p->se, flags) < 0) {
+		util_est_update(&rq->cfs, p, DEQUEUE_SLEEP);
 		return false;
+	}
 
-	/*
-	 * It doesn't make sense to update util_est for the delayed dequeue
-	 * case where ttwu will make it appear the sleep never happened.
-	 */
 	util_est_update(&rq->cfs, p, flags & DEQUEUE_SLEEP);
 	hrtick_update(rq);
 	return true;