Re: [RFC][PATCH 08/10] sched/fair: Implement delayed dequeue

From: Luis Machado
Date: Mon Jun 03 2024 - 15:31:26 EST


On 5/29/24 23:50, Peter Zijlstra wrote:
> On Thu, May 02, 2024 at 11:26:14AM +0100, Luis Machado wrote:
>
>> Going through the code, my understanding is that the util_est functions seem to be getting
>> called correctly, and in the right order. That is, we first util_est_enqueue, then util_est_dequeue
>> and finally util_est_update. So the stats *should* be correct.
>>
>> On dequeuing (dequeue_task_fair), we immediately call util_est_dequeue, even for the case of
>> a DEQUEUE_DELAYED task, since we're no longer going to run the dequeue_delayed task for now, even
>> though it is still in the rq.
>>
>> We delay the util_est_update of dequeue_delayed tasks until a later time in dequeue_entities.
>>
>> Eventually the dequeue_delayed task will have its lag zeroed when it becomes
>> eligible again (requeue_delayed_entity), while still being on the rq. It will
>> then get dequeued/enqueued (requeued), and marked as a non-dequeue-delayed task.
>>
>> Next time we attempt to enqueue such a task (enqueue_task_fair), it will skip the ENQUEUE_DELAYED
>> block and call util_est_enqueue.
>>
>> Still, something seems to be signalling that util/load is high, and causing migration to the big cores.
>>
>> Maybe we're not decaying the util/load properly at some point, and inflated numbers start to happen.
>>
>> I'll continue investigating.

Apologies for not getting to this earlier, but I wanted to collect as much
data as possible so we can hopefully see what's going wrong with this
particular patch in the series.

(picking up from the previous questions on find_energy_efficient_cpu and
overutilized)

For the rate of calls to find_energy_efficient_cpu, I checked again and
I didn't see any significant differences there. With or without delayed
dequeue, I see almost the same number of calls, with the expected small
run-to-run variation.

As for the overutilized part, there is a bit of a change within
find_energy_efficient_cpu. When I first enable delayed dequeue via
debugfs (I boot the system with it disabled for a clean state),
I start to see a lot more cases of us bailing out of
find_energy_efficient_cpu due to the overutilized condition.

Strangely though, things seem to stabilize after a while, and the
rate of overutilized bail-outs within find_energy_efficient_cpu
goes back down to normal levels.
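
To be clear about what I'm counting: it's the early overutilized bail-out
near the top of find_energy_efficient_cpu. Paraphrasing from memory (the
exact shape may differ on this tree), the check is roughly:

	static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu)
	{
		struct root_domain *rd = this_rq()->rd;
		...
		rcu_read_lock();
		pd = rcu_dereference(rd->pd);
		if (!pd || READ_ONCE(rd->overutilized))
			goto unlock;	/* <- the bail-out I'm counting */
		...
	}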

Here are some numbers from a trace session:

Delayed dequeue off (clean state, 2 iterations):
------------------------------------------------

Calls to find_energy_efficient_cpu: 211087
Overutilized when in find_energy_efficient_cpu: 36224

Calls to find_energy_efficient_cpu: 205804
Overutilized when in find_energy_efficient_cpu: 35274


Delayed dequeue on (3 iterations):
------------------------------------------------

Calls to find_energy_efficient_cpu: 204079
Overutilized when in find_energy_efficient_cpu: 86966

Calls to find_energy_efficient_cpu: 207351
Overutilized when in find_energy_efficient_cpu: 111233

Calls to find_energy_efficient_cpu: 208556
Overutilized when in find_energy_efficient_cpu: 77986


Delayed dequeue off again (3 iterations):
------------------------------------------------

Calls to find_energy_efficient_cpu: 210396
Overutilized when in find_energy_efficient_cpu: 34108


Calls to find_energy_efficient_cpu: 210978
Overutilized when in find_energy_efficient_cpu: 33122

Calls to find_energy_efficient_cpu: 210181
Overutilized when in find_energy_efficient_cpu: 33060


Delayed dequeue on again (3 iterations):
------------------------------------------------

Calls to find_energy_efficient_cpu: 202932
Overutilized when in find_energy_efficient_cpu: 35060

Calls to find_energy_efficient_cpu: 206098
Overutilized when in find_energy_efficient_cpu: 57373

Calls to find_energy_efficient_cpu: 199444
Overutilized when in find_energy_efficient_cpu: 33959


On the matter of switched_from_fair() being called with a task
that has sched_delayed == 1, I can confirm I see some occurrences
of that. They are not that numerous, but do happen from time to
time in the trace data. I haven't investigated things in that
direction yet though.

But we do set sched_delayed = 0 for those tasks in there, so things
should go back to normal eventually, unless we fail to dequeue
the sched_delayed task properly for this case. A rough sketch of my
reading of that path is below.
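
Paraphrasing from memory (not an exact quote of the patch), the shape I
have in mind is:

	static void switched_from_fair(struct rq *rq, struct task_struct *p)
	{
		/*
		 * A delayed-dequeue task can still be on the rq when it
		 * switches away from fair; stop treating it as delayed.
		 */
		if (unlikely(p->se.sched_delayed))
			p->se.sched_delayed = 0;

		detach_task_cfs_rq(p);
	}

The open question being whether just clearing the flag is enough here, or
whether the entity also needs the rest of the delayed-dequeue fix-up (the
actual dequeue, lag handling, etc.) at that point.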

>
> So I re-read all this util_est stuff again this evening and I am
> confused :-) Please bear with me.
>
> So the old code does:
>
>   dequeue_task_fair()
>     util_est_dequeue();
>     // actual dequeue -- updates pelt
>     util_est_update();
>
>
> While the new code does:
>
>   dequeue_task_fair()
>     util_est_dequeue();
>     if (!dequeue())
>       return;
>     util_est_update();
>
>   delayed_dequeue:
>     util_est_update();
>
>

The above is my understanding as well.

> Specifically, we only call util_est_update() if/when we do the actual
> dequeue -- potentially at a later point in time. Because, as I argued in
> the comments, ttwu()'s ENQUEUE_DELAYED will not actually enqueue the
> task (since it is still enqueued) and therefore the update would be
> spurious.
>
> However!!, if we do dequeue, then we'll end up updating the EWMA with a
> potentially different task_util(p).
>
> And this is where the confusion comes in... this extra time on the runqueue
> would not be running and thus decrease util_avg; as such, task_util_est()
> should be lower than before and tasks should tend towards smaller cores,
> rather than larger cores as you seem to be seeing.

Right. I also expected the util_avg to decrease with more time on the
runqueue. And that might be happening...

>
> [ there is the 'dequeued + UTIL_EST_MARGIN < task_runnable()' escape
> clause, which we might be hitting... dunno ]
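
(On the escape clause: for reference, my reading of the relevant part of
util_est_update() -- paraphrased excerpt, not an exact quote:

	/* Skip the EWMA update if the task did not get all the CPU
	 * time it wanted, to avoid underestimating it: */
	if (dequeued + UTIL_EST_MARGIN < task_runnable(p))
		goto done;

	/*
	 * Otherwise fold the dequeue-time utilization into the EWMA:
	 *   ewma(t) = w * task_util(p) + (1 - w) * ewma(t-1), w = 1/4
	 */
	ewma <<= UTIL_EST_WEIGHT_SHIFT;
	ewma  -= last_ewma_diff;
	ewma >>= UTIL_EST_WEIGHT_SHIFT;

So if we're hitting that clause often, the EWMA would simply not move.)
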
>
> In any case, if you haven't tried it already, could I ask you to test
> the below (once you're back in the office)?
>
> Also, this doesn't really explain why things go sideways once you enable
> DELAY_DEQUEUE and then don't re-align if you disable it again. I mean,
> it should eventually age out the dodgy EWMA contributions and start
> working as expected.
>
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -7008,7 +7008,6 @@ static int dequeue_entities(struct rq *r
>                  SCHED_WARN_ON(p->on_rq != 1);
>
>                  /* Fix-up what dequeue_task_fair() skipped */
> -                util_est_update(&rq->cfs, p, task_sleep);
>                  hrtick_update(rq);
>
>                  /* Fix-up what block_task() skipped. */
> @@ -7028,13 +7027,11 @@ static bool dequeue_task_fair(struct rq
>  {
>          util_est_dequeue(&rq->cfs, p);
>
> -        if (dequeue_entities(rq, &p->se, flags) < 0)
> +        if (dequeue_entities(rq, &p->se, flags) < 0) {
> +                util_est_update(&rq->cfs, p, DEQUEUE_SLEEP);
>                  return false;
> +        }
>
> -        /*
> -         * It doesn't make sense to update util_est for the delayed dequeue
> -         * case where ttwu will make it appear the sleep never happened.
> -         */
>          util_est_update(&rq->cfs, p, flags & DEQUEUE_SLEEP);
>          hrtick_update(rq);
>          return true;

... as the above patch unfortunately doesn't improve things. util_est_update()
might be a red herring here.

I exchanged some information with Hongyan today, and he was a bit suspicious
of the uclamp behavior with the complete EEVDF series applied.

Checking the uclamp code, I see we have some refcounting tied to enqueuing/dequeuing
of tasks, and the uclamp values are organized in buckets.

Just for fun I added a few trace_printk's to uclamp_eff_value, uclamp_rq_inc_id
and uclamp_rq_dec_id; roughly what that looks like is sketched below.
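
Sketch of the local hack (not an exact diff; bucket and uc_se are the
existing locals in kernel/sched/core.c):

	/* in uclamp_rq_inc_id(): */
	bucket = &uc_rq->bucket[uc_se->bucket_id];
	bucket->tasks++;
	trace_printk("uclamp inc: comm=%s pid=%d clamp_id=%d bucket=%d tasks=%lu\n",
		     p->comm, p->pid, (int)clamp_id, (int)uc_se->bucket_id,
		     (unsigned long)bucket->tasks);

	/* ...and the matching print next to bucket->tasks-- in
	 * uclamp_rq_dec_id(). */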

Booting the system with delayed dequeue disabled and running the benchmark,
the uclamp bucket management looks pretty stable. Tasks get added to the
uclamp buckets but then get removed, and at the end of the benchmark the
buckets are (almost always) empty of tasks.

Enabling delayed dequeue, I can see the uclamp buckets slowly filling up
with tasks. At the end of the benchmark, some buckets still hold 30, 40
or 50 tasks; after another run, 80 to 100 tasks remain.

I suspect the refcounting might be going wrong somewhere due to delayed
dequeue tasks, but that's more of a guess right now; the sketch below is
how I plan to confirm it. Hopefully that is useful information. I'll
resume the investigation tomorrow.
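
Hypothetical debug helper (all names except the uclamp structures and
accessors are mine), to dump any non-empty buckets for a runqueue with
the rq lock held, so leftover tasks show up without tracing every
inc/dec:

	static void uclamp_debug_dump_rq(struct rq *rq)
	{
		enum uclamp_id clamp_id;
		int i;

		for_each_clamp_id(clamp_id) {
			struct uclamp_rq *uc_rq = &rq->uclamp[clamp_id];

			for (i = 0; i < UCLAMP_BUCKETS; i++) {
				struct uclamp_bucket *bucket = &uc_rq->bucket[i];

				if (bucket->tasks)
					trace_printk("cpu=%d clamp_id=%d bucket=%d tasks=%lu value=%lu\n",
						     cpu_of(rq), (int)clamp_id, i,
						     (unsigned long)bucket->tasks,
						     (unsigned long)bucket->value);
			}
		}
	}

If delayed-dequeue tasks leave an inc unpaired somewhere, this should show
buckets with non-zero task counts long after the benchmark is done, which
matches what the trace_printk's already suggest.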