Re: [RFC][PATCH 08/10] sched/fair: Implement delayed dequeue

From: Hongyan Xia
Date: Tue Jun 04 2024 - 10:54:05 EST


On 04/06/2024 15:23, Luis Machado wrote:
On 6/4/24 11:11, Peter Zijlstra wrote:
On Mon, Jun 03, 2024 at 08:30:43PM +0100, Luis Machado wrote:

Exchanging some information with Hongyan today, he was a bit suspicious of the uclamp
behavior with the eevdf complete series applied.

Checking the uclamp code, I see we have some refcounting tied to enqueuing/dequeuing
of tasks, and the uclamp values are organized in buckets.

Just for fun I added a few trace_printk's in uclamp_eff_value, uclamp_rq_inc_id and
uclamp_rq_dec_id.

Booting up the system with delayed_dequeue disabled and running the benchmark, I
see the uclamp bucket management pretty stable. Tasks get added to the uclamp
buckets but then get removed. At the end of the benchmark, the uclamp buckets
are (almost always) clean of tasks.

Enabling delayed dequeue, I can see the uclamp buckets slowly filling up with
tasks. At the end of the benchmark, I see uclamp buckets with 30, 40 or 50
tasks still. If I do another run, I can see 80, 100 tasks still.

I suspect refcounting might be going wrong somewhere due to delayed dequeue
tasks, but that's more of a guess right now. Hopefully that is useful
information. I'll resume investigation tomorrow.

Thank you both!!

Does the below help?

Note how dequeue_task() does uclamp_rq_dec() unconditionally, which is
then not balanced in the case below.

--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3664,6 +3664,7 @@ static int ttwu_runnable(struct task_str
/* mustn't run a delayed task */
SCHED_WARN_ON(task_on_cpu(rq, p));
enqueue_task(rq, p, ENQUEUE_DELAYED);
+ uclamp_rq_inc(rq, p);
}
if (!task_on_cpu(rq, p)) {
/*

As Hongyan pointed out in a separate message, the above makes things
worse, as we end up with even more leftover tasks in the uclamp
buckets.

I'm trying a fix in kernel/sched/core.c:enqueue_task that only
calls uclamp_rq_inc if the task is not sched_delayed, so:

- uclamp_rq_inc(rq, p);
+ if (!p->se.sched_delayed)
+ uclamp_rq_inc(rq, p);

I'm not entirely sure it is correct, but it seems to fix things,
but I'm still running some tests.

This seems to so far not trigger any WARN or bad behavior in my tests.


With the current code, given uclamp_rq_inc and uclamp_rq_dec get
called in enqueue_task and dequeue_task, the additional enqueue_task
call from ttwu_runnable for a delayed_dequeue task may do an additional
unconditional call to uclamp_rq_inc, no?