Re: [RFC][PATCH 08/10] sched/fair: Implement delayed dequeue

From: Luis Machado
Date: Tue Jun 04 2024 - 10:24:39 EST


On 6/4/24 11:11, Peter Zijlstra wrote:
> On Mon, Jun 03, 2024 at 08:30:43PM +0100, Luis Machado wrote:
>
>> Exchanging some information with Hongyan today, he was a bit suspicious of the uclamp
>> behavior with the eevdf complete series applied.
>>
>> Checking the uclamp code, I see we have some refcounting tied to enqueuing/dequeuing
>> of tasks, and the uclamp values are organized in buckets.
>>
>> Just for fun I added a few trace_printk's in uclamp_eff_value, uclamp_rq_inc_id and
>> uclamp_rq_dec_id.
>>
>> Booting up the system with delayed_dequeue disabled and running the benchmark, I
>> see the uclamp bucket management pretty stable. Tasks get added to the uclamp
>> buckets but then get removed. At the end of the benchmark, the uclamp buckets
>> are (almost always) clean of tasks.
>>
>> Enabling delayed dequeue, I can see the uclamp buckets slowly filling up with
>> tasks. At the end of the benchmark, I see uclamp buckets with 30, 40 or 50
>> tasks still. If I do another run, I can see 80, 100 tasks still.
>>
>> I suspect refcounting might be going wrong somewhere due to delayed dequeue
>> tasks, but that's more of a guess right now. Hopefully that is useful
>> information. I'll resume investigation tomorrow.
>
> Thank you both!!
>
> Does the below help?
>
> Note how dequeue_task() does uclamp_rq_dec() unconditionally, which is
> then not balanced in the case below.
>
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -3664,6 +3664,7 @@ static int ttwu_runnable(struct task_str
> /* mustn't run a delayed task */
> SCHED_WARN_ON(task_on_cpu(rq, p));
> enqueue_task(rq, p, ENQUEUE_DELAYED);
> + uclamp_rq_inc(rq, p);
> }
> if (!task_on_cpu(rq, p)) {
> /*

As Hongyan pointed out in a separate message, the above makes things
worse, as we end up with even more leftover tasks in the uclamp
buckets.

I'm trying a fix in kernel/sched/core.c:enqueue_task that only
calls uclamp_rq_inc if the task is not sched_delayed, so:

- uclamp_rq_inc(rq, p);
+ if (!p->se.sched_delayed)
+ uclamp_rq_inc(rq, p);

I'm not entirely sure it is correct, but it seems to fix things,
but I'm still running some tests.

With the current code, given uclamp_rq_inc and uclamp_rq_dec get
called in enqueue_task and dequeue_task, the additional enqueue_task
call from ttwu_runnable for a delayed_dequeue task may do an additional
unconditional call to uclamp_rq_inc, no?