Re: [PATCH 10/24] sched/uclamg: Handle delayed dequeue

From: Hongyan Xia
Date: Tue Aug 20 2024 - 12:23:16 EST


On 27/07/2024 11:27, Peter Zijlstra wrote:
Delayed dequeue has tasks sit around on the runqueue that are not
actually runnable -- specifically, they will be dequeued the moment
they get picked.

One side-effect is that such a task can get migrated, which leads to a
'nested' dequeue_task() scenario that messes up uclamp if we don't
take care.

Notably, dequeue_task(DEQUEUE_SLEEP) can 'fail' and keep the task on
the runqueue. This however will have removed the task from uclamp --
per uclamp_rq_dec() in dequeue_task(). So far so good.

However, if at that point the task gets migrated -- or nice adjusted
or any of a myriad of operations that does a dequeue-enqueue cycle --
we'll pass through dequeue_task()/enqueue_task() again. Without
modification this will lead to a double decrement for uclamp, which is
wrong.

Reported-by: Luis Machado <luis.machado@xxxxxxx>
Reported-by: Hongyan Xia <hongyan.xia2@xxxxxxx>
Signed-off-by: Peter Zijlstra (Intel) <peterz@xxxxxxxxxxxxx>
---
kernel/sched/core.c | 16 +++++++++++++++-
1 file changed, 15 insertions(+), 1 deletion(-)

--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1676,6 +1676,9 @@ static inline void uclamp_rq_inc(struct
if (unlikely(!p->sched_class->uclamp_enabled))
return;
+ if (p->se.sched_delayed)
+ return;
+
for_each_clamp_id(clamp_id)
uclamp_rq_inc_id(rq, p, clamp_id);
@@ -1700,6 +1703,9 @@ static inline void uclamp_rq_dec(struct
if (unlikely(!p->sched_class->uclamp_enabled))
return;
+ if (p->se.sched_delayed)
+ return;
+
for_each_clamp_id(clamp_id)
uclamp_rq_dec_id(rq, p, clamp_id);
}
@@ -1979,8 +1985,12 @@ void enqueue_task(struct rq *rq, struct
psi_enqueue(p, (flags & ENQUEUE_WAKEUP) && !(flags & ENQUEUE_MIGRATED));
}
- uclamp_rq_inc(rq, p);
p->sched_class->enqueue_task(rq, p, flags);
+ /*
+ * Must be after ->enqueue_task() because ENQUEUE_DELAYED can clear
+ * ->sched_delayed.
+ */
+ uclamp_rq_inc(rq, p);

Apart from the typo in the title, this is a notable functional change.

Both classes that support uclamp update the CPU frequency in enqueue_task(). Before, a task that have uclamp_min will immediately drive up the frequency the moment it is enqueued. Now, driving up the frequency is delayed until the next util update.

I do not yet have evidence suggesting this is quantitatively bad, like first frame drops, but we might want to keep an eye on this, and switch back to the old way if possible.

if (sched_core_enabled(rq))
sched_core_enqueue(rq, p);
@@ -2002,6 +2012,10 @@ inline bool dequeue_task(struct rq *rq,
psi_dequeue(p, flags & DEQUEUE_SLEEP);
}
+ /*
+ * Must be before ->dequeue_task() because ->dequeue_task() can 'fail'
+ * and mark the task ->sched_delayed.
+ */
uclamp_rq_dec(rq, p);
return p->sched_class->dequeue_task(rq, p, flags);
}