Re: [PATCH 3/7 v5] sched/fair: Rework feec() to use cost instead of spare capacity
From: Pierre Gondois
Date: Tue Mar 25 2025 - 07:13:14 EST
Hello Vincent,
On 3/2/25 22:05, Vincent Guittot wrote:
feec() looks for the CPU with highest spare capacity in a PD assuming that
it will be the best CPU from a energy efficiency PoV because it will
require the smallest increase of OPP. Although this is true generally
speaking, this policy also filters some others CPUs which will be as
efficients because of using the same OPP.
In fact, we really care about the cost of the new OPP that will be
selected to handle the waking task. In many cases, several CPUs will end
up selecting the same OPP and as a result using the same energy cost. In
these cases, we can use other metrics to select the best CPU for the same
energy cost.
Rework feec() to look 1st for the lowest cost in a PD and then the most
performant CPU between CPUs. The cost of the OPP remains the only
comparison criteria between Performance Domains.
Signed-off-by: Vincent Guittot <vincent.guittot@xxxxxxxxxx>
---
kernel/sched/fair.c | 466 +++++++++++++++++++++++---------------------
1 file changed, 246 insertions(+), 220 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index d3d1a2ba6b1a..a9b97bbc085f 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
[...]
- if ((best_fits > prev_fits) ||
- ((best_fits > 0) && (best_delta < prev_delta)) ||
- ((best_fits < 0) && (best_actual_cap > prev_actual_cap)))
- target = best_energy_cpu;
+ /* Add the energy cost of p */
+ delta_nrg = task_util * min_stat.cost;
Not capping task_util to the target CPU's max capacity when computing energy
means this is ok to have fully-utilized CPUs.
Just to re-advertise what can happen with utilization values of CPUs that are
fully utilized:
On a Pixel6, placing 3 tasks with:
period=16ms, duty_cycle=100%, UCLAMP_MAX=700, cpuset=5,6
I.e. the tasks can run on one medium/big core. UCLAMP_MAX is set such as the
system doesn't become overutilized.
Tasks are going to bounce on the medium CPU as, if we follow one task:
1. The task will grow to reach a maximum utilization of ~340 (=1024/3 due to
the pressure of other tasks)
2. As the big CPU is less energy efficient than the medium CPU, the big CPU
will eventually reach an OPP where it will be better to run on the medium CPU
energy-wise
3. The task is migrated to the medium CPU. However the task can now grow
its utilization since there is no pressure from other tasks. So the
utilization of the task slowly grows.
4. The medium CPU reaches on OPP where it is more energy efficient to migrate
the task on the big CPU. We can go to step 1.
As the utilization is only a reflection of how much CPU time the task received,
util_avg depends on the #tasks/niceness of the co-scheduled tasks. Thus, it is
really hard to make any assumption based on this value. UCLAMP_MAX puts tasks
in the exact situation where util_avg doesn't represent the size of the task
anymore.
------
In this example, niceness is not taken into account, but cf. the other thread,
other scenarios could be created where the task placement is incorrect/un-stable.
EAS mainly relies on util_avg, so having correct task estimations should be the
priority.
------
As you and Christian I think briefly evoked as an idea, setting the SCHED_IDLE
policy to UCLAMP_MAX could maybe help. However, this implies:
1. not being able to set UCLAMP_MIN and UCLAMP_MAX at the same time for a task
Indeed, as someone might want to have SCHED_NORMAL for it UCLAMP_MIN task
2. not using feec() for UCLAMP_MAX tasks.
Indeed, SCHED_IDLE tasks will likely have wrong util_avg values since they don't
have much running time. So doing energy computation using these values would not
work.
Another advantage is that UCLAMP_MAX tasks will impact the util_avg of
non-UCLAMP_MAX tasks only lightly as their running time will be really low.
Balancing CPUs based on h_nr_idle rather than h_nr_runnable would also allow
to have ~the same number of UCLAMP_MAX tasks on all CPUs of the same
type/capacity.
------
At the risk of being a bit insistant, I don't see how UCLAMP_MAX tasks could
be placed using EAS. Either EAS or UCLAMP_MAX needs to change...