Re: [PATCH 3/7 v5] sched/fair: Rework feec() to use cost instead of spare capacity

From: Pierre Gondois
Date: Sun Mar 16 2025 - 16:22:16 EST




On 3/14/25 17:24, Vincent Guittot wrote:
On Wed, 12 Mar 2025 at 15:09, Pierre Gondois <pierre.gondois@xxxxxxx> wrote:

Hello Vincent,

On 3/2/25 22:05, Vincent Guittot wrote:
feec() looks for the CPU with the highest spare capacity in a PD, assuming
that it will be the best CPU from an energy efficiency PoV because it will
require the smallest increase of OPP. Although this is generally true, this
policy also filters out some other CPUs which would be just as efficient
because they use the same OPP.
In fact, we really care about the cost of the new OPP that will be
selected to handle the waking task. In many cases, several CPUs will end
up selecting the same OPP and as a result using the same energy cost. In
these cases, we can use other metrics to select the best CPU for the same
energy cost.

Rework feec() to look first for the lowest cost in a PD and then for the
most performant CPU among those CPUs. The cost of the OPP remains the only
comparison criterion between Performance Domains.

Signed-off-by: Vincent Guittot <vincent.guittot@xxxxxxxxxx>
---
kernel/sched/fair.c | 466 +++++++++++++++++++++++---------------------
1 file changed, 246 insertions(+), 220 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index d3d1a2ba6b1a..a9b97bbc085f 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c

[...]

+static bool update_best_cpu(struct energy_cpu_stat *target,
+			    struct energy_cpu_stat *min,
+			    int prev, struct sched_domain *sd)
 {
-	unsigned long max_util = eenv_pd_max_util(eenv, pd_cpus, p, dst_cpu);
-	unsigned long busy_time = eenv->pd_busy_time;
-	unsigned long energy;
-
-	if (dst_cpu >= 0)
-		busy_time = min(eenv->pd_cap, busy_time + eenv->task_busy_time);
+	/* Select the one with the least number of running tasks */
+	if (target->nr_running < min->nr_running)
+		return true;
+	if (target->nr_running > min->nr_running)
+		return false;

-	energy = em_cpu_energy(pd->em_pd, max_util, busy_time, eenv->cpu_cap);
+	/* Favor previous CPU otherwise */
+	if (target->cpu == prev)
+		return true;
+	if (min->cpu == prev)
+		return false;

-	trace_sched_compute_energy_tp(p, dst_cpu, energy, max_util, busy_time);
+	/*
+	 * Choose CPU with lowest contention. One might want to consider load
+	 * instead of runnable but we are supposed to not be overutilized so
+	 * there is enough compute capacity for everybody.
+	 */
I'm not sure I understand the comment. With UCLAMP_MAX tasks, a CPU can lack
compute capacity while not being tagged as overutilized. IIUC this is actually
the goal of UCLAMP_MAX.

the uclamp_max says that the task doesn't need more than a compute
capacity of 1, so there is no lack

If UCLAMP_MAX were a bandwidth hint and the task were requesting 1/1024 ~= 0.1%
of the bandwidth, then effectively there would be no lack.
However, if this is a performance hint, then the CPU will simply run at its
lowest available performance level. I don't think there is any guarantee on
the bandwidth in that case.
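
As a side note, a minimal sketch of the "performance hint" semantics I'm
describing (illustrative only, not the kernel's actual uclamp code): the
clamp bounds the utilization value fed to OPP selection, it does not cap
the CPU time the task actually consumes:

static unsigned long clamped_util(unsigned long util,
				  unsigned long uclamp_min,
				  unsigned long uclamp_max)
{
	if (util < uclamp_min)
		util = uclamp_min;
	if (util > uclamp_max)
		util = uclamp_max;
	/* Used for frequency/placement decisions, not for throttling */
	return util;
}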



With the following workload:
- 2 tasks A with duty_cycle=30%, UCLAMP_MIN/MAX=(0,1), niceness=0
- 2 tasks B with duty_cycle=70%, UCLAMP_MIN/MAX=(0,1), niceness=-10

Does the duty cycle make any difference here? They won't be able to
run at 30% or 70% anyway because of their uclamp_max, will they?

No indeed, this will make no difference.


The workload runs on a Pixel6 with a reduced cpuset of [1,2,7], i.e. 2 little
CPUs (1,2) with capa=160 and one big CPU (7) with capa=1024.
CPU7 is avoided by the tasks as their UCLAMP_MAX setting makes them fit on
the little CPUs.

select_best_cpu() prefers to place tasks based on nr_running. If the 2 tasks A
end up being placed on one little CPU and the 2 tasks B on the other little
CPU, feec() is theoretically unable to balance the workload.

They will all have a compute capacity of 80, which is more than their
uclamp_max=1

Cf. above: if UCLAMP_MAX is a performance hint, then there should be no
guarantee on the received bandwidth. If we take the same example with
UCLAMP_MAX=81, this should still hold, yet the received bandwidth will be
80 util.

(Small note: with UCLAMP_MAX=1, the little CPUs should run at a performance
level lower than 160. But the example still holds if the little CPUs only
have one OPP, at 160.)
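
For reference, the arithmetic behind the figure of 80 (assuming equal weights
on each little CPU, which holds here since each little CPU hosts two tasks of
the same niceness):

	little CPU capacity           = 160
	always-runnable tasks per CPU = 2
	bandwidth per task            = 160 / 2 = 80 util

i.e. well above uclamp_max=1, but short of uclamp_max=81.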

------

To go further in that direction, balancing on the number of tasks + runnable
looks like a variant of the load balancer's group_overloaded case: we are
trying to share the bandwidth equally among UCLAMP_MAX tasks, but without
taking the niceness of tasks into account.
EAS never had to handle a shortage of compute capacity until UCLAMP_MAX came
in: thanks to the overutilized threshold, EAS was only active when there was
enough compute capacity.

Doing load balancing among UCLAMP_MAX tasks in feec() seems like a complex
exercise. What if the pd topology is: 2 little CPUs + 2 little CPUs + 1 big
CPU? AFAICS, UCLAMP_MAX tasks could all start on one of the little pds and
never be balanced toward the second little pd, as we only look for the CPU
with the lowest number of tasks inside one pd (assuming that all little CPUs
are running at the same, lowest OPP).
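
As a sketch of the flow I have in mind (names and structure are illustrative,
not this patch's actual code): PDs are compared by OPP cost only, and
nr_running only breaks ties between CPUs *within* a PD, so two little pds
sitting at the same lowest OPP compare equal and the loaded one is never
drained toward the idle one:

struct pd_choice {
	int best_cpu;		/* best CPU inside this PD */
	unsigned long cost;	/* energy cost of the PD's resulting OPP */
};

static int pick_cpu(struct pd_choice *pds, int nr_pds)
{
	int i, best = 0;

	for (i = 1; i < nr_pds; i++) {
		/* Strict '<': equal-cost pds keep the first one found */
		if (pds[i].cost < pds[best].cost)
			best = i;
	}
	return pds[best].best_cpu;
}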


In practice, a kworker ends up spawning on one of these 2 little CPUs and tasks
are shuffled, so the pattern breaks after ~30ms.

This pattern seems problematic as tasks A are:
- smaller (30% < 70%)
- nicer (0 > -10)
than tasks B. So I assume the correct task placement should be one task of each
type on each little CPU.

------

There are some comments in the load balancer code:
1.
/* Computing avg_load makes sense only when group is overloaded */
2.
/*
* Computing avg_load makes sense only when group is fully busy or
* overloaded
*/

IIUC, load is only meaningful for estimating task size when there is not
enough compute capacity; otherwise util_avg makes more sense. It seems that
when it comes to UCLAMP_MAX tasks, CPUs are placed in this exact situation:
load_avg makes more sense than util_avg.
However, in this situation, energy computations also lose their meaning,
since they are based on util_avg values.
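
Illustrative numbers (weights taken from the kernel's sched_prio_to_weight[]
table, PELT decay omitted): on a saturated CPU both tasks converge to the
same util_avg, while load_avg keeps the niceness visible:

	task A: nice   0 -> weight 1024
	task B: nice -10 -> weight 9548

	load ~ runnable_avg * weight / 1024, so with both tasks fully
	runnable: load(B) / load(A) ~ 9548 / 1024 ~ 9.3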

------

select_best_cpu() could check the CPU load before checking nr_running, but it
would be meaningless if there is enough CPU time for all the tasks.
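
For instance, a hypothetical variant of update_best_cpu()'s tie-breaking
(illustrative only, and assuming a 'load' field mirroring cpu_load() were
added to struct energy_cpu_stat):

static bool update_best_cpu_load_first(struct energy_cpu_stat *target,
				       struct energy_cpu_stat *min)
{
	/* Compare load first so niceness matters when CPU time is scarce */
	if (target->load < min->load)
		return true;
	if (target->load > min->load)
		return false;

	/* Fall back to the patch's nr_running comparison */
	return target->nr_running < min->nr_running;
}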

Maybe CPU load should then be checked only if the system doesn't have enough
CPU time. But this would be equivalent to:
- removing UCLAMP_MAX from cpu_overutilized()
- when the system is overutilized (no UCLAMP_MAX involved), going back to the
load balancer

In other words, I don't really see how it is possible to reconcile UCLAMP_MAX
tasks with EAS: EAS relies on util_avg values, while UCLAMP_MAX forces us to
rely on load_avg values rather than util_avg.

Regards,
Pierre


+	if ((target->runnable * min->capa * sd->imbalance_pct) >=
+	    (min->runnable * target->capa * 100))
+		return false;

-	return energy;
+	return true;
 }