Re: [PATCH 0/6 v8] sched/fair: Add push task mechanism and handle more EAS cases
From: Pierre Gondois
Date: Tue Mar 10 2026 - 06:31:49 EST
On 3/10/26 05:16, Qais Yousef wrote:
> On 02/26/26 18:34, Pierre Gondois wrote:
>> On 12/2/25 19:12, Vincent Guittot wrote:
>>> This is a subset of [1] (sched/fair: Rework EAS to handle more cases)
>>> [1] https://lore.kernel.org/all/20250314163614.1356125-1-vincent.guittot@xxxxxxxxxx/
>>>
>>> The current Energy Aware Scheduler has some known limitations which
>>> have become more and more visible with features like uclamp. This
>>> series tries to fix some of those issues:
>>> - tasks stacked on the same CPU of a PD
>>> - tasks stuck on the wrong CPU
>>
>> Following some other comments, I'm not sure I understand the use case
>> the patchset tries to solve.
>> - If this is for UCLAMP_MAX tasks:
>>   As Christian said (somewhere), the utilization of a long-running
>>   task doesn't represent anything, so using EAS for task placement
>>   cannot give a good placement. The push mechanism effectively allows
>>   down-migrating UCLAMP_MAX tasks, but the repartition of these tasks
>>   is then subject to randomness.
>
> Why randomness? We should distribute within the same perf domain, no?

Yes right, but cf. the example below: UCLAMP_MAX tasks will be
distributed regardless of the load.

>> On a Radxa Orion:
>> - 12 CPUs
>> - CPU[1-4] are little CPUs with capa=290
>> - using an artificial EM
>> Running 8 CPU-bound tasks with UCLAMP_MAX=100, the task placement can
>> be:
>> - CPU1: 6 tasks
>> - CPU2: 1 task
>> - CPU3: 1 task
>> - CPU4: idle
>> The push mechanism triggers feec() and down-migrates tasks to little
>> CPUs. However, it doesn't balance the ratio of (load / capacity)
>> between CPUs as the load balancer could do. So the above placement is
>> correct in that regard.
>
> Hmm. Energy should tell us which perf domain is cheaper. But within
> the same perf domain we pick the CPU with the most spare capacity.
> Do all the CPUs appear loaded with max_spare_cap = 0?

Yes, as they all have no spare cycles. This results in prev_cpu being
picked.

>> In a way feec() does its job: this is a correct placement
>> energy-wise. However feec() wasn't made to handle cases where
>> utilization is not reliable.
>
> Worth noting as part of looking at enabling overloaded support: it is
> important to look at nr_running, which I think is something we should
> consider as we evolve this handling. But for now, I think
> max_spare_cap checks should distribute within a perf domain.
> nr_running will handle this more gracefully, and it is trivial to add
> later to feec(). But ideally we want all wake-up code to look at
> nr_running, and I think it is better to defer that to after the
> initial merge.

If we have 2 little CPUs (CPU0/CPU1) with 4 tasks:
- TaskA: Nice=10 (i.e. weight=110)
- Task[B,C,D]: Nice=15 (i.e. weight=36)
Then balancing on nr_running would yield a placement with 2 tasks
on each CPU:
- CPU0: TaskA + TaskB
Total weight = 110 + 36 = 146
- CPU1: TaskC + TaskD
Total weight = 36 + 36 = 72
With such a placement:
- TaskA and TaskB receive less throughput
- TaskC and TaskD receive more throughput
than they would if the placement were balanced.
This is not compliant with the scheduler's Nice interface.
Also, the UCLAMP documentation states that it should only
be treated as a hint.
A more balanced placement is:
- CPU0: TaskA
Total weight = 110
- CPU1: TaskB + TaskC + TaskD
Total weight = 36 + 36 + 36 = 108
The previous versions of Vincent's patchset were already using
nr_running to help balance UCLAMP_MAX tasks in feec(), IIRC.
However, this will likely lead to the creation of a second
load balancer in feec(), as the example above shows.
------------
The push mechanism allows down-migrating UCLAMP_MAX tasks,
which is indeed a better handling of UCLAMP_MAX. However, it is
likely the first step toward more complicated issues.
IMO the best way to handle UCLAMP_MAX tasks would be to make
them second-class tasks, as the documentation describes them:
"""
Like explained for Android case in the introduction. Any app can lower
UCLAMP_MAX for some background tasks that don't care about performance
but could end up being busy and consume unnecessary system resources
on the system.
"""
But this would require having QoS classes for fair tasks and
this is also a large and complex problem.
Another solution would be to force the policy of every
UCLAMP_MAX task to SCHED_IDLE. This would also allow just balancing
the number of h_nr_idle tasks on each CPU, as you and Vincent want to
do, IIUC. Indeed, if tasks all have the same weight, the example above
no longer holds.
Using SCHED_IDLE for UCLAMP_MAX tasks can be viewed as a cheap
implementation of a lower-QoS task class. Their priority is lower than
that of 'normal' CFS tasks (i.e. those without UCLAMP_MAX set) and they
cannot steal time from 'normal' tasks.
But:
- the higher the Nice value of a task, the less true this becomes;
- as UCLAMP_MAX tasks and normal CFS tasks are still part of the same
'class', they compete for CPU time at the same level.
Thus UCLAMP_MAX tasks cannot be made 'background tasks' that actually
run on spare CPU cycles (and avoid entering the over-utilized state).
------------
So IMO, proper placement of UCLAMP_MAX tasks can only be achieved once
QoS classes are implemented. The push mechanism is still a good idea
for misfit/overutilized handling, though.