Re: [PATCH 0/6 v8] sched/fair: Add push task mechanism and handle more EAS cases

From: Pierre Gondois

Date: Thu Feb 26 2026 - 13:38:39 EST



On 12/2/25 19:12, Vincent Guittot wrote:
This is a subset of [1] (sched/fair: Rework EAS to handle more cases)

[1] https://lore.kernel.org/all/20250314163614.1356125-1-vincent.guittot@xxxxxxxxxx/

The current Energy Aware Scheduler has some known limitations which have
become more and more visible with features like uclamp. This
series tries to fix some of those issues:
- tasks stacked on the same CPU of a PD
- tasks stuck on the wrong CPU.

Echoing some of the other comments, I'm not sure I understand the use case
the patchset tries to solve.
- If this is for UCLAMP_MAX tasks:
As Christian said (somewhere), the utilization of a long-running task doesn't
represent anything, so using EAS to do task placement cannot give a good
placement. The push mechanism effectively allows down-migrating UCLAMP_MAX
tasks, but the distribution of these tasks across CPUs is then essentially
random.

On a Radxa Orion:
- 12 CPUs
- CPU[1-4] are little CPUs with capacity=290
- using an artificial EM

Running 8 CPU-bound tasks with UCLAMP_MAX=100, the task placement can be:
- CPU1: 6 tasks
- CPU2: 1 task
- CPU3: 1 task
- CPU4: idle
The push mechanism triggers feec() and down-migrates tasks to little CPUs.
However, it doesn't balance the (load / capacity) ratio between CPUs as the
load balancer would, so the placement above is consistent with that behavior.

Another point is that it is hard to reason about what a 'fair' task placement
is for UCLAMP_MAX tasks as their throughput is limited on purpose.

The previous version of your patchset was trying to solve that issue,
but IMO this issue is inherent to the UCLAMP_MAX setting. EAS doesn't
consider load during task placement, as all tasks are supposed
to be ~periodic and have wake-up events. CPUs are also supposed to have
some idle time, which guarantees that tasks are never really
starving, but UCLAMP_MAX contradicts this assumption.
With:
- Task[0-1]: NICE=-19, cpumask = CPUA,CPUB
- Task[2-3]: NICE=19, cpumask = CPUA,CPUB
The following task placement:
- CPUA: Task0 + Task1
- CPUB: Task2 + Task3
is fine for EAS, but sched_balance_find_dst_cpu() would do:
- CPUA: Task0 + Task2
- CPUB: Task1 + Task3
to balance the load, which is more 'fair'.

------------

- If this is to have better energy results by running feec() more often

You say in the cover letter that more numbers will come
later, so I'm curious to see the improvement.

Also, I think Christian mentioned somewhere that feec() is subject to
concurrency issues. I quickly got some numbers and didn't see
a huge increase in concurrent decisions with the push mechanism,
but this indeed seems like something to worry about.

feec() is also costly to run, though I don't have numbers to provide here.

------------

- If this is to bail out of the OU state faster by migrating tasks to idle CPUs
or running feec() before a CPU is considered as overutilized

I can understand this point. When testing the patches, it seemed that
a task whose utilization was ramping up still triggered the OU state.

Indeed other CPUs are going through a load balance through:

sched_balance_find_src_group()
\-update_sd_lb_stats
  \-set_rd_overutilized()

and trigger the OU state, or through:

task_tick_fair()
\-check_pushable_task()
  \-if (rq->nr_running > 1) -> return false
\-check_update_overutilized_status()

Also task_stuck_on_cpu() checks whether a single task fills the CPU capacity,
not whether the CPU utilization reaches the 80% threshold.


So I didn't see that much improvement on the OU front.
However as Qais noted, the load balancer is effectively quite slow to migrate
misfit tasks.

The patchset runs some checks on each sched tick and each time a rq switches
to another task. If the goal is to:
- non-EAS: push misfit tasks quickly
- EAS: avoid entering the OU state
this would already be a great improvement. I assume this would also allow
removing the misfit handling code in the load balancer.

This would also mean extending the push mechanism to all HMP systems,
not just EAS-enabled systems.

------------

Summary:

- IMO UCLAMP_MAX tasks will always be an issue for EAS. Even if these tasks
were down-migrated, other issues would come up.

- I'm interested in seeing energy consumption improvement numbers,
or other performance numbers.

- Following Qais (IIUC), the push mechanism could be useful to improve misfit task
migration latency and avoid entering the OU state. I tried some modifications
along those lines and didn't see any showstopper so far. This would also allow
removing some code in the load balancer.



Patch 1 fixes the case where a CPU is wrongly classified as overloaded
whereas it is capped to a lower compute capacity. This wrong classification
can prevent the periodic load balancer from selecting a group_misfit_task CPU
because group_overloaded has higher priority.

Patch 2 removes the need to test uclamp_min in cpu_overutilized() to
trigger the active migration of a task to another CPU.

Patch 3 prepares select_task_rq_fair() to be called without TTWU, Fork or
Exec flags when we just want to look for a possible better CPU.

Patch 4 adds a push callback mechanism to the fair scheduler but doesn't
enable it.

Patch 5 enables has_idle_core for !SMT systems to track whether there may be
an idle CPU in the LLC.

Patch 6 adds some conditions to enable pushing runnable tasks for EAS:
- when a task is stuck on a CPU and the system is not overutilized.
- if there is a possible idle CPU when the system is overutilized.

More test results will come later as I wanted to send the patchset before
LPC.

I have kept the Tbench figures added in v7; results are the same
with the corrected patch 6.

Tbench on dragonboard rb5
schedutil and EAS enabled

# process    tip               +patchset
 1            29.3 (+/-0.3%)    29.2 (+/-0.2%)   +0%
 2            61.1 (+/-1.8%)    61.7 (+/-3.2%)   +1%
 4           260.0 (+/-1.7%)   258.8 (+/-2.8%)   -1%
 8          1361.2 (+/-3.1%)  1377.1 (+/-1.9%)   +1%
16           981.5 (+/-0.6%)   958.0 (+/-1.7%)   -2%

Hackbench didn't show any difference

Changes since v7:
- Rebased on latest tip/sched/core
- Fix some typos
- Fix patch 6 mess

Vincent Guittot (6):
sched/fair: Filter false overloaded_group case for EAS
sched/fair: Update overutilized detection
sched/fair: Prepare select_task_rq_fair() to be called for new cases
sched/fair: Add push task mechanism for fair
sched/fair: Enable idle core tracking for !SMT
sched/fair: Add EAS and idle cpu push trigger

kernel/sched/fair.c | 350 +++++++++++++++++++++++++++++++++++-----
kernel/sched/sched.h | 46 ++++--
kernel/sched/topology.c | 2 +
3 files changed, 345 insertions(+), 53 deletions(-)