Re: [PATCH 0/7 v5] sched/fair: Rework EAS to handle more cases

From: Christian Loehle
Date: Mon Mar 24 2025 - 12:42:22 EST


On 3/2/25 21:05, Vincent Guittot wrote:
> The current Energy Aware Scheduler has some known limitations which have
> become more and more visible with features like uclamp as an example. This
> series tries to fix some of those issues:
> - tasks stacked on the same CPU of a PD
> - tasks stuck on the wrong CPU.
>
> Patch 1 fixes the case where a CPU is wrongly classified as overloaded
> whereas it is capped to a lower compute capacity. This wrong classification
> can prevent the periodic load balancer from selecting a group_misfit_task CPU
> because group_overloaded has higher priority.
>
> Patch 2 creates a new EM interface that will be used by Patch 3.
>
> Patch 3 fixes the issue of tasks being stacked on the same CPU of a PD whereas
> others might be a better choice. feec() looks for the CPU with the highest
> spare capacity in a PD assuming that it will be the best CPU from an energy
> efficiency PoV because it will require the smallest increase of OPP.
> This is often but not always true: this policy filters out some other CPUs
> which would be as efficient because they use the same OPP but with fewer
> running tasks, as an example.
> In fact, we only care about the cost of the new OPP that will be
> selected to handle the waking task. In many cases, several CPUs will end
> up selecting the same OPP and as a result having the same energy cost. In
> such cases, we can use other metrics to select the best CPU with the same
> energy cost. Patch 3 reworks feec() to look first for the lowest cost in a PD
> and then for the most performant CPU among the CPUs with that cost. For now,
> this only tries to evenly spread the number of runnable tasks on CPUs but
> this can be improved with other metrics like the sched slice duration in a
> follow-up series.
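
A minimal sketch of the selection order described above, to make the
tie-break explicit. It is not the code from patch 3: struct cpu_candidate,
its fields and pick_cpu() are hypothetical names, and the per-CPU cost is
assumed to be already computed.

```c
#include <stddef.h>

/* Hypothetical per-CPU candidate; illustrative only, not the patch's data. */
struct cpu_candidate {
	int cpu;
	unsigned long cost;       /* assumed: cost of the OPP needed with the task placed here */
	unsigned int nr_running;  /* runnable tasks currently on this CPU */
};

/*
 * Return the id of the preferred CPU, or -1 if there are no candidates.
 * Primary key: lowest cost. Tie-break: fewest runnable tasks.
 * On a full tie the earliest candidate in the array is kept.
 */
int pick_cpu(const struct cpu_candidate *c, size_t n)
{
	const struct cpu_candidate *best = NULL;
	size_t i;

	for (i = 0; i < n; i++) {
		if (!best ||
		    c[i].cost < best->cost ||
		    (c[i].cost == best->cost &&
		     c[i].nr_running < best->nr_running))
			best = &c[i];
	}

	return best ? best->cpu : -1;
}
```

With two CPUs at the same cost, the one with fewer runnable tasks wins even
if the other has more spare capacity, which is the behaviour change relative
to a pure max-spare-capacity policy.
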
>
> perf sched pipe on a dragonboard rb5 has been used to compare the overhead
> of the new feec() vs current implementation.
>
> 9 iterations of perf bench sched pipe -T -l 80000
>                  ops/sec  stdev
> tip/sched/core   16634    (+/- 0.5%)
> + patches 1-3    17434    (+/- 1.2%)  +4.8%
>
>
> Patch 4 removes the now unused em_cpu_energy().
>
> Patch 5 solves another problem with tasks being stuck on a CPU forever
> because they don't sleep anymore and as a result never wake up and call
> feec(). Such a task can be detected by comparing util_avg or runnable_avg
> with the compute capacity of the CPU. Once detected, we can call feec() to
> check if there is a better CPU for the stuck task. The call can be done in
> 2 places:
> - When the task is put back in the runnable list after its running slice
> with the balance callback mechanism similarly to the rt/dl push callback.
> - During cfs tick when there is only 1 running task stuck on the CPU in
> which case the balance callback can't be used.
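
A rough sketch of the detection described above, under stated assumptions.
This is not the task_stuck_on_cpu() from the series: struct task_signals and
task_may_be_stuck() are hypothetical names, and taking the larger of the two
signals and using a strict "greater than" comparison are assumptions made
for illustration.

```c
#include <stdbool.h>

/* Hypothetical snapshot of the two signals named above; not kernel types. */
struct task_signals {
	unsigned long util_avg;
	unsigned long runnable_avg;
};

/*
 * Assumed rule: the task may be stuck when the larger of its two signals
 * exceeds the compute capacity the CPU can currently provide.
 */
bool task_may_be_stuck(const struct task_signals *t,
		       unsigned long cpu_capacity)
{
	unsigned long util = t->util_avg > t->runnable_avg ?
			     t->util_avg : t->runnable_avg;

	return util > cpu_capacity;
}
```

A task flagged this way would then be a candidate for a feec() call from
either the push callback or the tick path listed above.
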
>
> This push callback mechanism with the new feec() algorithm ensures that
> tasks always get a chance to migrate to the most suitable CPU and don't
> stay stuck on a CPU which is no longer the most suitable one. As examples:
> - A task waking on a big CPU with a uclamp max preventing it from sleeping
> and waking up can migrate to a smaller CPU once it's more power efficient.
> - The tasks are spread on CPUs in the PD when they target the same OPP.
>
> Patch 6 adds the task misfit migration case to the cfs tick and push callback
> mechanism to prevent waking up an idle CPU unnecessarily.
>
> Patch 7 removes the need to test uclamp_min in cpu_overutilized to
> trigger the active migration of a task to another CPU.
>
> Compared to v4:
> - Fixed check_pushable_task for !SMP
>
> Compared to v3:
> - Fixed the empty functions
>
> Compared to v2:
> - Renamed the push and tick functions to ease understanding what they do.
> Both are kept in the same patch as they solve the same problem.
> - Created some helper functions
> - Fixed some typos and comments
> - The task_stuck_on_cpu() condition remains unchanged. Pierre suggested
> taking into account the min capacity of the CPU but this is not directly
> available right now. It can trigger feec() when uclamp_max is very low
> compared to the min capacity of the CPU but feec() should keep
> returning the same CPU. This can be handled in a follow-up patch.
>
> Compared to v1:
> - The call to feec() even when overutilized has been removed
> from this series and will be addressed in a separate series. Only the case
> of uclamp_min has been kept as it is now handled by the push callback and
> tick mechanism.
> - The push mechanism has been cleaned up, fixed and simplified.
>
> This series implements some of the topics discussed at OSPM [1]. Other
> topics will be part of another series.
>
> [1] https://youtu.be/PHEBAyxeM_M?si=ZApIOw3BS4SOLPwp
>
> Vincent Guittot (7):
> sched/fair: Filter false overloaded_group case for EAS
> energy model: Add a get previous state function
> sched/fair: Rework feec() to use cost instead of spare capacity
> energy model: Remove unused em_cpu_energy()
> sched/fair: Add push task mechanism for EAS
> sched/fair: Add misfit case to push task mecanism for EAS
> sched/fair: Update overutilized detection
>
> include/linux/energy_model.h | 111 ++----
> kernel/sched/fair.c | 721 ++++++++++++++++++++++++-----------
> kernel/sched/sched.h | 2 +
> 3 files changed, 518 insertions(+), 316 deletions(-)
>

Hi Vincent,
I'm giving this another go of reviewing after our OSPM discussions.
One thing which bothered me in the past is that there's just a lot going
on in this series, almost rewriting all of the EAS code in fair.c ;)

For easier reviewing I suggest splitting the series:
1. sched/fair: Filter false overloaded_group case for EAS
(Or actually just get this merged, no need to carry this around, is there?)
2. Rework feec to use more factors than just max_spare_cap to improve
responsiveness / reduce load (Patches 2,3,4)
3. Add push mechanism and make use of it for misfit migration (Patches
5,6,7)

In particular 2 & 3 could be separated, reviewed and tested on their own,
this would make it much easier to discuss what's being tackled here IMO.

Best regards,
Christian