Re: [PATCH 1/5] sched/fair: Filter false overloaded_group case for EAS

From: Pierre Gondois
Date: Fri Sep 13 2024 - 09:22:11 EST


Hello Vincent,

I have been trying this patch with the following workload, on a Pixel6
(4 littles, 2 mid, 2 big):
a. 5 tasks with: [UCLAMP_MIN:0, UCLAMP_MAX:1, duty_cycle=100%, cpuset:0-2]
b. 1 task with: [duty_cycle=100%, cpuset:0-7] but starting on CPU4

a.
There are many UCLAMP_MAX tasks so as to pass the following condition
and tag the group as overloaded:
group_is_overloaded()
\-(sgs->sum_nr_running <= sgs->group_weight)
These tasks should put the little cluster in an overloaded state without
making any of its CPUs overutilized.

b. The task is CPU-bound; the goal is to see whether it gets migrated to
the big cluster.

---
- Without patch 5 [RFC PATCH 5/5] sched/fair: Add push task callback for EAS
- Without this patch
The migration effectively does not happen, due to the load_balancer
selecting the little cluster over the mid cluster. The little cluster puts
the system in an overutilized state.

---
- Without patch 5 [RFC PATCH 5/5] sched/fair: Add push task callback for EAS
- With this patch
The load_balancer effectively selects the mid cluster over the little
cluster (since none of the little CPUs is overutilized). The load_balancer
then migrates task b. to a big CPU.

Note:
This is true most of the time, but whenever a non-UCLAMP_MAX task wakes up
on one of CPU0-3 (where the UCLAMP_MAX tasks are pinned), the cluster becomes
overutilized and the new mechanism is bypassed.
The same happens if a task with [UCLAMP_MIN:0, UCLAMP_MAX:1024, duty_cycle=100%, cpuset:0]
is added to the workload.

---
- With patch 5 [RFC PATCH 5/5] sched/fair: Add push task callback for EAS
- Without this patch

The task b. gets an opportunity to migrate to a big CPU through the sched_tick.
However, when both patches are applied, the migration is triggered by the
load_balancer.

---
So FWIW, from a mechanism PoV and independently from patch 5:
Tested-by: Pierre Gondois <pierre.gondois@xxxxxxx>


On 8/30/24 15:03, Vincent Guittot wrote:
With EAS, a group should be set overloaded if at least 1 CPU in the group
is overutilized, but it can happen that a CPU is fully utilized by tasks
because of clamping the compute capacity of the CPU. In such a case, the CPU
is not overutilized and, as a result, should not be set overloaded either.

group_overloaded having a higher priority than group_misfit, such a group can
be selected as the busiest group instead of a group with a misfit task,
which prevents load_balance from selecting the CPU with the misfit task to
pull the latter onto a fitting CPU.

Signed-off-by: Vincent Guittot <vincent.guittot@xxxxxxxxxx>
---
kernel/sched/fair.c | 12 +++++++++++-
1 file changed, 11 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index fea057b311f6..e67d6029b269 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9806,6 +9806,7 @@ struct sg_lb_stats {
enum group_type group_type;
unsigned int group_asym_packing; /* Tasks should be moved to preferred CPU */
unsigned int group_smt_balance; /* Task on busy SMT be moved */
+ unsigned long group_overutilized; /* At least one CPU in the group is overutilized */
unsigned long group_misfit_task_load; /* A CPU has a task too big for its capacity */
#ifdef CONFIG_NUMA_BALANCING
unsigned int nr_numa_running;
@@ -10039,6 +10040,13 @@ group_has_capacity(unsigned int imbalance_pct, struct sg_lb_stats *sgs)
static inline bool
group_is_overloaded(unsigned int imbalance_pct, struct sg_lb_stats *sgs)
{
+ /*
+ * With EAS and uclamp, 1 CPU in the group must be overutilized to
+ * consider the group overloaded.
+ */
+ if (sched_energy_enabled() && !sgs->group_overutilized)
+ return false;
+
if (sgs->sum_nr_running <= sgs->group_weight)
return false;
@@ -10252,8 +10260,10 @@ static inline void update_sg_lb_stats(struct lb_env *env,
if (nr_running > 1)
*sg_overloaded = 1;
- if (cpu_overutilized(i))
+ if (cpu_overutilized(i)) {
*sg_overutilized = 1;
+ sgs->group_overutilized = 1;
+ }
#ifdef CONFIG_NUMA_BALANCING
sgs->nr_numa_running += rq->nr_numa_running;