Re: [RFC PATCH v3 3/6] sched/fair: Use util biases for utilization and frequency

From: Hongyan Xia
Date: Tue May 28 2024 - 07:40:05 EST


On 26/05/2024 23:52, Dietmar Eggemann wrote:
On 07/05/2024 14:50, Hongyan Xia wrote:
Use the new util_avg_bias for task and runqueue utilization. We also
maintain separate util_est and util_est_uclamp signals.

Now that we have the sum aggregated CFS util value, we do not need to

The word uclamp seems to be missing here.

Ack.

consult uclamp buckets to know how the frequency should be clamped. We
simply look at the aggregated top level rq->cfs.avg.util_avg +
rq->cfs.avg.util_avg_bias and rq->cfs.avg.util_est_uclamp to know what
frequency to choose and how to place tasks.
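
As a rough sketch of what this means (not the actual helper from the patch; the real cpu_util_cfs_uclamp() may well be shaped differently), the clamped CPU utilization can be read straight off the root cfs_rq without consulting uclamp buckets:

/* Sketch only, based on the description above -- not the patch itself. */
static inline unsigned long cpu_util_cfs_uclamp_sketch(int cpu)
{
	struct cfs_rq *cfs_rq = &cpu_rq(cpu)->cfs;
	/* PELT sum plus the (possibly negative) accumulated uclamp bias */
	long util = READ_ONCE(cfs_rq->avg.util_avg) +
		    READ_ONCE(cfs_rq->avg.util_avg_bias);

	util = max(util, 0L);
	/* util_est is maintained separately in its clamped form */
	util = max_t(long, util, READ_ONCE(cfs_rq->avg.util_est_uclamp));

	return min_t(unsigned long, util, arch_scale_cpu_capacity(cpu));
}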

[...]

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 571c8de59508..14376f23a8b7 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4819,14 +4819,14 @@ static inline unsigned long cfs_rq_load_avg(struct cfs_rq *cfs_rq)
static int sched_balance_newidle(struct rq *this_rq, struct rq_flags *rf);
-static inline unsigned long task_util(struct task_struct *p)
+static inline unsigned long task_runnable(struct task_struct *p)
{
- return READ_ONCE(p->se.avg.util_avg);
+ return READ_ONCE(p->se.avg.runnable_avg);
}
-static inline unsigned long task_runnable(struct task_struct *p)
+static inline unsigned long task_util(struct task_struct *p)
{
- return READ_ONCE(p->se.avg.runnable_avg);
+ return READ_ONCE(p->se.avg.util_avg);
}
static inline unsigned long _task_util_est(struct task_struct *p)
@@ -4839,6 +4839,52 @@ static inline unsigned long task_util_est(struct task_struct *p)
return max(task_util(p), _task_util_est(p));
}

IMHO, we now have two complete hierarchies of util signals:

        (a)                        (b) (uclamp version of (a))

    (1) task_util()                task_util_uclamp() (+ task_util_bias())

    (2) _task_util_est()           _task_util_est_uclamp()

    (3) task_util_est()            task_util_est_uclamp()

    (4) cpu_util()                 cpu_util_uclamp()

    (5) cpu_util_cfs()             cpu_util_cfs_uclamp()

    (6) cpu_util_cfs_boost()       cpu_util_cfs_boost_uclamp()


For (4), (5), (6) we now have:

--- cpu_util() cpu_util_uclamp()

eenv_pd_busy_time() x

eenv_pd_max_util() x

find_energy_efficient_cpu() x x <-- (A)

cpu_util_without() x

--- cpu_util_cfs() cpu_util_cfs_uclamp()

cpu_overutilized() x x <-- (B)

update_sg_lb_stats() x

update_numa_stats() x

sched_cpu_util() x

--- cpu_util_cfs_boost() cpu_util_cfs_boost_uclamp()

sched_balance_find_src_rq() x

sugov_get_util() x

The uclamp version falls back to the non-uclamp version for !CONFIG_UCLAMP_TASK.
So it looks like in this case we calculate the same cpu utilization
value twice in (A) and (B).

That is indeed the case. I'll see how I can avoid this duplication, hopefully without too much #ifdef mess.
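
One possible way to avoid it, as a sketch only (assuming cpu_overutilized() is one of the sites hitting both helpers, per (B) above), is to compute the plain value once and only call the uclamp variant when it can actually differ, which also keeps the #ifdefs out of the call site:

/* Sketch only -- not from the posted series. */
static inline int cpu_overutilized(int cpu)
{
	unsigned long util = cpu_util_cfs(cpu);
	unsigned long util_uclamp = util;

	/* With CONFIG_UCLAMP_TASK=n the uclamp variant returns the same value. */
	if (IS_ENABLED(CONFIG_UCLAMP_TASK))
		util_uclamp = cpu_util_cfs_uclamp(cpu);

	return !util_fits_cpu(util, util_uclamp, cpu);
}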

[...]

@@ -4970,134 +5026,29 @@ static inline unsigned long get_actual_cpu_capacity(int cpu)
}
static inline int util_fits_cpu(unsigned long util,
- unsigned long uclamp_min,
- unsigned long uclamp_max,
+ unsigned long util_uclamp,
int cpu)
{
unsigned long capacity = capacity_of(cpu);
- unsigned long capacity_orig;
- bool fits, uclamp_max_fits;
-
- /*
- * Check if the real util fits without any uclamp boost/cap applied.
- */
- fits = fits_capacity(util, capacity);
-
- if (!uclamp_is_used())
- return fits;
-
- /*
- * We must use arch_scale_cpu_capacity() for comparing against uclamp_min and
- * uclamp_max. We only care about capacity pressure (by using
- * capacity_of()) for comparing against the real util.
- *
- * If a task is boosted to 1024 for example, we don't want a tiny
- * pressure to skew the check whether it fits a CPU or not.
- *
- * Similarly if a task is capped to arch_scale_cpu_capacity(little_cpu), it
- * should fit a little cpu even if there's some pressure.
- *
- * Only exception is for HW or cpufreq pressure since it has a direct impact
- * on available OPP of the system.
- *
- * We honour it for uclamp_min only as a drop in performance level
- * could result in not getting the requested minimum performance level.
- *
- * For uclamp_max, we can tolerate a drop in performance level as the
- * goal is to cap the task. So it's okay if it's getting less.
- */
- capacity_orig = arch_scale_cpu_capacity(cpu);
- /*
- * We want to force a task to fit a cpu as implied by uclamp_max.
- * But we do have some corner cases to cater for..
- *
- *
- * C=z
- * | ___
- * | C=y | |
- * |_ _ _ _ _ _ _ _ _ ___ _ _ _ | _ | _ _ _ _ _ uclamp_max
- * | C=x | | | |
- * | ___ | | | |
- * | | | | | | | (util somewhere in this region)
- * | | | | | | |
- * | | | | | | |
- * +----------------------------------------
- * CPU0 CPU1 CPU2
- *
- * In the above example if a task is capped to a specific performance
- * point, y, then when:
- *
- * * util = 80% of x then it does not fit on CPU0 and should migrate
- * to CPU1
- * * util = 80% of y then it is forced to fit on CPU1 to honour
- * uclamp_max request.
- *
- * which is what we're enforcing here. A task always fits if
- * uclamp_max <= capacity_orig. But when uclamp_max > capacity_orig,
- * the normal upmigration rules should withhold still.
- *
- * Only exception is when we are on max capacity, then we need to be
- * careful not to block overutilized state. This is so because:
- *
- * 1. There's no concept of capping at max_capacity! We can't go
- * beyond this performance level anyway.
- * 2. The system is being saturated when we're operating near
- * max capacity, it doesn't make sense to block overutilized.
- */
- uclamp_max_fits = (capacity_orig == SCHED_CAPACITY_SCALE) && (uclamp_max == SCHED_CAPACITY_SCALE);
- uclamp_max_fits = !uclamp_max_fits && (uclamp_max <= capacity_orig);
- fits = fits || uclamp_max_fits;
+ if (fits_capacity(util_uclamp, capacity))
+ return 1;
- /*
- *
- * C=z
- * | ___ (region a, capped, util >= uclamp_max)
- * | C=y | |
- * |_ _ _ _ _ _ _ _ _ ___ _ _ _ | _ | _ _ _ _ _ uclamp_max
- * | C=x | | | |
- * | ___ | | | | (region b, uclamp_min <= util <= uclamp_max)
- * |_ _ _|_ _|_ _ _ _| _ | _ _ _| _ | _ _ _ _ _ uclamp_min
- * | | | | | | |
- * | | | | | | | (region c, boosted, util < uclamp_min)
- * +----------------------------------------
- * CPU0 CPU1 CPU2
- *
- * a) If util > uclamp_max, then we're capped, we don't care about
- * actual fitness value here. We only care if uclamp_max fits
- * capacity without taking margin/pressure into account.
- * See comment above.
- *
- * b) If uclamp_min <= util <= uclamp_max, then the normal
- * fits_capacity() rules apply. Except we need to ensure that we
- * enforce we remain within uclamp_max, see comment above.
- *
- * c) If util < uclamp_min, then we are boosted. Same as (b) but we
- * need to take into account the boosted value fits the CPU without
- * taking margin/pressure into account.
- *
- * Cases (a) and (b) are handled in the 'fits' variable already. We
- * just need to consider an extra check for case (c) after ensuring we
- * handle the case uclamp_min > uclamp_max.
- */
- uclamp_min = min(uclamp_min, uclamp_max);
- if (fits && (util < uclamp_min) &&
- (uclamp_min > get_actual_cpu_capacity(cpu)))
+ if (fits_capacity(util, capacity))
return -1;
- return fits;
+ return 0;

The possible return values stay the same it seems: 1 if the uclamp'd util (and
so util_avg) fits, -1 if only util_avg fits, and 0 if neither the uclamp'd
util nor util_avg fits.

Yes, this is the intended change to address some comments elsewhere and in previous revisions.
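
For reference, reassembled from the hunk above, util_fits_cpu() after this change boils down to:

static inline int util_fits_cpu(unsigned long util,
				unsigned long util_uclamp,
				int cpu)
{
	unsigned long capacity = capacity_of(cpu);

	/* The uclamp'd utilization fits the CPU. */
	if (fits_capacity(util_uclamp, capacity))
		return 1;

	/* Only the raw utilization fits. */
	if (fits_capacity(util, capacity))
		return -1;

	/* Neither fits. */
	return 0;
}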

[...]

@@ -6914,6 +6861,10 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
"0 tasks on CFS of CPU %d, but util_avg_bias is %d\n",
rq->cpu, rq->cfs.avg.util_avg_bias);
WRITE_ONCE(rq->cfs.avg.util_avg_bias, 0);
+ WARN_ONCE(rq->cfs.avg.util_est_uclamp,
+ "0 tasks on CFS of CPU %d, but util_est_uclamp is %u\n",
+ rq->cpu, rq->cfs.avg.util_est_uclamp);
+ WRITE_ONCE(rq->cfs.avg.util_est_uclamp, 0);

Maybe better:

if (WARN_ONCE(...
WRITE_ONCE(rq->cfs.avg.util_est_uclamp, 0);

How can this happen, that there are 0 tasks on the rq->cfs hierarchy but
rq->cfs.avg.util_est_uclamp is !0? Is there a specific code path where
this happens?

Ack.

This should never happen. Triggering the warning here means something has gone very wrong in the maths and should be reported.

}
#endif
}
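
Spelled out (reusing the message and arguments from the hunk above), the suggested if (WARN_ONCE(...)) form would be:

	if (WARN_ONCE(rq->cfs.avg.util_est_uclamp,
		      "0 tasks on CFS of CPU %d, but util_est_uclamp is %u\n",
		      rq->cpu, rq->cfs.avg.util_est_uclamp))
		WRITE_ONCE(rq->cfs.avg.util_est_uclamp, 0);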
@@ -7485,7 +7436,7 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
static int
select_idle_capacity(struct task_struct *p, struct sched_domain *sd, int target)
{
- unsigned long task_util, util_min, util_max, best_cap = 0;
+ unsigned long task_util, task_util_uclamp, best_cap = 0;
int fits, best_fits = 0;
int cpu, best_cpu = -1;
struct cpumask *cpus;
@@ -7494,8 +7445,7 @@ select_idle_capacity(struct task_struct *p, struct sched_domain *sd, int target)
cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
task_util = task_util_est(p);
- util_min = uclamp_eff_value(p, UCLAMP_MIN);
- util_max = uclamp_eff_value(p, UCLAMP_MAX);
+ task_util_uclamp = task_util_est_uclamp(p);

select_idle_sibling() could pass task_util and task_util_uclamp into
select_idle_capacity(). This way we would save these calls to
task_util_est() and task_util_est_uclamp() here.

Good idea.

for_each_cpu_wrap(cpu, cpus, target) {
unsigned long cpu_cap = capacity_of(cpu);
@@ -7503,7 +7453,7 @@ select_idle_capacity(struct task_struct *p, struct sched_domain *sd, int target)
if (!available_idle_cpu(cpu) && !sched_idle_cpu(cpu))
continue;
- fits = util_fits_cpu(task_util, util_min, util_max, cpu);
+ fits = util_fits_cpu(task_util, task_util_uclamp, cpu);
/* This CPU fits with all requirements */
if (fits > 0)
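
A rough sketch of Dietmar's suggestion above, with a hypothetical extra pair of parameters on select_idle_capacity() and a simplified excerpt of the caller in select_idle_sibling():

/* Sketch only -- hypothetical extra parameters on select_idle_capacity(). */
static int
select_idle_capacity(struct task_struct *p, struct sched_domain *sd,
		     int target, unsigned long task_util,
		     unsigned long task_util_uclamp);

/* Simplified excerpt of the caller in select_idle_sibling(): */
	if (sched_asym_cpucap_active()) {
		unsigned long task_util = task_util_est(p);
		unsigned long task_util_uclamp = task_util_est_uclamp(p);

		sd = rcu_dereference(per_cpu(sd_asym_cpucapacity, target));
		if (sd) {
			i = select_idle_capacity(p, sd, target, task_util,
						 task_util_uclamp);
			return ((unsigned)i < nr_cpumask_bits) ? i : target;
		}
	}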

[...]