Re: [PATCH] sched/fair: Decrease util_est in presence of idle time

From: Vincent Guittot
Date: Thu Dec 19 2024 - 12:58:08 EST


On Thu, 19 Dec 2024 at 10:12, Pierre Gondois <pierre.gondois@xxxxxxx> wrote:
>
> util_est signal does not decay if the task utilization is lower
> than its runnable signal by a value of 10. This was done to keep

The value of 10 is the UTIL_EST_MARGIN that is used to know if it's
worth updating util_est

> the util_est signal high in case a task shares a rq with another
> task and doesn't obtain a desired running time.
>
> However, tasks sharing a rq obtain the running time they desire
> provided that the rq has some idle time. Indeed, either:
> - a CPU is always running. The utilization signal of tasks reflects
> the running time they obtained. This running time depends on the
> niceness of the tasks. A decreasing utilization signal doesn't
> reflect a decrease of the task activity and the util_est signal
> should not be decayed in this case.
> - a CPU is not always running (i.e. there is some idle time). Tasks
> might be waiting to run, increasing their runnable signal, but
> eventually run to completion. A decreasing utilization signal
> does reflect a decrease of the task activity and the util_est
> signal should be decayed in this case.

This is not always true
Run a task 40ms with a period of 100ms alone on the biggest cpu at max
compute capacity. its util_avg is up to 674 at dequeue as well as its
util_est
Then start a 2nd task with the exact same behavior on the same cpu.
The util_avg of this 2nd task will be only 496 at dequeue as well as
its util_est but there is still 20ms of idle time. Furthermore, The
util_avg of the 1st task is also around 496 at dequeue but

>
> ------------
>
> Running on a 1024 capacity CPU:
> - TaskA:
> - duty_cycle=60%, period=16ms, duration=2s
> - TaskB:
> - sleep=2ms (to let TaskA run first), duration=1s
> - first phase: duty_cycle=20%, period=16ms, duration=1s
> - second phase: duty_cycle=5%, period=16ms, duration=1s
>
> When TaskB starts the second phase, the util_est signal can take up to
> ~150ms before starting to decrease. Indeed, if TaskB runs after
> TaskA, its runnable signal will be higher than its util signal by more
> than 10 units.
> This creates unnecessary frequency spikes: upon enqueuing TaskB,
> util_est_cfs is increased by the old value of util_est of TaskB,
> impacting frequency selection through:
> sugov_get_util()
> \-cpu_util_cfs_boost()
> \-cpu_util()
> util_est = READ_ONCE(cfs_rq->avg.util_est);
>
> ------------
>
> Energy impact can also be measured as suggested by Hongyan at [2].
> On a Pixel6, the following workload is run 10 times:
> - TaskA:
> - duty_cycle=20%, duration=0.4s
> - one task per mid and big CPU (2 mid and 2 big CPUs, so 4 in total)
> - Used to increase the runnable signals of TaskB
> - TaskB:
> - sleep=2ms (to let TaskA run first)
> - first phase: duty_cycle=20%, duration=0.2s
> - second phase: duty_cycle=5%, duration=0.2s
> - 4 occurrences of TaskB
>
> The duration of the workload is low (0.4s) to emphasis the impact of
> continuing to run at an overestimated frequency.
>
> Energy measured with energy counters:
> ┌────────────┬────────────┬───────────┬───────────┬───────────┐
> │ base mean ┆ patch mean ┆ base std ┆ patch std ┆ ratio (%) │
> ╞════════════╪════════════╪═══════════╪═══════════╪═══════════╡
> │ 536.412419 ┆ 487.232935 ┆ 68.591493 ┆ 66.862019 ┆ -9.16 │
> └────────────┴────────────┴───────────┴───────────┴───────────┘
>
> Energy computed from util signals and energy model:
> ┌────────────┬────────────┬───────────┬───────────┬───────────┐
> │ base mean ┆ patch mean ┆ base std ┆ patch std ┆ ratio (%) │
> ╞════════════╪════════════╪═══════════╪═══════════╪═══════════╡
> │ 4.8318e9 ┆ 4.0823e9 ┆ 5.1044e8 ┆ 7.5558e8 ┆ -15.51 │
> └────────────┴────────────┴───────────┴───────────┴───────────┘
>
> ------------
>
> The initial patch [2] aimed to solve an issue detected while running
> speedometer 2.0 [3]. While running speedometer 2.0 on a Pixel6, 3
> versions are compared:
> - base: the current version

What do you mean by current version ? tip/sched/core ?

> - patch: the new version, with this patch applied
> - revert: the initial version, with commit [2] reverted
>
> Score (higher is better):
> ┌────────────┬────────────┬────────────┬─────────────┬──────────────┐
> │ base mean ┆ patch mean ┆revert mean ┆ ratio_patch ┆ ratio_revert │
> ╞════════════╪════════════╪════════════╪═════════════╪══════════════╡
> │ 108.16 ┆ 104.06 ┆ 105.82 ┆ -3.94% ┆ -2.16% │
> └────────────┴────────────┴────────────┴─────────────┴──────────────┘
> ┌───────────┬───────────┬────────────┐
> │ base std ┆ patch std ┆ revert std │
> ╞═══════════╪═══════════╪════════════╡
> │ 0.57 ┆ 0.49 ┆ 0.58 │
> └───────────┴───────────┴────────────┘
>
> Energy measured with energy counters:
> ┌────────────┬────────────┬────────────┬─────────────┬──────────────┐
> │ base mean ┆ patch mean ┆revert mean ┆ ratio_patch ┆ ratio_revert │
> ╞════════════╪════════════╪════════════╪═════════════╪══════════════╡
> │ 141262.79 ┆ 130630.09 ┆ 134108.07 ┆ -7.52% ┆ -5.64% │
> └────────────┴────────────┴────────────┴─────────────┴──────────────┘
> ┌───────────┬───────────┬────────────┐
> │ base std ┆ patch std ┆ revert std │
> ╞═══════════╪═══════════╪════════════╡
> │ 1347.13 ┆ 2431.67 ┆ 510.88 │
> └───────────┴───────────┴────────────┘
>
> Energy computed from util signals and energy model:
> ┌────────────┬────────────┬────────────┬─────────────┬──────────────┐
> │ base mean ┆ patch mean ┆revert mean ┆ ratio_patch ┆ ratio_revert │
> ╞════════════╪════════════╪════════════╪═════════════╪══════════════╡
> │ 2.0539e12 ┆ 1.3569e12 ┆ 1.3637e+12 ┆ -33.93% ┆ -33.60% │
> └────────────┴────────────┴────────────┴─────────────┴──────────────┘
> ┌───────────┬───────────┬────────────┐
> │ base std ┆ patch std ┆ revert std │
> ╞═══════════╪═══════════╪════════════╡
> │ 2.9206e10 ┆ 2.5434e10 ┆ 1.7106e+10 │
> └───────────┴───────────┴────────────┘
>
> OU ratio in % (ratio of time being overutilized over total time).
> The test lasts ~65s:
> ┌────────────┬────────────┬─────────────┐
> │ base mean ┆ patch mean ┆ revert mean │
> ╞════════════╪════════════╪═════════════╡
> │ 63.39% ┆ 12.48% ┆ 12.28% │
> └────────────┴────────────┴─────────────┘
> ┌───────────┬───────────┬─────────────┐
> │ base std ┆ patch std ┆ revert mean │
> ╞═══════════╪═══════════╪═════════════╡
> │ 0.97 ┆ 0.28 ┆ 0.88 │
> └───────────┴───────────┴─────────────┘
>
> The energy gain can be explained by the fact that the system is
> overutilized during most of the test with the base version.
>
> During the test, the base condition is evaluated to true ~40%
> of the time. The new condition is evaluated to true ~2% of
> the time. Preventing util_est signals to decay with the base
> condition has a significant impact on the overutilized state
> due to an overestimation of the resulting utilization of tasks.
>
> The score is impacted by the patch, but:
> - it is expected to have slightly lower scores with EAS running more
> often
> - the base version making the system run at higher frequencies by
> overestimating task utilization, it is expected to have higher scores

I'm not sure to get what you are trying to solve here ?

>
> ------------
>
> Decrease util_est when the rq has some idle time instead of a delta
> between util and runnable signals.
>
> [1] https://lore.kernel.org/lkml/39cde23a-19d8-4e64-a1d2-f26bce264883@xxxxxxx/
> [2] commit 50181c0cff31 ("sched/pelt: Avoid underestimation of task utilization")
> https://lore.kernel.org/lkml/20231122140119.472110-1-vincent.guittot@xxxxxxxxxx/
> [3] https://lore.kernel.org/lkml/CAKfTPtDd-HhF-YiNTtL9i5k0PfJbF819Yxu4YquzfXgwi7voyw@xxxxxxxxxxxxxx/#t
>
> Signed-off-by: Pierre Gondois <pierre.gondois@xxxxxxx>
> ---
> kernel/sched/fair.c | 2 +-
> kernel/sched/pelt.h | 23 ++++++++++++++++-------
> 2 files changed, 17 insertions(+), 8 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 3e9ca38512de..d058ab29e52e 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -5033,7 +5033,7 @@ static inline void util_est_update(struct cfs_rq *cfs_rq,
> * To avoid underestimate of task utilization, skip updates of EWMA if
> * we cannot grant that thread got all CPU time it wanted.
> */
> - if ((dequeued + UTIL_EST_MARGIN) < task_runnable(p))
> + if (rq_no_idle_pelt(rq_of(cfs_rq)))

You can't use here the test that is done in
update_idle_rq_clock_pelt() to detect if we lost some idle time
because this test is only relevant when the rq becomes idle which is
not the case here

With this test you skip completely the cases where the task has to
share the CPU with others. As an example on the pixel 6, the little
cpus must run more than 1.2 seconds at its max freq before detecting
that there is no idle time

> goto done;
>
>
> diff --git a/kernel/sched/pelt.h b/kernel/sched/pelt.h
> index f4f6a0875c66..eaf34f95fd93 100644
> --- a/kernel/sched/pelt.h
> +++ b/kernel/sched/pelt.h
> @@ -123,21 +123,30 @@ static inline void update_rq_clock_pelt(struct rq *rq, s64 delta)
> }
>
> /*
> - * When rq becomes idle, we have to check if it has lost idle time
> - * because it was fully busy. A rq is fully used when the /Sum util_sum
> - * is greater or equal to:
> + * A rq is fully used (or is considered to not have enough idle time)
> + * when the /Sum util_sum is greater or equal to:
> * (LOAD_AVG_MAX - 1024 + rq->cfs.avg.period_contrib) << SCHED_CAPACITY_SHIFT;
> * For optimization and computing rounding purpose, we don't take into account
> * the position in the current window (period_contrib) and we use the higher
> * bound of util_sum to decide.
> */
> -static inline void update_idle_rq_clock_pelt(struct rq *rq)
> +static inline bool rq_no_idle_pelt(struct rq *rq)
> {
> - u32 divider = ((LOAD_AVG_MAX - 1024) << SCHED_CAPACITY_SHIFT) - LOAD_AVG_MAX;
> - u32 util_sum = rq->cfs.avg.util_sum;
> + u32 util_sum, divider;
> +
> + divider = (PELT_MIN_DIVIDER << SCHED_CAPACITY_SHIFT) - LOAD_AVG_MAX;
> + util_sum = rq->cfs.avg.util_sum;
> util_sum += rq->avg_rt.util_sum;
> util_sum += rq->avg_dl.util_sum;
> + return util_sum >= divider;
> +}
>
> +/*
> + * When rq becomes idle, we have to check if it has lost idle time
> + * because it was fully busy.
> + */
> +static inline void update_idle_rq_clock_pelt(struct rq *rq)
> +{
> /*
> * Reflecting stolen time makes sense only if the idle
> * phase would be present at max capacity. As soon as the
> @@ -147,7 +156,7 @@ static inline void update_idle_rq_clock_pelt(struct rq *rq)
> * this case. We keep track of this lost idle time compare to
> * rq's clock_task.
> */
> - if (util_sum >= divider)
> + if (rq_no_idle_pelt(rq))
> rq->lost_idle_time += rq_clock_task(rq) - rq->clock_pelt;
>
> _update_idle_rq_clock_pelt(rq);
> --
> 2.25.1
>