Re: [PATCH] sched/fair: Decrease util_est in presence of idle time
From: Dietmar Eggemann
Date: Fri Dec 20 2024 - 09:48:27 EST
On 20/12/2024 08:47, Vincent Guittot wrote:
> On Thu, 19 Dec 2024 at 18:53, Vincent Guittot
> <vincent.guittot@xxxxxxxxxx> wrote:
>>
>> On Thu, 19 Dec 2024 at 10:12, Pierre Gondois <pierre.gondois@xxxxxxx> wrote:
>>>
>>> util_est signal does not decay if the task utilization is lower
>>> than its runnable signal by a value of 10. This was done to keep
>>
>> The value of 10 is the UTIL_EST_MARGIN that is used to know if it's
>> worth updating util_est
Might be that UTIL_EST_MARGIN is just too small for this usecase? Maybe
the mechanism is too sensitive?
It triggers already when running 10 5% tasks on a Juno-r0 (446 1024 1024
446 446 446) in cases 2 tasks are scheduled on the same little CPU:
...
task_n7-7-2623 [003] nr_queued=2 dequeued=17 rbl=40
task_n9-9-2625 [003] nr_queued=2 dequeued=13 rbl=29
task_n9-9-2625 [004] nr_queued=2 dequeued=23 rbl=55
task_n9-9-2625 [004] nr_queued=2 dequeued=22 rbl=53
...
I'm not sure if the original case (Speedometer on Pix6 ?) which lead to
this implementation was tested with perf/energy numbers back then?
>>> the util_est signal high in case a task shares a rq with another
>>> task and doesn't obtain a desired running time.
>>>
>>> However, tasks sharing a rq obtain the running time they desire
>>> provided that the rq has some idle time. Indeed, either:
>>> - a CPU is always running. The utilization signal of tasks reflects
>>> the running time they obtained. This running time depends on the
>>> niceness of the tasks. A decreasing utilization signal doesn't
>>> reflect a decrease of the task activity and the util_est signal
>>> should not be decayed in this case.
>>> - a CPU is not always running (i.e. there is some idle time). Tasks
>>> might be waiting to run, increasing their runnable signal, but
>>> eventually run to completion. A decreasing utilization signal
>>> does reflect a decrease of the task activity and the util_est
>>> signal should be decayed in this case.
>>
>> This is not always true
>> Run a task 40ms with a period of 100ms alone on the biggest cpu at max
>> compute capacity. its util_avg is up to 674 at dequeue as well as its
>> util_est
>> Then start a 2nd task with the exact same behavior on the same cpu.
>> The util_avg of this 2nd task will be only 496 at dequeue as well as
>> its util_est but there is still 20ms of idle time. Furthermore, The
>> util_avg of the 1st task is also around 496 at dequeue but
>
> the end of the sentence was missing...
>
> but there is still 20ms of idle time.
But these two tasks are still able to finish there activity within this
100ms window. So why should we keep their util_est values high when
dequeuing?
[...]
>>> The initial patch [2] aimed to solve an issue detected while running
>>> speedometer 2.0 [3]. While running speedometer 2.0 on a Pixel6, 3
>>> versions are compared:
>>> - base: the current version
>>
>> What do you mean by current version ? tip/sched/core ?
>>
>>> - patch: the new version, with this patch applied
>>> - revert: the initial version, with commit [2] reverted
>>>
>>> Score (higher is better):
>>> ┌────────────┬────────────┬────────────┬─────────────┬──────────────┐
>>> │ base mean ┆ patch mean ┆revert mean ┆ ratio_patch ┆ ratio_revert │
>>> ╞════════════╪════════════╪════════════╪═════════════╪══════════════╡
>>> │ 108.16 ┆ 104.06 ┆ 105.82 ┆ -3.94% ┆ -2.16% │
>>> └────────────┴────────────┴────────────┴─────────────┴──────────────┘
>>> ┌───────────┬───────────┬────────────┐
>>> │ base std ┆ patch std ┆ revert std │
>>> ╞═══════════╪═══════════╪════════════╡
>>> │ 0.57 ┆ 0.49 ┆ 0.58 │
>>> └───────────┴───────────┴────────────┘
>>>
>>> Energy measured with energy counters:
>>> ┌────────────┬────────────┬────────────┬─────────────┬──────────────┐
>>> │ base mean ┆ patch mean ┆revert mean ┆ ratio_patch ┆ ratio_revert │
>>> ╞════════════╪════════════╪════════════╪═════════════╪══════════════╡
>>> │ 141262.79 ┆ 130630.09 ┆ 134108.07 ┆ -7.52% ┆ -5.64% │
>>> └────────────┴────────────┴────────────┴─────────────┴──────────────┘
>>> ┌───────────┬───────────┬────────────┐
>>> │ base std ┆ patch std ┆ revert std │
>>> ╞═══════════╪═══════════╪════════════╡
>>> │ 1347.13 ┆ 2431.67 ┆ 510.88 │
>>> └───────────┴───────────┴────────────┘
>>>
>>> Energy computed from util signals and energy model:
>>> ┌────────────┬────────────┬────────────┬─────────────┬──────────────┐
>>> │ base mean ┆ patch mean ┆revert mean ┆ ratio_patch ┆ ratio_revert │
>>> ╞════════════╪════════════╪════════════╪═════════════╪══════════════╡
>>> │ 2.0539e12 ┆ 1.3569e12 ┆ 1.3637e+12 ┆ -33.93% ┆ -33.60% │
>>> └────────────┴────────────┴────────────┴─────────────┴──────────────┘
>>> ┌───────────┬───────────┬────────────┐
>>> │ base std ┆ patch std ┆ revert std │
>>> ╞═══════════╪═══════════╪════════════╡
>>> │ 2.9206e10 ┆ 2.5434e10 ┆ 1.7106e+10 │
>>> └───────────┴───────────┴────────────┘
>>>
>>> OU ratio in % (ratio of time being overutilized over total time).
>>> The test lasts ~65s:
>>> ┌────────────┬────────────┬─────────────┐
>>> │ base mean ┆ patch mean ┆ revert mean │
>>> ╞════════════╪════════════╪═════════════╡
>>> │ 63.39% ┆ 12.48% ┆ 12.28% │
>>> └────────────┴────────────┴─────────────┘
>>> ┌───────────┬───────────┬─────────────┐
>>> │ base std ┆ patch std ┆ revert mean │
>>> ╞═══════════╪═══════════╪═════════════╡
>>> │ 0.97 ┆ 0.28 ┆ 0.88 │
>>> └───────────┴───────────┴─────────────┘
>>>
>>> The energy gain can be explained by the fact that the system is
>>> overutilized during most of the test with the base version.
>>>
>>> During the test, the base condition is evaluated to true ~40%
>>> of the time. The new condition is evaluated to true ~2% of
>>> the time. Preventing util_est signals to decay with the base
>>> condition has a significant impact on the overutilized state
>>> due to an overestimation of the resulting utilization of tasks.
>>>
>>> The score is impacted by the patch, but:
>>> - it is expected to have slightly lower scores with EAS running more
>>> often
>>> - the base version making the system run at higher frequencies by
>>> overestimating task utilization, it is expected to have higher scores
>>
>> I'm not sure to get what you are trying to solve here ?
Yeah, the question is how much perf loss we accept for energy savings?
IMHO, impossible to answer generically based on one specific
workload/platform incarnation.
[...]
>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>> index 3e9ca38512de..d058ab29e52e 100644
>>> --- a/kernel/sched/fair.c
>>> +++ b/kernel/sched/fair.c
>>> @@ -5033,7 +5033,7 @@ static inline void util_est_update(struct cfs_rq *cfs_rq,
>>> * To avoid underestimate of task utilization, skip updates of EWMA if
>>> * we cannot grant that thread got all CPU time it wanted.
>>> */
>>> - if ((dequeued + UTIL_EST_MARGIN) < task_runnable(p))
>>> + if (rq_no_idle_pelt(rq_of(cfs_rq)))
>>
>> You can't use here the test that is done in
>> update_idle_rq_clock_pelt() to detect if we lost some idle time
>> because this test is only relevant when the rq becomes idle which is
>> not the case here
Do you mean this test ?
util_avg = util_sum / divider
util_sum >= divider * util_avg
with 'divider = LOAD_AVG_MAX - 1024' and 'util_avg = 1024 - 1' and upper
bound of the window (+ 1024):
util_sum >= (LOAD_AVG_MAX - 1024) << SCHED_CAPACITY_SHIFT - LOAD_AVG_MAX
Why can't we use it here?
>> With this test you skip completely the cases where the task has to
>> share the CPU with others. As an example on the pixel 6, the little
True. But I assume that's anticipated here. The assumption is that as
long as there is idle time, tasks get what they want in a time frame.
>> cpus must run more than 1.2 seconds at its max freq before detecting
>> that there is no idle time
BTW, I tried to figure out where the 1.2s comes from: 323ms * 1024/160 =
2.07s (with CPU capacity of Pix5 little CPU = 160)?
[...]