Re: [PATCH] sched/fair: Decrease util_est in presence of idle time
From: Vincent Guittot
Date: Fri Jan 10 2025 - 04:06:36 EST
On Thu, 9 Jan 2025 at 16:32, Pierre Gondois <pierre.gondois@xxxxxxx> wrote:
>
> Hello Vincent,
>
> Thanks for the review,
>
> On 12/20/24 16:05, Vincent Guittot wrote:
> > On Fri, 20 Dec 2024 at 15:48, Dietmar Eggemann <dietmar.eggemann@xxxxxxx> wrote:
> >>
> >> On 20/12/2024 08:47, Vincent Guittot wrote:
> >>> On Thu, 19 Dec 2024 at 18:53, Vincent Guittot
> >>> <vincent.guittot@xxxxxxxxxx> wrote:
> >>>>
> >>>> On Thu, 19 Dec 2024 at 10:12, Pierre Gondois <pierre.gondois@xxxxxxx> wrote:
> >>>>>
> >>>>> util_est signal does not decay if the task utilization is lower
> >>>>> than its runnable signal by a value of 10. This was done to keep
> >>>>
> >>>> The value of 10 is the UTIL_EST_MARGIN that is used to know if it's
> >>>> worth updating util_est
> >> Might be that UTIL_EST_MARGIN is just too small for this usecase? Maybe
> >> the mechanism is too sensitive?
> >
> > The default config is to follow util_est update
> >
> >>
> >> It triggers already when running 10 5% tasks on a Juno-r0 (446 1024 1024
> >> 446 446 446) in cases 2 tasks are scheduled on the same little CPU:
> >>
> >> ...
> >> task_n7-7-2623 [003] nr_queued=2 dequeued=17 rbl=40
> >> task_n9-9-2625 [003] nr_queued=2 dequeued=13 rbl=29
> >> task_n9-9-2625 [004] nr_queued=2 dequeued=23 rbl=55
> >> task_n9-9-2625 [004] nr_queued=2 dequeued=22 rbl=53
> >> ...
> >>
> >> I'm not sure if the original case (Speedometer on Pix6 ?) which led to
> >> this implementation was tested with perf/energy numbers back then?
> >>
> >>>>> the util_est signal high in case a task shares a rq with another
> >>>>> task and doesn't obtain a desired running time.
> >>>>>
> >>>>> However, tasks sharing a rq obtain the running time they desire
> >>>>> provided that the rq has some idle time. Indeed, either:
> >>>>> - a CPU is always running. The utilization signal of tasks reflects
> >>>>> the running time they obtained. This running time depends on the
> >>>>> niceness of the tasks. A decreasing utilization signal doesn't
> >>>>> reflect a decrease of the task activity and the util_est signal
> >>>>> should not be decayed in this case.
> >>>>> - a CPU is not always running (i.e. there is some idle time). Tasks
> >>>>> might be waiting to run, increasing their runnable signal, but
> >>>>> eventually run to completion. A decreasing utilization signal
> >>>>> does reflect a decrease of the task activity and the util_est
> >>>>> signal should be decayed in this case.
> >>>>
> >>>> This is not always true
> >>>> Run a task 40ms with a period of 100ms alone on the biggest cpu at max
> >>>> compute capacity. its util_avg is up to 674 at dequeue as well as its
> >>>> util_est
> >>>> Then start a 2nd task with the exact same behavior on the same cpu.
> >>>> The util_avg of this 2nd task will be only 496 at dequeue as well as
> >>>> its util_est but there is still 20ms of idle time. Furthermore, The
> >>>> util_avg of the 1st task is also around 496 at dequeue but
> >>>
> >>> the end of the sentence was missing...
> >>>
> >>> but there is still 20ms of idle time.
> >>
> >> But these two tasks are still able to finish their activity within this
> >> 100ms window. So why should we keep their util_est values high when
> >> dequeuing?
> >
> > But then, the util_est decreases from the original value compared to
> > alone whereas its utilization is the same
>
> In the example with one task, it is possible to have a utilization as high
> as we want by increasing the period. With a period of 200ms, the task
> reaches a utilization of 750, and with a period of 300ms the max utilization
> is 870.
> Having a high utilization at dequeue is useful information stored in
> util_est. It makes it possible to track that, even though the utilization of
> the task had time to decrease, the task actually represents a big quantity of
> instructions to execute. The task should be handled accordingly.
>
> On the other side, by decreasing the period, the lowest max utilization we
> can get is 40% * 1024 = 410.
>
> ------------
>
> By having 2 tasks sharing the CPU, the utilization graph is smoothed as one
> big period of 40ms followed by 60ms of idle time becomes:
> - when the 2 tasks are running, both tasks run alternatively during one sched
> slice ~=4ms, so the 40ms running phase becomes a periodic phase with a period
> of 8ms and a duty cycle of 50%
> - the 60ms idle time is reduced to 20ms idle time for each task
> The fact that these tasks could run longer than one sched slice is reflected
> in the runnable signal of the tasks.
> The duty cycle of the tasks in the co-scheduling phase is 50% and the duty
> cycle over the 100ms period is 40%. So the utilization of the tasks can reach
> 40% * 1024. This is ok, tasks don't prevent each other to reach a utilization
> value corresponding to their actual duty_cycle.
>
> This patch intends to detect when a periodic task cannot reach a utilization
> value of duty_cycle * 1024 due to other tasks requiring to run.
> This would be the case for instance if there were 3 tasks with:
> duty_cycle=40%, period=100ms, running during 300ms
> In this case, the total running time of the CPU is:
> 3(tasks) * 40(ms) * 3(periods) = 360ms
> There is no idle time during these 360ms and the utilization of tasks reaches
> at most 369 (369 < 0.4*1024).
>
> This is different from the case where the task utilization is lower than their
> runnable signal. The following task:
> ---
> To get a high util_est / low utilization value:
> - Run during a long period
> - Idle during a long period
> Then loop n times:
> - Periodic during 80ms, period=8ms, duty_cycle=51%
> (note that the duty_cycle is set to 51% to be sure the running time is
> superior to a sched slice of 4ms)
> - Idle during 20ms
> ---
> would:
> - allow decaying util_est during the looping phase if there was one task
> - not allow decaying util_est during the looping phase if there were 2 tasks.
> Indeed the runnable signal of the tasks would be higher than their util
> signal.
>
> However, the looping phase doesn't represent a long and continuous amount of
> instruction to execute. The profile of the task changed and the util_est
> value should reflect that.
> Checking the delta between the runnable and utilization signal doesn't allow to
> detect that the profile of the task changed. Indeed, being runnable doesn't
> mean being runnable all the time a task is runnable.
I fully agree that the current solution is not perfect, as it assumes
that when runnable_avg > util_avg, the task didn't fully run as
expected and its util_avg at dequeue might not be correct, as described
in my example. I also agree that some other cases fall under this
condition when they should not, but your proposal fails to detect them
correctly
>
> >
> >>
> >> [...]
> >>
> >>>>> The initial patch [2] aimed to solve an issue detected while running
> >>>>> speedometer 2.0 [3]. While running speedometer 2.0 on a Pixel6, 3
> >>>>> versions are compared:
> >>>>> - base: the current version
> >>>>
> >>>> What do you mean by current version ? tip/sched/core ?
>
> I meant using the following condition:
> (dequeued + UTIL_EST_MARGIN) < task_runnable(p)
I meant: what is your base tree? v6.12? v6.13-rcX? tip/sched/core?
I tried your patch on top of android mainline v6.12 but don't get the
same results, in particular for the overutilized ratio.
In my tests, your patch doesn't make any real difference:
similar speedometer score 87.96 vs 87.4 (running locally and not over wifi)
similar overutilized ratio 67% vs 61%
similar energy counters 171232171 vs 166066813 (sum of CPU cluster counters)
I consider these results equivalent, as the thermal environment (ambient temp
and skin temp at the beginning of the test) and the thermal mitigation have an
impact on the results.
What am I missing compared to your setup?
>
> >>>>
> >>>>> - patch: the new version, with this patch applied
> >>>>> - revert: the initial version, with commit [2] reverted
> >>>>>
> >>>>> Score (higher is better):
> >>>>> ┌────────────┬────────────┬────────────┬─────────────┬──────────────┐
> >>>>> │ base mean ┆ patch mean ┆revert mean ┆ ratio_patch ┆ ratio_revert │
> >>>>> ╞════════════╪════════════╪════════════╪═════════════╪══════════════╡
> >>>>> │ 108.16 ┆ 104.06 ┆ 105.82 ┆ -3.94% ┆ -2.16% │
> >>>>> └────────────┴────────────┴────────────┴─────────────┴──────────────┘
> >>>>> ┌───────────┬───────────┬────────────┐
> >>>>> │ base std ┆ patch std ┆ revert std │
> >>>>> ╞═══════════╪═══════════╪════════════╡
> >>>>> │ 0.57 ┆ 0.49 ┆ 0.58 │
> >>>>> └───────────┴───────────┴────────────┘
> >>>>>
> >>>>> Energy measured with energy counters:
> >>>>> ┌────────────┬────────────┬────────────┬─────────────┬──────────────┐
> >>>>> │ base mean ┆ patch mean ┆revert mean ┆ ratio_patch ┆ ratio_revert │
> >>>>> ╞════════════╪════════════╪════════════╪═════════════╪══════════════╡
> >>>>> │ 141262.79 ┆ 130630.09 ┆ 134108.07 ┆ -7.52% ┆ -5.64% │
> >>>>> └────────────┴────────────┴────────────┴─────────────┴──────────────┘
> >>>>> ┌───────────┬───────────┬────────────┐
> >>>>> │ base std ┆ patch std ┆ revert std │
> >>>>> ╞═══════════╪═══════════╪════════════╡
> >>>>> │ 1347.13 ┆ 2431.67 ┆ 510.88 │
> >>>>> └───────────┴───────────┴────────────┘
> >>>>>
> >>>>> Energy computed from util signals and energy model:
> >>>>> ┌────────────┬────────────┬────────────┬─────────────┬──────────────┐
> >>>>> │ base mean ┆ patch mean ┆revert mean ┆ ratio_patch ┆ ratio_revert │
> >>>>> ╞════════════╪════════════╪════════════╪═════════════╪══════════════╡
> >>>>> │ 2.0539e12 ┆ 1.3569e12 ┆ 1.3637e+12 ┆ -33.93% ┆ -33.60% │
> >>>>> └────────────┴────────────┴────────────┴─────────────┴──────────────┘
> >>>>> ┌───────────┬───────────┬────────────┐
> >>>>> │ base std ┆ patch std ┆ revert std │
> >>>>> ╞═══════════╪═══════════╪════════════╡
> >>>>> │ 2.9206e10 ┆ 2.5434e10 ┆ 1.7106e+10 │
> >>>>> └───────────┴───────────┴────────────┘
> >>>>>
> >>>>> OU ratio in % (ratio of time being overutilized over total time).
> >>>>> The test lasts ~65s:
> >>>>> ┌────────────┬────────────┬─────────────┐
> >>>>> │ base mean ┆ patch mean ┆ revert mean │
> >>>>> ╞════════════╪════════════╪═════════════╡
> >>>>> │ 63.39% ┆ 12.48% ┆ 12.28% │
> >>>>> └────────────┴────────────┴─────────────┘
> >>>>> ┌───────────┬───────────┬─────────────┐
> >>>>> │ base std ┆ patch std ┆ revert std │
> >>>>> ╞═══════════╪═══════════╪═════════════╡
> >>>>> │ 0.97 ┆ 0.28 ┆ 0.88 │
> >>>>> └───────────┴───────────┴─────────────┘
> >>>>>
[...]
> >> [...]
> >>
> >>>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> >>>>> index 3e9ca38512de..d058ab29e52e 100644
> >>>>> --- a/kernel/sched/fair.c
> >>>>> +++ b/kernel/sched/fair.c
> >>>>> @@ -5033,7 +5033,7 @@ static inline void util_est_update(struct cfs_rq *cfs_rq,
> >>>>> * To avoid underestimate of task utilization, skip updates of EWMA if
> >>>>> * we cannot grant that thread got all CPU time it wanted.
> >>>>> */
> >>>>> - if ((dequeued + UTIL_EST_MARGIN) < task_runnable(p))
> >>>>> + if (rq_no_idle_pelt(rq_of(cfs_rq)))
> >>>>
> >>>> You can't use here the test that is done in
> >>>> update_idle_rq_clock_pelt() to detect if we lost some idle time
> >>>> because this test is only relevant when the rq becomes idle which is
> >>>> not the case here
> >>
> >> Do you mean this test ?
> >>
> >> util_avg = util_sum / divider
> >>
> >> util_sum >= divider * util_avg
> >>
> >> with 'divider = LOAD_AVG_MAX - 1024' and 'util_avg = 1024 - 1' and upper
> >> bound of the window (+ 1024):
> >>
> >> util_sum >= (LOAD_AVG_MAX - 1024) << SCHED_CAPACITY_SHIFT - LOAD_AVG_MAX
> >>
> >> Why can't we use it here?
> >
> > because of the example below, it makes the filtering a nop for a very
> > large time and you will be overutilized far before
>
>
> To estimate the amount of time a task requires to reach a certain utilization
> value, I did the following:
> - Computing the accumulated sum of 'pelt graph' for the first 12 * 32ms.
You can also compute
(1-y^r)*1024 where r is the number of 1024us periods the task has been running
and
(1-y^r) / (1-y^p) for a task that runs r periods out of a task period of p,
both counted in 1024us periods
Keep in mind that we track 1024us periods, not 1000us
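As a minimal sketch of the two expressions above (not kernel code): the
helper names are made up for illustration, y is taken as 2^(-1/32) so that
y^32 = 0.5, r and p are whole numbers of 1024us periods, and the second
expression is scaled by 1024 to give a utilization value.

```c
/* PELT decay factor per 1024us period: y^32 = 0.5, so y = 2^(-1/32). */
#define PELT_Y 0.978572062087

/* y^n for a whole number n of 1024us periods (hypothetical helper). */
static double pelt_decay(unsigned int n)
{
	double d = 1.0;

	while (n--)
		d *= PELT_Y;
	return d;
}

/* Utilization after running r periods, starting from 0: (1 - y^r) * 1024. */
static double util_after_running(unsigned int r)
{
	return (1.0 - pelt_decay(r)) * 1024.0;
}

/*
 * Steady-state utilization at the end of the running phase for a task that
 * runs r periods out of every p periods: (1 - y^r) / (1 - y^p) * 1024.
 */
static double util_periodic_peak(unsigned int r, unsigned int p)
{
	return (1.0 - pelt_decay(r)) / (1.0 - pelt_decay(p)) * 1024.0;
}
```

For instance, 40ms of running every 100ms is roughly r = 39 and p = 98
periods, so util_periodic_peak(39, 98) gives the idealized peak for that
pattern.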
>
[...]
> - Due to some approximations during the computation (I presume) the accumulated
>   sum doesn't converge toward 47742, but toward 46718, so I'll use 46718.
It's not an approximation: 46718 = 47742*y
The computation is done at the end of a complete pelt period (which is
decayed) before accumulating the current period
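A quick way to check that relation, as a sketch: the constant below is y in
32-bit fixed point (it should match the second entry of the kernel's decay
table, but treat that as an assumption), and decay_one_period() is a
hypothetical helper, not the kernel function.

```c
#include <stdint.h>

#define LOAD_AVG_MAX 47742

/* y = 2^(-1/32) in 32-bit fixed point, i.e. roughly 0.97857 * 2^32. */
#define PELT_Y_FIXED 0xfa83b2daULL

/* Decay a value by one 1024us period: val * y, rounded down. */
static uint64_t decay_one_period(uint64_t val)
{
	return (val * PELT_Y_FIXED) >> 32;
}
```

decay_one_period(LOAD_AVG_MAX) returns 46718.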
[...]
>
> ------------
>
> All of this just to highlight that:
> - being overutilized already depends on the capacity of a CPU
> - the lower the capacity, the easier it is to become overutilized
> This is if overutilized means 'having a CPU utilization reaching 80% of a
> CPU capacity'.
This is the current implementation of cpu overutilized detection
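A simplified sketch of that 80% check, for reference. The 1280/1024 ratio
is the margin used by fits_capacity() as far as I remember; the function
names below are illustrative only, and the real check also takes uclamp and
capacity pressure into account, which are ignored here.

```c
#include <stdbool.h>

/* util fits if it stays below 80% of capacity: util * 1.25 < capacity. */
static bool fits_capacity_sketch(unsigned long util, unsigned long capacity)
{
	return util * 1280 < capacity * 1024;
}

/* A CPU is flagged overutilized once its utilization no longer fits. */
static bool cpu_overutilized_sketch(unsigned long util, unsigned long capacity)
{
	return !fits_capacity_sketch(util, capacity);
}
```

With these numbers a 160-capacity CPU is flagged from a utilization of 128,
and a 1024-capacity CPU from 820.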
>
> If overutilized means 'not having enough compute power to correctly estimate
> a task utilization', then indeed it takes 2.07s for a 160-capacity CPU to
> realize that. But FWIU, this is the current behaviour as CPUs have the ability
> to estimate a task utilization beyond their own capacity.
After the sentence above, I'm not sure what you mean by overutilized.
Being overutilized and being able to correctly estimate task
utilization are 2 different things.
Until we reach 1024, we can't say whether the task didn't get enough cycles
to finish what it has to do, whatever the compute capacity.
When there is idle time and cpu utilization is not 1024, it means
that there was enough compute capacity (but not that we didn't change
the behavior of the task)
Testing util_sum >= divider is relevant when the CPU becomes idle to
know if we miss accounting some cycle. But testing util_sum < divider
at runtime doesn't mean that we didn't lose some cycles, just that we
didn't detect it yet
> I don't see why having 2 tasks instead of 1 would make a difference, their
> utilization would just rise half as fast as if they were alone on the CPU,
> but nothing more IIUC.
There is a difference, as described in my example in my previous email,
because the utilization pattern in the period is not the same in this
case and PELT is not linear with time
>
> ------------
>
> Also, I think the original issue is to detect cases where tasks cannot reach
> a max utilization corresponding to their duty cycle. I.e. cases where the
> utilization of a task is always strictly below the value
> (task_duty_cycle * 1024). This being due to other tasks preventing to run
> as much time as desired.
> I don't think this is what happens when 2 tasks run on a non-big CPU, as long
> as there is idle time on the non-big CPU. This even though their respective
But this is not what your patch/test does !
> utilization goes above the CPU capacity.
Max utilization of a task is always > (task_duty_cycle * 1024), and the
longer the period is, the larger the difference is.
And a task whose max utilization is > CPU compute capacity can still fit
on this CPU. But as of now we don't detect these cases, so we assume the
task doesn't fit
>
> On a 512-capacity CPU, 2 periodic tasks with a duty cycle of 20% and a period
> of 100ms should have correct utilization values, even if the utilization of the
> CPU goes above its capacity. On the Pixel6 where mid CPUs have a capacity of
> 498, these tasks reach a utilization of 323, and the CPU reaches a utilization
> of 662.
An easier solution would be to not use the sum of util_est to know if
a CPU is overutilized or not, but only to select the OPP. Something
like the below:
@@ -8069,7 +8069,7 @@ cpu_util(int cpu, struct task_struct *p, int dst_cpu, int boost)
 	else if (p && task_cpu(p) != cpu && dst_cpu == cpu)
 		util += task_util(p);

-	if (sched_feat(UTIL_EST)) {
+	if (sched_feat(UTIL_EST) && boost) {
 		unsigned long util_est;

 		util_est = READ_ONCE(cfs_rq->avg.util_est);
>
>
> >
> >>
> >>>> With this test you skip completely the cases where the task has to
> >>>> share the CPU with others. As an example on the pixel 6, the little
> >>
> >> True. But I assume that's anticipated here. The assumption is that as
> >> long as there is idle time, tasks get what they want in a time frame.
> >>
> >>>> cpus must run more than 1.2 seconds at its max freq before detecting
> >>>> that there is no idle time
> >>
> >> BTW, I tried to figure out where the 1.2s comes from: 323ms * 1024/160 =
> >> 2.07s (with CPU capacity of Pix5 little CPU = 160)?
> >
> > yeah, I used the wrong capacity (rb5 little instead of pixel6) but that's even worse
> >
> >>
> >> [...]