Re: [PATCH] sched/fair: Decrease util_est in presence of idle time
From: Pierre Gondois
Date: Thu Jan 09 2025 - 10:32:11 EST
Hello Vincent,
Thanks for the review,
On 12/20/24 16:05, Vincent Guittot wrote:
On Fri, 20 Dec 2024 at 15:48, Dietmar Eggemann <dietmar.eggemann@xxxxxxx> wrote:
On 20/12/2024 08:47, Vincent Guittot wrote:
On Thu, 19 Dec 2024 at 18:53, Vincent Guittot
<vincent.guittot@xxxxxxxxxx> wrote:
On Thu, 19 Dec 2024 at 10:12, Pierre Gondois <pierre.gondois@xxxxxxx> wrote:
util_est signal does not decay if the task utilization is lower
than its runnable signal by a value of 10. This was done to keep
The value of 10 is the UTIL_EST_MARGIN that is used to know if it's
worth updating util_est
Might be that UTIL_EST_MARGIN is just too small for this usecase? Maybe
the mechanism is too sensitive?
The default config is to follow util_est update
It triggers already when running 10 5% tasks on a Juno-r0 (446 1024 1024
446 446 446) in cases 2 tasks are scheduled on the same little CPU:
...
task_n7-7-2623 [003] nr_queued=2 dequeued=17 rbl=40
task_n9-9-2625 [003] nr_queued=2 dequeued=13 rbl=29
task_n9-9-2625 [004] nr_queued=2 dequeued=23 rbl=55
task_n9-9-2625 [004] nr_queued=2 dequeued=22 rbl=53
...
I'm not sure if the original case (Speedometer on Pix6 ?) which lead to
this implementation was tested with perf/energy numbers back then?
the util_est signal high in case a task shares a rq with another
task and doesn't obtain a desired running time.
However, tasks sharing a rq obtain the running time they desire
provided that the rq has some idle time. Indeed, either:
- a CPU is always running. The utilization signal of tasks reflects
the running time they obtained. This running time depends on the
niceness of the tasks. A decreasing utilization signal doesn't
reflect a decrease of the task activity and the util_est signal
should not be decayed in this case.
- a CPU is not always running (i.e. there is some idle time). Tasks
might be waiting to run, increasing their runnable signal, but
eventually run to completion. A decreasing utilization signal
does reflect a decrease of the task activity and the util_est
signal should be decayed in this case.
This is not always true
Run a task 40ms with a period of 100ms alone on the biggest cpu at max
compute capacity. its util_avg is up to 674 at dequeue as well as its
util_est
Then start a 2nd task with the exact same behavior on the same cpu.
The util_avg of this 2nd task will be only 496 at dequeue as well as
its util_est but there is still 20ms of idle time. Furthermore, The
util_avg of the 1st task is also around 496 at dequeue but
the end of the sentence was missing...
but there is still 20ms of idle time.
But these two tasks are still able to finish there activity within this
100ms window. So why should we keep their util_est values high when
dequeuing?
But then, the util_est decreases from the original value compared to
alone whereas its utilization is the same
In the example with one task, it is possible to have a utilization as high
as we want by increasing the period. With a period of 200ms, the task
reaches a utilization of 750, and with a period of 300ms the max utilization
is 870.
Having a high utilization at dequeue is a usefull information stored in
util_est. It allows to track down that even though the utilization of the task
had time to decrease, the task actually represents a big quantity of
instructions to execute. The task should be handled accordingly.
On the other side, by decreasing the period, the lowest max utilization we
can get is 40% * 1024 = 410.
------------
By having 2 tasks sharing the CPU, the utilization graph is smoothed as one
big period of 40ms followed by 60ms of idle time becomes:
- when the 2 tasks are running, both tasks run alternatively during one sched
slice ~=4ms, so the 40ms running phase becomes a periodic phase with a period
of 8ms and a duty cycle of 50%
- the 60ms idle time is reduced to 20ms idle time for each task
The fact that these tasks could run longer than one sched slice is reflected
in the runnable signal of the tasks.
The duty cycle of the tasks in the co-scheduling phase is 50% and the duty
cycle over the 100ms period is 40%. So the utilization of the tasks can reach
40% * 1024. This is ok, tasks don't prevent each other to reach a utilization
value corresponding to their actual duty_cycle.
This patch intends to detect when a periodic task cannot reach a utilization
value of duty_cycle * 1024 due to other tasks requiring to run.
This would be the case for instance if there were 3 tasks with:
duty_cycle=40%, period=100ms, running during 300ms
In this case, the total running time of the CPU is:
3(tasks) * 40(ms) * 3(periods) = 360ms
There is no idle time during these 360ms and the utilization of tasks reaches
at most 369 (369 < 0.4*1024).
This is different from the case where the task utilization is lower than their
runnable signal. The following task:
---
To get a high util_est / low utilization value:
- Run during a long period
- Idle during a long period
Then loop n times:
- Periodic during 80ms, period=8ms, duty_cycle=51%
(note that the duty_cycle is set to 51% to be sure the running time is
superior to a sched slice of 4ms)
- Idle during 20ms
---
would:
- allow decaying util_est during the looping phase if there was one task
- not allow decaying util_est during the looping phase if there were 2 tasks.
Indeed the runnable signal of the tasks would be higher than their util
signal.
However, the looping phase doesn't represent a long and continuous amount of
instruction to execute. The profile of the task changed and the util_est
value should reflect that.
Checking the delta between the runnable and utilization signal doesn't allow to
detect that the profile of the task changed. Indeed, being runnable doesn't
mean being runnable all the time a task is runnable.
[...]
The initial patch [2] aimed to solve an issue detected while running
speedometer 2.0 [3]. While running speedometer 2.0 on a Pixel6, 3
versions are compared:
- base: the current version
What do you mean by current version ? tip/sched/core ?
I meant using the following condition:
(dequeued + UTIL_EST_MARGIN) < task_runnable(p)
- patch: the new version, with this patch applied
- revert: the initial version, with commit [2] reverted
Score (higher is better):
┌────────────┬────────────┬────────────┬─────────────┬──────────────┐
│ base mean ┆ patch mean ┆revert mean ┆ ratio_patch ┆ ratio_revert │
╞════════════╪════════════╪════════════╪═════════════╪══════════════╡
│ 108.16 ┆ 104.06 ┆ 105.82 ┆ -3.94% ┆ -2.16% │
└────────────┴────────────┴────────────┴─────────────┴──────────────┘
┌───────────┬───────────┬────────────┐
│ base std ┆ patch std ┆ revert std │
╞═══════════╪═══════════╪════════════╡
│ 0.57 ┆ 0.49 ┆ 0.58 │
└───────────┴───────────┴────────────┘
Energy measured with energy counters:
┌────────────┬────────────┬────────────┬─────────────┬──────────────┐
│ base mean ┆ patch mean ┆revert mean ┆ ratio_patch ┆ ratio_revert │
╞════════════╪════════════╪════════════╪═════════════╪══════════════╡
│ 141262.79 ┆ 130630.09 ┆ 134108.07 ┆ -7.52% ┆ -5.64% │
└────────────┴────────────┴────────────┴─────────────┴──────────────┘
┌───────────┬───────────┬────────────┐
│ base std ┆ patch std ┆ revert std │
╞═══════════╪═══════════╪════════════╡
│ 1347.13 ┆ 2431.67 ┆ 510.88 │
└───────────┴───────────┴────────────┘
Energy computed from util signals and energy model:
┌────────────┬────────────┬────────────┬─────────────┬──────────────┐
│ base mean ┆ patch mean ┆revert mean ┆ ratio_patch ┆ ratio_revert │
╞════════════╪════════════╪════════════╪═════════════╪══════════════╡
│ 2.0539e12 ┆ 1.3569e12 ┆ 1.3637e+12 ┆ -33.93% ┆ -33.60% │
└────────────┴────────────┴────────────┴─────────────┴──────────────┘
┌───────────┬───────────┬────────────┐
│ base std ┆ patch std ┆ revert std │
╞═══════════╪═══════════╪════════════╡
│ 2.9206e10 ┆ 2.5434e10 ┆ 1.7106e+10 │
└───────────┴───────────┴────────────┘
OU ratio in % (ratio of time being overutilized over total time).
The test lasts ~65s:
┌────────────┬────────────┬─────────────┐
│ base mean ┆ patch mean ┆ revert mean │
╞════════════╪════════════╪═════════════╡
│ 63.39% ┆ 12.48% ┆ 12.28% │
└────────────┴────────────┴─────────────┘
┌───────────┬───────────┬─────────────┐
│ base std ┆ patch std ┆ revert mean │
╞═══════════╪═══════════╪═════════════╡
│ 0.97 ┆ 0.28 ┆ 0.88 │
└───────────┴───────────┴─────────────┘
The energy gain can be explained by the fact that the system is
overutilized during most of the test with the base version.
During the test, the base condition is evaluated to true ~40%
of the time. The new condition is evaluated to true ~2% of
the time. Preventing util_est signals to decay with the base
condition has a significant impact on the overutilized state
due to an overestimation of the resulting utilization of tasks.
The score is impacted by the patch, but:
- it is expected to have slightly lower scores with EAS running more
often
- the base version making the system run at higher frequencies by
overestimating task utilization, it is expected to have higher scores
I'm not sure to get what you are trying to solve here ?
Yeah, the question is how much perf loss we accept for energy savings?
IMHO, impossible to answer generically based on one specific
workload/platform incarnation.
I think that your problem is somewhere else like the fact the /Sum of
util_est is always higher (and sometime far higher) than the actual
max util_avg because we sum the max of all tasks and as you can see in
my example above, the runnable but not running slice decrease the
util_avg of the task.
[...]
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 3e9ca38512de..d058ab29e52e 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5033,7 +5033,7 @@ static inline void util_est_update(struct cfs_rq *cfs_rq,
* To avoid underestimate of task utilization, skip updates of EWMA if
* we cannot grant that thread got all CPU time it wanted.
*/
- if ((dequeued + UTIL_EST_MARGIN) < task_runnable(p))
+ if (rq_no_idle_pelt(rq_of(cfs_rq)))
You can't use here the test that is done in
update_idle_rq_clock_pelt() to detect if we lost some idle time
because this test is only relevant when the rq becomes idle which is
not the case here
Do you mean this test ?
util_avg = util_sum / divider
util_sum >= divider * util_avg
with 'divider = LOAD_AVG_MAX - 1024' and 'util_avg = 1024 - 1' and upper
bound of the window (+ 1024):
util_sum >= (LOAD_AVG_MAX - 1024) << SCHED_CAPACITY_SHIFT - LOAD_AVG_MAX
Why can't we use it here?
because of the example below, it makes the filtering a nop for a very
large time and you will be overutilized far before
To estimate the amount of time a task requires to reach a certain utilization
value, I did the following:
- Computing the accumulated sum of 'pelt graph' for the first 12 * 32ms.
diff --git a/Documentation/scheduler/sched-pelt.c b/Documentation/scheduler/sched-pelt.c
index 7238b355919c..62df61f6e968 100644
--- a/Documentation/scheduler/sched-pelt.c
+++ b/Documentation/scheduler/sched-pelt.c
@@ -38,7 +38,7 @@ void calc_runnable_avg_yN_sum(void)
int i;
printf("static const u32 runnable_avg_yN_sum[] = {\n\t 0,");
- for (i = 1; i <= HALFLIFE; i++) {
+ for (i = 1; i <= 12 * HALFLIFE; i++) {
if (i == 1)
sum *= y;
else
@@ -103,7 +103,7 @@ void main(void)
y = pow(0.5, 1/(double)HALFLIFE);
calc_runnable_avg_yN_inv();
-// calc_runnable_avg_yN_sum();
+ calc_runnable_avg_yN_sum();
calc_converged_max();
// calc_accumulated_sum_32();
}
- This yields:
static const u32 runnable_avg_yN_sum[] = {
0, 1002, 1982, 2941, 3880, 4798, 5697, 6576, 7437, 8279, 9103,
9909,10698,11470,12226,12966,13690,14398,15091,15769,16433,17082,
17718,18340,18949,19545,20128,20698,21256,21802,22336,22859,23371,
23872,24362,24842,25311,25770,26219,26659,27089,27510,27922,28325,
28720,29106,29484,29854,30216,30570,30917,31256,31588,31913,32231,
32542,32846,33144,33435,33720,33999,34272,34539,34800,35056,35306,
35551,35791,36026,36256,36481,36701,36916,37127,37333,37535,37732,
37925,38114,38299,38480,38657,38830,39000,39166,39328,39487,39642,
39794,39943,40089,40232,40371,40507,40641,40772,40900,41025,41147,
41267,41384,41499,41611,41721,41829,41934,42037,42138,42237,42334,
42428,42520,42610,42699,42786,42871,42954,43035,43114,43192,43268,
43342,43415,43486,43556,43624,43691,43756,43820,43883,43944,44004,
44063,44120,44176,44231,44285,44338,44389,44439,44488,44536,44583,
44629,44674,44718,44761,44803,44845,44886,44926,44965,45003,45040,
45076,45112,45147,45181,45214,45247,45279,45310,45341,45371,45400,
45429,45457,45485,45512,45538,45564,45589,45614,45638,45662,45685,
45708,45730,45752,45773,45794,45814,45834,45853,45872,45891,45909,
45927,45944,45961,45978,45994,46010,46026,46041,46056,46071,46085,
46099,46113,46126,46139,46152,46165,46177,46189,46201,46213,46224,
46235,46246,46257,46267,46277,46287,46297,46307,46316,46325,46334,
46343,46352,46360,46368,46376,46384,46392,46399,46406,46413,46420,
46427,46434,46441,46447,46453,46459,46465,46471,46477,46483,46489,
46494,46499,46504,46509,46514,46519,46524,46529,46534,46538,46542,
46546,46550,46554,46558,46562,46566,46570,46574,46578,46581,46584,
46587,46590,46593,46596,46599,46602,46605,46608,46611,46614,46617,
46620,46623,46626,46628,46630,46632,46634,46636,46638,46640,46642,
46644,46646,46648,46650,46652,46654,46656,46658,46660,46662,46664,
46666,46668,46670,46672,46673,46674,46675,46676,46677,46678,46679,
46680,46681,46682,46683,46684,46685,46686,46687,46688,46689,46690,
46691,46692,46693,46694,46695,46696,46697,46698,46699,46700,46701,
46702,46703,46704,46705,46706,46707,46708,46709,46710,46711,46712,
46713,46714,46715,46716,46717,46718,46718,46718,46718,46718,46718,
46718,46718,46718,46718,46718,46718,46718,46718,46718,46718,46718,
46718,46718,46718,46718,46718,46718,46718,46718,46718,46718,46718,
46718,46718,46718,46718,46718,46718,46718,46718,46718,46718,46718,
};
- Due to some approximations during the computation (I presume) the accumulated
sum doesn't converge toward 47742, but toward 46718, so I'll use 46718.
- Computing the estimated utilization after Xms of running time with, for each
value V: estimated_util = 1024 * (V / 46718)
This yields the following util estimation array. For instance, after 12ms,
the estimated utilization is 234.
0: 0 21 43 64 85 105 124 144 163 181
10: 199 217 234 251 267 284 300 315 330 345
20: 360 374 388 401 415 428 441 453 465 477
30: 489 501 512 523 533 544 554 564 574 584
40: 593 602 612 620 629 637 646 654 662 670
50: 677 685 692 699 706 713 719 726 732 739
60: 745 751 757 762 768 773 779 784 789 794
70: 799 804 809 813 818 822 827 831 835 839
80: 843 847 851 854 858 862 865 868 872 875
90: 878 881 884 887 890 893 896 899 901 904
100: 907 909 912 914 916 919 921 923 925 927
110: 929 931 933 935 937 939 941 943 945 946
120: 948 950 951 953 954 956 957 959 960 961
130: 963 964 965 967 968 969 970 971 972 974
140: 975 976 977 978 979 980 981 982 982 983
150: 984 985 986 987 988 988 989 990 991 991
160: 992 993 993 994 995 995 996 996 997 998
170: 998 999 999 1000 1000 1001 1001 1002 1002 1003
180: 1003 1004 1004 1005 1005 1005 1006 1006 1007 1007
190: 1007 1008 1008 1008 1009 1009 1009 1010 1010 1010
200: 1011 1011 1011 1011 1012 1012 1012 1012 1013 1013
210: 1013 1013 1014 1014 1014 1014 1014 1015 1015 1015
220: 1015 1015 1016 1016 1016 1016 1016 1017 1017 1017
230: 1017 1017 1017 1017 1018 1018 1018 1018 1018 1018
240: 1018 1018 1019 1019 1019 1019 1019 1019 1019 1019
250: 1019 1020 1020 1020 1020 1020 1020 1020 1020 1020
260: 1020 1020 1020 1021 1021 1021 1021 1021 1021 1021
270: 1021 1021 1021 1021 1021 1021 1021 1021 1022 1022
280: 1022 1022 1022 1022 1022 1022 1022 1022 1022 1022
290: 1022 1022 1022 1022 1022 1022 1022 1022 1022 1022
300: 1022 1023 1023 1023 1023 1023 1023 1023 1023 1023
310: 1023 1023 1023 1023 1023 1023 1023 1023 1023 1023
320: 1023 1023 1023 1023 1023 1023 1023 1023 1023 1023
330: 1023 1023 1023 1023 1023 1023 1023 1023 1023 1023
340: 1023 1023 1023 1023 1023 1023 1024 1024 1024 1024
350: 1024 1024 1024 1024 1024 1024 1024 1024 1024 1024
360: 1024 1024 1024 1024 1024 1024 1024 1024 1024 1024
370: 1024 1024 1024 1024 1024 1024 1024 1024 1024 1024
380: 1024 1024 1024 1024 1024
On a Pixel6, a little CPU reached 169 utilization after 52ms of continuous
running time. The capacity of little CPUs is 160. So due to time scaling,
this should match the value at (52 * (160 / 1024))= 8.12ms.
The estimated utilization at 8ms is 163, so this should match. Just to be
accurate:
- 52ms: wall clock time
- 8ms: pelt scaled time
The other way around, a CPU with a capacity of 963 (cf the array above at 180ms)
will reach a utilization of 1003 after (130 * (1024 / 1003))= 183ms
- 183ms: wall clock time
- 180ms: pelt scaled time
------------
All of this just to highlight that:
- being overutilized already depends on the capacity of a CPU
- the lower the capacity, the easier it is to become overutilized
This is if overutilized means 'having a CPU utilization reaching 80% of a
CPU capacity'.
If overutilized means 'not having enough compute power to correctly estimate
a task utilization', then indeed it takes 2.07s for a 160-capacity CPU to
realize that. But FWIU, this is the current behaviour as CPUs have the ability
to estimate a task utilization beyond their own capacity.
I don't see why having 2 tasks instead of 1 would make a difference, their
utilization would just raise half fast as if their were alone on the CPU,
but nothing more IIUC.
------------
Also, I think the original issue is to detect cases where tasks cannot reach
a max utilization corresponding to their duty cycle. I.e. cases where the
utilization of a task is always strictly below the value
(task_duty_cycle * 1024). This being due to other tasks preventing to run
as much time as desired.
I don't think this is what happens when 2 tasks run on a non-big CPU, as long
as there is idle time on the non-big CPU. This even though their respective
utilization goes above the CPU capacity.
On a 512-capacity CPU, 2 periodic tasks with a duty cycle of 20% and a period
of 100ms should have correct utilization values, even if the utilization of the
CPU goes above its capacity. On the Pixel6 where mid CPUs have a capacity of
498, these tasks reach a utilization of 323, and the CPU reaches a utilization
of 662.
With this test you skip completely the cases where the task has to
share the CPU with others. As an example on the pixel 6, the little
True. But I assume that's anticipated here. The assumption is that as
long as there is idle time, tasks get what they want in a time frame.
cpus must run more than 1.2 seconds at its max freq before detecting
that there is no idle time
BTW, I tried to figure out where the 1.2s comes from: 323ms * 1024/160 =
2.07s (with CPU capacity of Pix5 little CPU = 160)?
yeah, I use the wrong rb5 little capacity instead of pixel6 but that even worse
[...]