Re: [RFC PATCH 4/5] sched/fair: Use EAS also when overutilized

From: Pierre Gondois
Date: Fri Oct 11 2024 - 08:52:32 EST


Hello Vincent,

On 10/9/24 10:53, Vincent Guittot wrote:
Hi Pierre,

On Mon, 7 Oct 2024 at 09:03, Pierre Gondois <pierre.gondois@xxxxxxx> wrote:

Hello Vincent,

Sorry for the delay:

On 9/25/24 15:28, Vincent Guittot wrote:
On Thu, 19 Sept 2024 at 10:26, Pierre Gondois <pierre.gondois@xxxxxxx> wrote:

Hello Vincent,
I tried this patch on a Pixel 6 (8 CPUs, 4 little, 2 mid, 2 big)
with patches 1-4/5 using these workloads:
---
A.
a. 8 tasks at 2%/5%/10% during 1s
b. 1 task:
- sleeping during 0.3s
- at 100% during 0.3s
- sleeping during 0.3s

b. is used to reach the overutilized state for a limited amount of time.
EAS operates, then the load balancer does the task placement, then EAS
operates again.

B.
a. 8 tasks at 2%/5%/10% during 1s
b. 1 task:
- at 100% during 1s

---
I'm seeing the energy consumption increase in some cases. This seems to be
due to feec() migrating tasks more often than the load balancer does
for this workload. This leads to utilization 'spikes' and then frequency
'spikes', increasing the overall energy consumption.
This is not entirely related to this patch though, as the task placement seems
correct, i.e. feec() effectively does an optimal placement given the EM and
task utilization. The task placement is just a bit less stable than with
the load balancer.

Would patch 5 help to keep things better placed? In particular if
task b is misplaced at some point because of load balancing?

I assume so; it would require more testing on my side


I agree that load balancing might still contribute to migrating tasks in a
way that is not energy efficient


---
Regarding hackbench, I've reproduced the test you ran on the same Pixel 6.
I have CONFIG_SCHED_CLUSTER enabled, which allows having a sched domain for
each of the little/mid/big CPU groups (without this config, these groups would not exist).

Why did you do this? All CPUs are expected to be in the same sched domain
as they share their LLC

I did this to observe the load balancer a bit more carefully while reviewing
the first patch:
sched/fair: Filter false overloaded_group case for EAS
I've kept this configuration, but indeed it should not bring anything more.




I see an important regression in the result.
I replaced the condition to run feec() through select_task_rq_fair() by:
if (get_rd_overloaded(cpu_rq(cpu)->rd) == 0) {

overloaded is set when more than 1 task runs on a CPU, whatever the
utilization

Yes right, this idea makes little sense.


new_cpu = find_energy_efficient_cpu(p, prev_cpu);
...
}
and obtained better results.

Indeed, for such an intensive workload:
- EAS would not have any energy benefit, so it is better to prioritize
performance (as Christian mentioned)
- EAS would not be able to find a fitting CPU, so running feec() should be
avoided
- as you mention in the commit message, shuffling tasks when one CPU becomes
momentarily overutilized is inefficient energy-wise (even though I don't have
the numbers, it should make sense).
So detecting when the system is overloaded should be a better compromise, I
assume. The condition in sched_balance_find_src_group() to let the load balancer
operate might also need to be updated.

Note:
- base: with patches 1-4/5
- _ou: run feec() when not overutilized
- _ol: run feec() when not overloaded
- mean: hackbench execution time in s.
- delta: negative is better. Value is in percentage.

Could you share your command line? As explained in the cover letter I
have seen some perf regressions, but not in the range that you have
below

What is your base? tip/sched/core?

I am working on a Pixel6, with a branch based on v6.8 with some scheduler patches
to be able to apply your patches cleanly.

TBH, I'm always cautious with this kind of Frankenstein kernel,
especially with all the changes that have happened in the scheduler since
v6.8 compared to tip/sched/core

Yes I understand, I'll re-test it on a Juno with a newer kernel.



The mapping id -> command line is as follows:

Thanks for the commands details, I'm going to have a look

(1) hackbench -l 5120 -g 1
(2) hackbench -l 1280 -g 4
(3) hackbench -l 640 -g 8
(4) hackbench -l 320 -g 16

(5) hackbench -p -l 5120 -g 1
(6) hackbench -p -l 1280 -g 4
(7) hackbench -p -l 640 -g 8
(8) hackbench -p -l 320 -g 16

(9) hackbench -T -l 5120 -g 1
(10) hackbench -T -l 1280 -g 4
(11) hackbench -T -l 640 -g 8
(12) hackbench -T -l 320 -g 16

(13) hackbench -T -p -l 5120 -g 1
(14) hackbench -T -p -l 1280 -g 4
(15) hackbench -T -p -l 640 -g 8
(16) hackbench -T -p -l 320 -g 16



┌─────┬───────────┬──────────┬─────────┬──────────┬─────────┬──────────┬──────────┬──────────┐
│ id ┆ mean_base ┆ std_base ┆ mean_ou ┆ std_ou ┆ mean_ol ┆ std_ol ┆ delta_ou ┆ delta_ol │
╞═════╪═══════════╪══════════╪═════════╪══════════╪═════════╪══════════╪══════════╪══════════╡
│ 1 ┆ 1.9786 ┆ 0.04719 ┆ 3.0856 ┆ 0.122209 ┆ 2.1734 ┆ 0.045203 ┆ 55.95 ┆ 9.85 │

I might have misunderstood your results above last time.
mean_base results include patches 1 to 4 and mean_ou reverts patch 4.
Does it mean that it is 55% better with patch 4? I originally thought
there was a regression with patch 4, but I'm not sure that I understood
correctly after re-reading the table.

The columns are:
- the _base configuration disables EAS/feec() when in the overutilized state,
i.e. patches 1-3 are applied
- the _ou configuration keeps running EAS/feec() when in the overutilized state,
i.e. patches 1-4 are applied
- the _ol configuration should be ignored, as previously established



│ 2 ┆ 1.8991 ┆ 0.019768 ┆ 2.6672 ┆ 0.135266 ┆ 1.98875 ┆ 0.055132 ┆ 40.45 ┆ 4.72 │
│ 3 ┆ 1.9053 ┆ 0.014795 ┆ 2.5761 ┆ 0.141693 ┆ 2.06425 ┆ 0.045901 ┆ 35.21 ┆ 8.34 │
│ 4 ┆ 1.9586 ┆ 0.023439 ┆ 2.5823 ┆ 0.110399 ┆ 2.0955 ┆ 0.053818 ┆ 31.84 ┆ 6.99 │
│ 5 ┆ 1.746 ┆ 0.055676 ┆ 3.3437 ┆ 0.279107 ┆ 1.88 ┆ 0.038184 ┆ 91.51 ┆ 7.67 │
│ 6 ┆ 1.5476 ┆ 0.050131 ┆ 2.6835 ┆ 0.140497 ┆ 1.5645 ┆ 0.081644 ┆ 73.4 ┆ 1.09 │
│ 7 ┆ 1.4562 ┆ 0.062457 ┆ 2.3568 ┆ 0.119213 ┆ 1.48425 ┆ 0.06212 ┆ 61.85 ┆ 1.93 │
│ 8 ┆ 1.3554 ┆ 0.031757 ┆ 2.0609 ┆ 0.112869 ┆ 1.4085 ┆ 0.036601 ┆ 52.05 ┆ 3.92 │
│ 9 ┆ 2.0391 ┆ 0.035732 ┆ 3.4045 ┆ 0.277307 ┆ 2.2155 ┆ 0.019053 ┆ 66.96 ┆ 8.65 │
│ 10 ┆ 1.9247 ┆ 0.056472 ┆ 2.6605 ┆ 0.119417 ┆ 2.02775 ┆ 0.05795 ┆ 38.23 ┆ 5.35 │
│ 11 ┆ 1.8923 ┆ 0.038222 ┆ 2.8113 ┆ 0.120623 ┆ 2.089 ┆ 0.025259 ┆ 48.57 ┆ 10.39 │
│ 12 ┆ 1.9444 ┆ 0.034856 ┆ 2.6675 ┆ 0.219585 ┆ 2.1035 ┆ 0.076514 ┆ 37.19 ┆ 8.18 │
│ 13 ┆ 1.7107 ┆ 0.04874 ┆ 3.4443 ┆ 0.154481 ┆ 1.8275 ┆ 0.036665 ┆ 101.34 ┆ 6.83 │
│ 14 ┆ 1.5565 ┆ 0.056595 ┆ 2.8241 ┆ 0.158643 ┆ 1.5515 ┆ 0.040813 ┆ 81.44 ┆ -0.32 │
│ 15 ┆ 1.4932 ┆ 0.085256 ┆ 2.6841 ┆ 0.135623 ┆ 1.50475 ┆ 0.028336 ┆ 79.75 ┆ 0.77 │
│ 16 ┆ 1.4263 ┆ 0.067666 ┆ 2.3971 ┆ 0.145928 ┆ 1.414 ┆ 0.061422 ┆ 68.06 ┆ -0.86 │
└─────┴───────────┴──────────┴─────────┴──────────┴─────────┴──────────┴──────────┴──────────┘

On 9/17/24 22:24, Christian Loehle wrote:
On 8/30/24 14:03, Vincent Guittot wrote:
Keep looking for an energy efficient CPU even when the system is
overutilized and use the CPU returned by feec() if it has been able to find
one. Otherwise fallback to the default performance and spread mode of the
scheduler.
A system can become overutilized for a short time when workers of a
workqueue wake up for a short background work like a vmstat update.
Continuing to look for an energy-efficient CPU will prevent breaking the
power packing of tasks.

Signed-off-by: Vincent Guittot <vincent.guittot@xxxxxxxxxx>
---
kernel/sched/fair.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 2273eecf6086..e46af2416159 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8505,7 +8505,7 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int wake_flags)
cpumask_test_cpu(cpu, p->cpus_ptr))
return cpu;

- if (!is_rd_overutilized(this_rq()->rd)) {
+ if (sched_energy_enabled()) {
new_cpu = find_energy_efficient_cpu(p, prev_cpu);
if (new_cpu >= 0)
return new_cpu;

Super quick testing on pixel6:
for i in $(seq 0 6); do /data/local/tmp/hackbench -l 500 -g 100 | grep Time; sleep 60; done
with patch 5/5 only:
Time: 19.433
Time: 19.657
Time: 19.851
Time: 19.789
Time: 19.857
Time: 20.092
Time: 19.973

mainline:
Time: 18.836
Time: 18.718
Time: 18.781
Time: 19.015
Time: 19.061
Time: 18.950
Time: 19.166


The reason we didn't always have this enabled is the belief that
this costs us too much performance in the scenarios where we most need it,
while at best making subpar EAS decisions anyway (in an
overutilized state).
I'd be open to questioning that, but why the change of mind?
And why is this necessary in your series if the EAS selection
isn't 'final' (until the next sleep) anymore (Patch 5/5)?