Re: [RFC PATCH 4/5] sched/fair: Use EAS also when overutilized
From: Vincent Guittot
Date: Wed Oct 09 2024 - 04:57:43 EST
Hi Pierre,
On Mon, 7 Oct 2024 at 09:03, Pierre Gondois <pierre.gondois@xxxxxxx> wrote:
>
> Hello Vincent,
>
> Sorry for the delay:
>
> On 9/25/24 15:28, Vincent Guittot wrote:
> > On Thu, 19 Sept 2024 at 10:26, Pierre Gondois <pierre.gondois@xxxxxxx> wrote:
> >>
> >> Hello Vincent,
> >> I tried this patch on a Pixel 6 (8 CPUs, 4 little, 2 mid, 2 big)
> >> with patches 1-4/5 using these workloads:
> >> ---
> >> A.
> >> a. 8 tasks at 2%/5%/10% during 1s
> >> b. 1 task:
> >> - sleeping during 0.3s
> >> - at 100% during 0.3s
> >> - sleeping during 0.3s
> >>
> >> b. is used to reach the overutilized state during a limited amount of time.
> >> EAS is then operating, then the load balancer does the task placement, then EAS
> >> is operating again.
> >>
> >> B.
> >> a. 8 tasks at 2%/5%/10% during 1s
> >> b. 1 task:
> >> - at 100% during 1s
> >>
> >> ---
> >> I'm seeing the energy consumption increase in some cases. This seems to be
> >> due to feec() migrating tasks more often than what the load balancer does
> >> for this workload. This leads to utilization 'spikes' and then frequency
> >> 'spikes', increasing the overall energy consumption.
> >> This is not entirely related to this patch though,, as the task placement seems
> >> correct. I.e. feec() effectively does an optimal placement given the EM and
> >> task utilization. The task placement is just a bit less stable than with
> >> the load balancer.
> >
> > Would patch 5 help to keep things better placed ? in particular if
> > task b is misplaced at some point because of load balance ?
>
> I assume so, it would require more testing on my side
>
> >
> > I agree that load balance might still contribute to migrate task in a
> > not energy efficient way
> >
> >>
> >> ---
> >> Regarding hackbench, I've reproduced the test you've run on the same Pixel6.
> >> I have CONFIG_SCHED_CLUSTER enabled, which allows having a sched domain for
> >> each little/mid/big CPUs (without the config, these group would no exist).
> >
> > Why did you do this ? All cpus are expected to be in same sched domain
> > as they share their LLC
>
> I did this to observe the loa balancer a bit more carefully while reviewing
> the first patch:
> sched/fair: Filter false overloaded_group case for EAS
> I've let this configuration, but effectively this should not bring anything more.
>
>
> >
> >>
> >> I see an important regression in the result.
> >> I replaced the condition to run feec() through select_task_rq_fair() by:
> >> if (get_rd_overloaded(cpu_rq(cpu)->rd) == 0)) {
> >
> > overloaded is enable when more than 1 task runs on a cpu whatever the
> > utilization
>
> Yes right, this idea has little sense.
>
> >
> >> new_cpu = find_energy_efficient_cpu(p, prev_cpu);
> >> ...
> >> }
> >> and obtained better results.
> >>
> >> Indeed, for such intensive workload:
> >> - EAS would not have any energy benefit, so better prioritize performance
> >> (as Christian mentioned)
> >> - EAS would not be able to find a fitting CPU, so running feec() should be
> >> avoided
> >> - as you mention in the commit message, shuffling tasks when one CPU becomes
> >> momentarily overutilized is inefficient energy-wise (even though I don't have
> >> the numbers, it should make sense).
> >> So detecting when the system is overloaded should be a better compromise I
> >> assume. The condition in sched_balance_find_src_group() to let the load balancer
> >> operate might also need to be updated.
> >>
> >> Note:
> >> - base: with patches 1-4/5
> >> - _ou: run feec() when not overutilized
> >> - _ol: run feec() when not overloaded
> >> - mean: hackbench execution time in s.
> >> - delta: negative is better. Value is in percentage.
> >
> > Could you share your command line ? As explained in the cover letter I
> > have seen some perf regressions but not in the range that you have
> > below
> >
> > What is your base ? tip/sched/core ?
>
> I am working on a Pixel6, with a branch based on v6.8 with some scheduler patches
> to be able to apply your patches cleanly.
TBH, I'm always cautious with those kind of frankenstein kernel
especially with all changes that have happened on the scheduler since
v6.8 compared to tip/sched/core
>
> The mapping id -> command line is as:
Thanks for the commands details, I'm going to have a look
> (1) hackbench -l 5120 -g 1
> (2) hackbench -l 1280 -g 4
> (3) hackbench -l 640 -g 8
> (4) hackbench -l 320 -g 16
>
> (5) hackbench -p -l 5120 -g 1
> (6) hackbench -p -l 1280 -g 4
> (7) hackbench -p -l 640 -g 8
> (8) hackbench -p -l 320 -g 16
>
> (9) hackbench -T -l 5120 -g 1
> (10) hackbench -T -l 1280 -g 4
> (11) hackbench -T -l 640 -g 8
> (12) hackbench -T -l 320 -g 16
>
> (13) hackbench -T -p -l 5120 -g 1
> (14) hackbench -T -p -l 1280 -g 4
> (15) hackbench -T -p -l 640 -g 8
> (16) hackbench -T -p -l 320 -g 16
>
>
> >
> >> ┌─────┬───────────┬──────────┬─────────┬──────────┬─────────┬──────────┬──────────┬──────────┐
> >> │ id ┆ mean_base ┆ std_base ┆ mean_ou ┆ std_ou ┆ mean_ol ┆ std_ol ┆ delta_ou ┆ delta_ol │
> >> ╞═════╪═══════════╪══════════╪═════════╪══════════╪═════════╪══════════╪══════════╪══════════╡
> >> │ 1 ┆ 1.9786 ┆ 0.04719 ┆ 3.0856 ┆ 0.122209 ┆ 2.1734 ┆ 0.045203 ┆ 55.95 ┆ 9.85 │
I might have misunderstood your results above last time.
mean_base results include patches 1 to 4 and mean_ou revert patch 4.
Does it mean that it is 55% better with patch 4 ? I originally thought
there was a regression with patch 4 but I'm not sure that I understood
correctly after re reading the table.
> >> │ 2 ┆ 1.8991 ┆ 0.019768 ┆ 2.6672 ┆ 0.135266 ┆ 1.98875 ┆ 0.055132 ┆ 40.45 ┆ 4.72 │
> >> │ 3 ┆ 1.9053 ┆ 0.014795 ┆ 2.5761 ┆ 0.141693 ┆ 2.06425 ┆ 0.045901 ┆ 35.21 ┆ 8.34 │
> >> │ 4 ┆ 1.9586 ┆ 0.023439 ┆ 2.5823 ┆ 0.110399 ┆ 2.0955 ┆ 0.053818 ┆ 31.84 ┆ 6.99 │
> >> │ 5 ┆ 1.746 ┆ 0.055676 ┆ 3.3437 ┆ 0.279107 ┆ 1.88 ┆ 0.038184 ┆ 91.51 ┆ 7.67 │
> >> │ 6 ┆ 1.5476 ┆ 0.050131 ┆ 2.6835 ┆ 0.140497 ┆ 1.5645 ┆ 0.081644 ┆ 73.4 ┆ 1.09 │
> >> │ 7 ┆ 1.4562 ┆ 0.062457 ┆ 2.3568 ┆ 0.119213 ┆ 1.48425 ┆ 0.06212 ┆ 61.85 ┆ 1.93 │
> >> │ 8 ┆ 1.3554 ┆ 0.031757 ┆ 2.0609 ┆ 0.112869 ┆ 1.4085 ┆ 0.036601 ┆ 52.05 ┆ 3.92 │
> >> │ 9 ┆ 2.0391 ┆ 0.035732 ┆ 3.4045 ┆ 0.277307 ┆ 2.2155 ┆ 0.019053 ┆ 66.96 ┆ 8.65 │
> >> │ 10 ┆ 1.9247 ┆ 0.056472 ┆ 2.6605 ┆ 0.119417 ┆ 2.02775 ┆ 0.05795 ┆ 38.23 ┆ 5.35 │
> >> │ 11 ┆ 1.8923 ┆ 0.038222 ┆ 2.8113 ┆ 0.120623 ┆ 2.089 ┆ 0.025259 ┆ 48.57 ┆ 10.39 │
> >> │ 12 ┆ 1.9444 ┆ 0.034856 ┆ 2.6675 ┆ 0.219585 ┆ 2.1035 ┆ 0.076514 ┆ 37.19 ┆ 8.18 │
> >> │ 13 ┆ 1.7107 ┆ 0.04874 ┆ 3.4443 ┆ 0.154481 ┆ 1.8275 ┆ 0.036665 ┆ 101.34 ┆ 6.83 │
> >> │ 14 ┆ 1.5565 ┆ 0.056595 ┆ 2.8241 ┆ 0.158643 ┆ 1.5515 ┆ 0.040813 ┆ 81.44 ┆ -0.32 │
> >> │ 15 ┆ 1.4932 ┆ 0.085256 ┆ 2.6841 ┆ 0.135623 ┆ 1.50475 ┆ 0.028336 ┆ 79.75 ┆ 0.77 │
> >> │ 16 ┆ 1.4263 ┆ 0.067666 ┆ 2.3971 ┆ 0.145928 ┆ 1.414 ┆ 0.061422 ┆ 68.06 ┆ -0.86 │
> >> └─────┴───────────┴──────────┴─────────┴──────────┴─────────┴──────────┴──────────┴──────────┘
> >>
> >> On 9/17/24 22:24, Christian Loehle wrote:
> >>> On 8/30/24 14:03, Vincent Guittot wrote:
> >>>> Keep looking for an energy efficient CPU even when the system is
> >>>> overutilized and use the CPU returned by feec() if it has been able to find
> >>>> one. Otherwise fallback to the default performance and spread mode of the
> >>>> scheduler.
> >>>> A system can become overutilized for a short time when workers of a
> >>>> workqueue wake up for a short background work like vmstat update.
> >>>> Continuing to look for a energy efficient CPU will prevent to break the
> >>>> power packing of tasks.
> >>>>
> >>>> Signed-off-by: Vincent Guittot <vincent.guittot@xxxxxxxxxx>
> >>>> ---
> >>>> kernel/sched/fair.c | 2 +-
> >>>> 1 file changed, 1 insertion(+), 1 deletion(-)
> >>>>
> >>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> >>>> index 2273eecf6086..e46af2416159 100644
> >>>> --- a/kernel/sched/fair.c
> >>>> +++ b/kernel/sched/fair.c
> >>>> @@ -8505,7 +8505,7 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int wake_flags)
> >>>> cpumask_test_cpu(cpu, p->cpus_ptr))
> >>>> return cpu;
> >>>>
> >>>> - if (!is_rd_overutilized(this_rq()->rd)) {
> >>>> + if (sched_energy_enabled()) {
> >>>> new_cpu = find_energy_efficient_cpu(p, prev_cpu);
> >>>> if (new_cpu >= 0)
> >>>> return new_cpu;
> >>>
> >>> Super quick testing on pixel6:
> >>> for i in $(seq 0 6); do /data/local/tmp/hackbench -l 500 -g 100 | grep Time; sleep 60; done
> >>> with patch 5/5 only:
> >>> Time: 19.433
> >>> Time: 19.657
> >>> Time: 19.851
> >>> Time: 19.789
> >>> Time: 19.857
> >>> Time: 20.092
> >>> Time: 19.973
> >>>
> >>> mainline:
> >>> Time: 18.836
> >>> Time: 18.718
> >>> Time: 18.781
> >>> Time: 19.015
> >>> Time: 19.061
> >>> Time: 18.950
> >>> Time: 19.166
> >>>
> >>>
> >>> The reason we didn't always have this enabled is the belief that
> >>> this costs us too much performance in scenarios we most need it
> >>> while at best making subpar EAS decisions anyway (in an
> >>> overutilized state).
> >>> I'd be open for questioning that, but why the change of mind?
> >>> And why is this necessary in your series if the EAS selection
> >>> isn't 'final' (until the next sleep) anymore (Patch 5/5)?
> >>>