Re: [PATCH 00/24] Complete EEVDF

From: K Prateek Nayak
Date: Wed Nov 06 2024 - 01:20:32 EST


(+ Mike, Luis)

Hello Saravana, Sam, David,

On 11/6/2024 6:37 AM, Saravana Kannan wrote:
On Sat, Jul 27, 2024 at 3:27 AM Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote:

Hi all,

So after much delay this is hopefully the final version of the EEVDF patches.
They've been sitting in my git tree forever it seems, and people have been
testing them and sending fixes.

I've spent the last two days testing and fixing cfs-bandwidth, and as far
as I know that was the very last issue holding it back.

These patches apply on top of queue.git sched/dl-server, which I plan on merging
into tip/sched/core once -rc1 drops.

I'm hoping to then merge all this (+- the DVFS clock patch) right before -rc2.


Aside from a ton of bug fixes -- thanks all! -- new in this version is:

- split up the huge delay-dequeue patch
- tested/fixed cfs-bandwidth
- PLACE_REL_DEADLINE -- preserve the relative deadline when migrating
- SCHED_BATCH is equivalent to RESPECT_SLICE
- propagate min_slice up cgroups
- CLOCK_THREAD_DVFS_ID


Hi Peter,

TL;DR:
We run some basic sched/cpufreq behavior tests on a Pixel 6 for every
change we accept. Some of these changes are merges from Linus's tree.
We can see a very clear change in behavior with this patch series.
Based on what we are seeing, we'd expect this change in behavior to
cause a pretty serious power regression (7-25%), depending on what the
actual bug is and on the use case.

Do the regressions persist with NO_DELAY_DEQUEUE? You can disable the
DELAY_DEQUEUE feature added by the EEVDF complete series via debugfs by
doing:
# echo NO_DELAY_DEQUEUE > /sys/kernel/debug/sched/features
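
(Not required, just a cross-check: the same file can be read back, and
disabled features show up with a "NO_" prefix, e.g.

# tr ' ' '\n' < /sys/kernel/debug/sched/features | grep DELAY_DEQUEUE

should report NO_DELAY_DEQUEUE once the feature is turned off.)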

Since delayed entities are still on the runqueue, they can affect the
PELT calculation. Vincent and Dietmar have both noted this, and Peter
posted
https://lore.kernel.org/lkml/172595576232.2215.18027704125134691219.tip-bot2@tip-bot2/
in response, but it was pulled out since Luis reported observing
negative values for h_nr_delayed on his setup. A lot has been fixed
around delayed dequeue since then, and I wonder if now would be the
right time to re-attempt h_nr_delayed tracking.

There is also the fact that delayed entities don't update the tg
load_avg, since the delayed dequeue path calls update_load_avg() without
the UPDATE_TG flag set. This can again cause some changes in the PELT
calculation, since the averages are used to estimate the entity shares
when running with cgroups.


Intro:
We run these tests 20 times for every build (each build is a bunch of
changes). All the data below is from the 20+ builds before this series
and the 20 builds after this series (inclusive). So, all the "before"
numbers are from (20 x 20) 400+ runs and all the "after" numbers are
from another 400+ runs.

Test:
We create a synthetic "tiny" thread that runs for 3ms and sleeps for
10ms at Fmin. We let it run like this for several seconds to make sure
the util is low and all the "new thread" boost stuff isn't kicking in.
So, at this point, the CPU frequency is at Fmin. Then we let this
thread run continuously without sleeping and measure (using ftrace)
the time it takes for the CPU to get to Fmax.

We do this separately (fresh run) on the Pixel 6 with the cpu affinity
set to each cluster and once without any cpu affinity (thread starts
at little).
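
For illustration, here is a rough sketch of how one such run could be
scripted by hand with ftrace. This is not our actual harness; it
assumes tracefs is mounted at /sys/kernel/tracing, uses CPU0 as the
target, and leaves the 3ms-run/10ms-sleep settle phase to the workload
itself:

# cd /sys/kernel/tracing
# echo 0 > tracing_on
# echo > trace                                # drop any stale trace data
# echo 1 > events/power/cpu_frequency/enable  # log every frequency change
# echo 1 > tracing_on
# echo ramp-start > trace_marker              # mark when the busy phase begins
# taskset -c 0 sh -c 'while :; do :; done' &  # continuously running thread
# sleep 1; echo 0 > tracing_on; kill $!
# grep -E 'tracing_mark_write|cpu_frequency' trace

The ramp-up time is then the delta between the "ramp-start" marker and
the first power:cpu_frequency event reporting Fmax for the pinned CPU.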

Data:
All the values below are in milliseconds.

When the thread is not affined to any CPU: the thread starts on little,
ramps up to Fmax, migrates to middle, ramps up to Fmax, migrates to
big, ramps up to Fmax.
+-----------------+--------+-------+
| Data            | Before | After |
+-----------------+--------+-------+
| 5th percentile  |    169 |   151 |
| Median          |    221 |   177 |
| Mean            |    221 |   177 |
| 95th percentile |    249 |   200 |
+-----------------+--------+-------+

When thread is affined to the little cluster:
The average time to reach Fmax is 104 ms without this series and 66 ms
after this series. We didn't collect the individual per-run data; we
can if you really need it. We also noticed that the little cluster
wouldn't drop to Fmin (300 MHz) after this series even when the CPUs
are mostly idle. It was instead hovering at 738 MHz (Fmax is ~1800
MHz).

When thread is affined to the middle cluster:
+-----------------+--------+-------+
| Data            | Before | After |
+-----------------+--------+-------+
| 5th percentile  |     99 |    84 |
| Median          |    111 |   104 |
| Mean            |    111 |   104 |
| 95th percentile |    120 |   119 |
+-----------------+--------+-------+

When thread is affined to the big cluster:
+-----------------+--------+-------+
| Data            | Before | After |
+-----------------+--------+-------+
| 5th percentile  |    138 |    96 |
| Median          |    147 |   144 |
| Mean            |    145 |   134 |
| 95th percentile |    151 |   150 |
+-----------------+--------+-------+

As you can see, the ramp-up time has decreased noticeably. Also, as you
can tell from the 5th percentile numbers, the standard deviation has
increased a lot, causing a wider spread of the ramp-up times (more
noticeable in the middle and big clusters). So in general this looks
like it's going to increase the usage of the middle and big CPU
clusters and also shift the CPU frequency residency toward frequencies
that are 5 to 25% higher.

We already checked the rate_limit_us value: it is the same for both the
before/after cases and is set to 7.5 ms (jiffies is 4 ms in our case).
Also, based on my limited understanding, the DELAY_DEQUEUE stuff is
only relevant if there are multiple contending threads on a CPU. In
this case it's just one continuously running thread plus a kworker that
runs sporadically, less than 1% of the time.
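
(For reference, the value we are referring to is the per-policy
schedutil tunable, read as root with something like:

# cat /sys/devices/system/cpu/cpufreq/policy0/schedutil/rate_limit_us
7500

i.e. 7.5 ms, identical before and after the series.)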

There is an ongoing investigation into delayed entities possibly not
migrating if they are woken up before they are fully dequeued. Since you
mention there is only one task, this should not matter, but could you
also try out Mike's suggestion from
https://lore.kernel.org/lkml/1bffa5f2ca0fec8a00f84ffab86dc6e8408af31c.camel@xxxxxx/
and see if it makes a difference on your test suite?

--
Thanks and Regards,
Prateek


So, without a deeper understanding of this patch series, it's behaving
as if the PELT signal is accumulating faster than expected, which is a
bit surprising to me because AFAIK (which is not much) the EEVDF series
isn't supposed to change the PELT behavior.

If you want to get a visual idea of what the system is doing, here are
some perfetto links that visualize the traces. Hopefully you have
access permissions to these. You can use the W, S, A, D keys to pan
and zoom around the timeline.

Big Before:
https://ui.perfetto.dev/#!/?s=01aa3ad3a5ddd78f2948c86db4265ce2249da8aa
Big After:
https://ui.perfetto.dev/#!/?s=7729ee012f238e96cfa026459eac3f8c3e88d9a9

P.S. I only gave it a quick glance, but I do see the frequency ramping
up with larger deltas and reaching Fmax much more quickly in the case
of "Big After".

Thanks,
Saravana, Sam and David