Re: [RFC][PATCH 08/10] sched/fair: Implement delayed dequeue

From: Luis Machado
Date: Wed Jun 12 2024 - 11:12:20 EST


On 6/5/24 10:42, Peter Zijlstra wrote:
> On Wed, Jun 05, 2024 at 10:14:47AM +0100, Luis Machado wrote:
>> ... thanks for the patch! The above seems to do it for me. I can see
>> more reasonable energy use with the eevdf-complete series. Still a
>> bit higher. Might be noise, we'll see.
>
> W00t!!!
>
> Let me write a decent Changelog and stuff it in the git tree along with
> all the other bits.
>
> Thanks for all the testing.

I've been doing some more testing of the eevdf-complete series on the
Pixel6/EAS platform. Hopefully these numbers prove useful.

The energy regression from the original delayed-dequeue patch seems to
be mostly fixed by your proposed patch. Energy readings for the big and
mid cores are mostly stable and comparable to stock eevdf (without
eevdf-complete).

The only difference I can spot now is in the energy use of the little
cores. Compared to stock eevdf, the delayed-dequeue code seems to make
their energy use a bit spiky: some runs show the expected level of
energy use, while others show 40%, 60% or even 90% more.

For instance, little-core energy use across five runs of each
configuration (perc_diff is relative to m6.6-eevdf-stock-1):


(1) m6.6-eevdf-stock*: stock eevdf runs
(2) m6.6-eevdf-complete-ndd-dz*: eevdf-complete + NO_DELAY_DEQUEUE + DELAY_ZERO
(3) m6.6-eevdf-complete-dd-dz*: eevdf-complete + DELAY_DEQUEUE + DELAY_ZERO

+------------+---------------------------------+-----------+
| channel | tag | perc_diff |
+------------+---------------------------------+-----------+
| CPU-Little | m6.6-eevdf-stock-1 | 0.0% |
| CPU-Little | m6.6-eevdf-stock-2 | -4.21% |
| CPU-Little | m6.6-eevdf-stock-3 | -7.86% |
| CPU-Little | m6.6-eevdf-stock-4 | -5.67% |
| CPU-Little | m6.6-eevdf-stock-5 | -6.61% |
| CPU-Little | m6.6-eevdf-complete-ndd-dz-1 | -2.21% |
| CPU-Little | m6.6-eevdf-complete-ndd-dz-2 | -9.99% |
| CPU-Little | m6.6-eevdf-complete-ndd-dz-3 | -6.1% |
| CPU-Little | m6.6-eevdf-complete-ndd-dz-4 | -5.66% |
| CPU-Little | m6.6-eevdf-complete-ndd-dz-5 | -7.12% |
| CPU-Little | m6.6-eevdf-complete-dd-dz-1 | 96.69% |
| CPU-Little | m6.6-eevdf-complete-dd-dz-2 | 22.1% |
| CPU-Little | m6.6-eevdf-complete-dd-dz-3 | 44.82% |
| CPU-Little | m6.6-eevdf-complete-dd-dz-4 | -0.23% |
| CPU-Little | m6.6-eevdf-complete-dd-dz-5 | 8.28% |
+------------+---------------------------------+-----------+
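For context, the only knob that differs between the ndd and dd variants
above is the DELAY_DEQUEUE scheduler feature (toggleable at runtime via
/sys/kernel/debug/sched/features). My rough understanding of what it
does -- and this is a simplified, userspace-only sketch, not the kernel
code -- is that a task which blocks while still carrying negative lag is
kept on the runqueue, marked delayed, until the lag is paid back, at
which point DELAY_ZERO clips whatever is left to zero:

/* toy_delay_dequeue.c - illustration only, not kernel code. */
#include <stdbool.h>
#include <stdio.h>

struct toy_se {
        long lag;               /* ideal service minus received service */
        bool on_rq;
        bool sched_delayed;
};

static const bool delay_dequeue = true; /* DELAY_DEQUEUE */
static const bool delay_zero = true;    /* DELAY_ZERO    */

/* Task blocks. With DELAY_DEQUEUE, a negative-lag task stays on the rq. */
static void toy_dequeue(struct toy_se *se)
{
        if (delay_dequeue && se->lag < 0) {
                se->sched_delayed = true;
                return;
        }
        se->on_rq = false;
}

/* As the rq runs other work, the delayed task's lag creeps back towards
 * zero; only then does the deferred dequeue actually happen. */
static void toy_tick(struct toy_se *se, long lag_gain)
{
        if (!se->sched_delayed)
                return;
        se->lag += lag_gain;
        if (se->lag >= 0) {
                if (delay_zero)
                        se->lag = 0;
                se->sched_delayed = false;
                se->on_rq = false;
        }
}

int main(void)
{
        struct toy_se se = { .lag = -250, .on_rq = true };

        toy_dequeue(&se);
        for (int i = 0; se.on_rq && i < 5; i++) {
                printf("tick %d: on_rq=%d delayed=%d lag=%ld\n",
                       i, se.on_rq, se.sched_delayed, se.lag);
                toy_tick(&se, 100);
        }
        printf("final:  on_rq=%d delayed=%d lag=%ld\n",
               se.on_rq, se.sched_delayed, se.lag);
        return 0;
}

The point being that, compared to NO_DELAY_DEQUEUE, the little CPUs see
tasks lingering on their runqueues for a while after they have actually
gone to sleep.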

Looking at what might explain the spiky behavior with DELAY_DEQUEUE, I
noticed the idle residency data (we have two idle states) shows similar
spikiness and offers some clues.

It looks like (1) and (2) enter idle states in a consistent manner and
make good use of the deeper idle state (idle 1), whereas (3) is more
erratic and more prone to pick the shallower state (idle 0) over the
deeper one.

+-------------------------------+---------+------------+-------+
| tag | cluster | idle_state | time |
+-------------------------------+---------+------------+-------+
| m6.6-eevdf-stock-1 | little | not idle | 63.49 |
| m6.6-eevdf-stock-1 | little | idle 0 | 30.66 |
| m6.6-eevdf-stock-1 | little | idle 1 | 12.15 |
| m6.6-eevdf-stock-2 | little | not idle | 62.6 |
| m6.6-eevdf-stock-2 | little | idle 0 | 31.13 |
| m6.6-eevdf-stock-2 | little | idle 1 | 14.56 |
| m6.6-eevdf-stock-3 | little | not idle | 63.98 |
| m6.6-eevdf-stock-3 | little | idle 0 | 31.54 |
| m6.6-eevdf-stock-3 | little | idle 1 | 15.91 |
| m6.6-eevdf-stock-4 | little | not idle | 64.18 |
| m6.6-eevdf-stock-4 | little | idle 0 | 31.32 |
| m6.6-eevdf-stock-4 | little | idle 1 | 15.83 |
| m6.6-eevdf-stock-5 | little | not idle | 63.32 |
| m6.6-eevdf-stock-5 | little | idle 0 | 30.4 |
| m6.6-eevdf-stock-5 | little | idle 1 | 14.33 |
| m6.6-eevdf-complete-ndd-dz-1 | little | not idle | 62.62 |
| m6.6-eevdf-complete-ndd-dz-1 | little | idle 0 | 29.48 |
| m6.6-eevdf-complete-ndd-dz-1 | little | idle 1 | 13.19 |
| m6.6-eevdf-complete-ndd-dz-2 | little | not idle | 64.12 |
| m6.6-eevdf-complete-ndd-dz-2 | little | idle 0 | 27.62 |
| m6.6-eevdf-complete-ndd-dz-2 | little | idle 1 | 14.73 |
| m6.6-eevdf-complete-ndd-dz-3 | little | not idle | 62.86 |
| m6.6-eevdf-complete-ndd-dz-3 | little | idle 0 | 27.87 |
| m6.6-eevdf-complete-ndd-dz-3 | little | idle 1 | 14.97 |
| m6.6-eevdf-complete-ndd-dz-4 | little | not idle | 63.01 |
| m6.6-eevdf-complete-ndd-dz-4 | little | idle 0 | 28.2 |
| m6.6-eevdf-complete-ndd-dz-4 | little | idle 1 | 14.11 |
| m6.6-eevdf-complete-ndd-dz-5 | little | not idle | 62.1 |
| m6.6-eevdf-complete-ndd-dz-5 | little | idle 0 | 29.06 |
| m6.6-eevdf-complete-ndd-dz-5 | little | idle 1 | 14.73 |
| m6.6-eevdf-complete-dd-dz-1 | little | not idle | 46.18 |
| m6.6-eevdf-complete-dd-dz-1 | little | idle 0 | 53.78 |
| m6.6-eevdf-complete-dd-dz-1 | little | idle 1 | 3.75 |
| m6.6-eevdf-complete-dd-dz-2 | little | not idle | 57.64 |
| m6.6-eevdf-complete-dd-dz-2 | little | idle 0 | 40.47 |
| m6.6-eevdf-complete-dd-dz-2 | little | idle 1 | 7.39 |
| m6.6-eevdf-complete-dd-dz-3 | little | not idle | 43.14 |
| m6.6-eevdf-complete-dd-dz-3 | little | idle 0 | 57.73 |
| m6.6-eevdf-complete-dd-dz-3 | little | idle 1 | 3.65 |
| m6.6-eevdf-complete-dd-dz-4 | little | not idle | 58.97 |
| m6.6-eevdf-complete-dd-dz-4 | little | idle 0 | 36.4 |
| m6.6-eevdf-complete-dd-dz-4 | little | idle 1 | 9.42 |
| m6.6-eevdf-complete-dd-dz-5 | little | not idle | 55.85 |
| m6.6-eevdf-complete-dd-dz-5 | little | idle 0 | 36.75 |
| m6.6-eevdf-complete-dd-dz-5 | little | idle 1 | 13.14 |
+-------------------------------+---------+------------+-------+
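One hand-wavy way to read the residency numbers: the cpuidle governor
picks a state by (roughly) comparing the predicted idle duration against
each state's target residency, so if delayed-dequeue tasks chop the
little CPUs' idle time into shorter windows, the prediction rarely
clears the bar for idle 1 and we fall back to idle 0. A toy illustration
(made-up residency values, nothing like the real governor code):

/* toy_cpuidle.c - illustration only. Pick the deepest state whose
 * target residency fits in the predicted idle duration; otherwise
 * fall back to the shallowest state.
 */
#include <stdio.h>

struct toy_state {
        const char *name;
        unsigned int target_residency_us;       /* made-up numbers */
};

static const struct toy_state states[] = {
        { "idle 0", 100 },
        { "idle 1", 2000 },
};

static const struct toy_state *toy_select(unsigned int predicted_us)
{
        const struct toy_state *best = &states[0];

        for (unsigned int i = 0; i < sizeof(states) / sizeof(states[0]); i++) {
                if (states[i].target_residency_us <= predicted_us)
                        best = &states[i];
        }
        return best;
}

int main(void)
{
        /* A long, quiet idle window vs. one chopped up by a task that
         * is still hanging on the runqueue after blocking. */
        printf("predicted 5000us -> %s\n", toy_select(5000)->name);
        printf("predicted  300us -> %s\n", toy_select(300)->name);
        return 0;
}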

I can't draw a precise conclusion yet, but it might be down to delayed
util_est updates or to the additional time the delayed-dequeue tasks
spend on the runqueue. Either way, delayed dequeue does change the
energy behavior a bit on these heterogeneous platforms.
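
To make the util_est guess above a bit more concrete: as far as I
understand, the util_est bookkeeping (util_est_dequeue() /
util_est_update()) runs when the task is actually dequeued, so deferring
the dequeue also defers the point where the rq's estimate drops, and any
EAS placement or OPP decision taken in that window sees a CPU that still
looks busier than it is. Schematically (toy numbers, not the kernel's
data structures):

/* toy_util_est.c - illustration only. The rq-level utilization
 * estimate only drops when the task is actually dequeued, so a
 * deferred dequeue leaves a stale, higher estimate in the window
 * between the task blocking and the dequeue really happening.
 */
#include <stdio.h>

struct toy_rq {
        unsigned long util_est_enqueued;
};

static void toy_dequeue_task(struct toy_rq *rq, unsigned long task_util)
{
        /* Roughly where util_est_dequeue()/util_est_update() would run. */
        rq->util_est_enqueued -= task_util;
}

int main(void)
{
        struct toy_rq rq = { .util_est_enqueued = 180 };
        unsigned long task_util = 120;

        /* t0: task blocks, but with DELAY_DEQUEUE the dequeue is deferred.
         * Any decision taken here sees 180, not the 60 the CPU is really
         * left with. */
        printf("t0 (task blocked, dequeue deferred): rq util_est = %lu\n",
               rq.util_est_enqueued);

        /* t1: the deferred dequeue finally happens. */
        toy_dequeue_task(&rq, task_util);
        printf("t1 (dequeue actually done):          rq util_est = %lu\n",
               rq.util_est_enqueued);
        return 0;
}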