RE: [REGRESSION] Re: [PATCH 00/24] Complete EEVDF

From: Doug Smythies
Date: Fri Jan 10 2025 - 00:09:39 EST


Hi Peter,

Thanks for all your hard work on this.

On 2025.01.09 03:00 Peter Zijlstra wrote:

...

> This made me have a very hard look at reweight_entity(), and
> specifically the ->on_rq case, which is more prominent with
> DELAY_DEQUEUE.
>
> And indeed, it is all sorts of broken. While the computation of the new
> lag is correct, the computation for the new vruntime, using the new lag
> is broken for it does not consider the logic set out in place_entity().
>
> With the below patch, I now see things like:
>
> migration/12-55 [012] d..3. 309.006650: reweight_entity: (ffff8881e0e6f600-ffff88885f235f40-12)
> { weight: 977582 avg_vruntime: 4860513347366 vruntime: 4860513347908 (-542) deadline: 4860516552475 } ->
> { weight: 2 avg_vruntime: 4860528915984 vruntime: 4860793840706 (-264924722) deadline: 6427157349203 }
> migration/14-62 [014] d..3. 309.006698: reweight_entity: (ffff8881e0e6cc00-ffff88885f3b5f40-15)
> { weight: 2 avg_vruntime: 4874472992283 vruntime: 4939833828823 (-65360836540) deadline: 6316614641111 } ->
> { weight: 967149 avg_vruntime: 4874217684324 vruntime: 4874217688559 (-4235) deadline: 4874220535650 }
>
> Which isn't perfect yet, but much closer.

Agreed.
I tested the patch. Attached is a repeat of a graph I had sent before,
with a different y-axis scale and the old data deleted.
It still compares against the "b12" kernel (the last good one in the kernel bisection).
The test ran for 2 hours and 31 minutes, and the maximum CPU migration time was 24 milliseconds,
versus 6 seconds without the patch.
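
As a cross-check on the trace quoted above, here is a minimal sketch that
recomputes the numbers in parentheses. It assumes the parenthesized value is
the lag, taken as avg_vruntime - vruntime, and it also compares weight * lag
before and after each reweight, on the assumption that this product is what
a reweight is meant to preserve. The tuple layout and the names are mine, not
from the patch; the numbers are copied from the trace.

```python
# (weight, avg_vruntime, vruntime, printed_lag), copied from the quoted trace.
TRACE = [
    # migration/12-55: before -> after
    ((977582, 4860513347366, 4860513347908, -542),
     (2, 4860528915984, 4860793840706, -264924722)),
    # migration/14-62: before -> after
    ((2, 4874472992283, 4939833828823, -65360836540),
     (967149, 4874217684324, 4874217688559, -4235)),
]


def lag(avg_vruntime, vruntime):
    """Lag as assumed here: positive when the entity is behind the average."""
    return avg_vruntime - vruntime


# Every printed lag should match avg_vruntime - vruntime.
for before, after in TRACE:
    for weight, avg, vrt, printed in (before, after):
        assert lag(avg, vrt) == printed

# weight * lag on each side of the reweight.
products = [(b[0] * b[3], a[0] * a[3]) for b, a in TRACE]
for wl_before, wl_after in products:
    print(f"weight*lag: {wl_before} -> {wl_after}")
```

All four printed lags match. For the migration/12-55 event the product is
-529849444 on both sides; for the migration/14-62 event the two products
differ.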

I left things running for many hours and will let it continue overnight.
There seems to have been an issue at one spot in time:

usec    Time_Of_Day_Seconds  CPU  Busy%  IRQ
488994  1736476550.732222    -    99.76  12889
488520  1736476550.732222    11   99.76  1012
960999  1736476552.694222    -    99.76  17922
960587  1736476552.694222    11   99.76  1493
914999  1736476554.610222    -    99.76  23579
914597  1736476554.610222    11   99.76  1962
809999  1736476556.421222    -    99.76  23134
809598  1736476556.421222    11   99.76  1917
770998  1736476558.193221    -    99.76  21757
770603  1736476558.193221    11   99.76  1811
726999  1736476559.921222    -    99.76  21294
726600  1736476559.921222    11   99.76  1772
686998  1736476561.609221    -    99.76  20801
686600  1736476561.609221    11   99.76  1731
650998  1736476563.261221    -    99.76  20280
650601  1736476563.261221    11   99.76  1688
610998  1736476564.873221    -    99.76  19857
610606  1736476564.873221    11   99.76  1653
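
For pulling bursts like this out of a long log, a minimal sketch follows. It
assumes the five-column layout shown above (usec, Time_Of_Day_Seconds, CPU,
Busy%, IRQ), that rows with "-" in the CPU column are summary rows, and that
usec is the time taken to collect that row. The function name and the
100 ms default threshold are hypothetical, chosen only for illustration.

```python
# First rows of the sample above, used as stand-in input.
SAMPLE = """\
usec Time_Of_Day_Seconds CPU Busy% IRQ
488994 1736476550.732222 - 99.76 12889
488520 1736476550.732222 11 99.76 1012
960999 1736476552.694222 - 99.76 17922
960587 1736476552.694222 11 99.76 1493
"""


def slow_samples(text, threshold_usec=100_000):
    """Return (time_of_day, cpu, usec) for per-CPU rows above the threshold.

    time_of_day is kept as a string so the microsecond digits are not
    disturbed by float rounding.
    """
    hits = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 5 or not fields[0].isdigit():
            continue  # header line or anything that is not a data row
        usec, tod, cpu = int(fields[0]), fields[1], fields[2]
        if cpu == "-":
            continue  # summary row, not tied to a single CPU
        if usec > threshold_usec:
            hits.append((tod, int(cpu), usec))
    return hits


for tod, cpu, usec in slow_samples(SAMPLE):
    print(f"{tod} cpu{cpu}: {usec / 1000:.1f} ms")
```

Splitting on whitespace means the same function works whether or not the
columns are padded for alignment.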

I had one of these the other day also, but they were all 6 seconds.
It's like a burst of problematic data. I have the data somewhere,
and can try to find it tomorrow.

>
> Fixes: eab03c23c2a1 ("sched/eevdf: Fix vruntime adjustment on reweight")
> Signed-off-by: Peter Zijlstra (Intel) <peterz@xxxxxxxxxxxxx>

...

Attachment: turbostat-sampling-issue-fixed-seconds.png
Description: PNG image