Re: [REGRESSION] Re: [PATCH 00/24] Complete EEVDF

From: Peter Zijlstra
Date: Mon Jan 06 2025 - 06:57:46 EST


On Sun, Dec 29, 2024 at 02:51:43PM -0800, Doug Smythies wrote:
> Hi Peter,
>
> I have been having trouble with turbostat reporting processor package power levels that cannot possibly be true.
> After eliminating the turbostat program itself as the source of the issue I bisected the kernel.
> An edited summary (actual log attached):
>
> 82e9d0456e06 sched/fair: Avoid re-setting virtual deadline on 'migrations'
> b10 bad fc1892becd56 sched/eevdf: Fixup PELT vs DELAYED_DEQUEUE
> b13 bad 54a58a787791 sched/fair: Implement DELAY_ZERO
> skip 152e11f6df29 sched/fair: Implement delayed dequeue
> skip e1459a50ba31 sched: Teach dequeue_task() about special task states
> skip a1c446611e31 sched,freezer: Mark TASK_FROZEN special
> skip 781773e3b680 sched/fair: Implement ENQUEUE_DELAYED
> skip f12e148892ed sched/fair: Prepare pick_next_task() for delayed dequeue
> skip 2e0199df252a sched/fair: Prepare exit/cleanup paths for delayed_dequeue
> b12 good e28b5f8bda01 sched/fair: Assert {set_next,put_prev}_entity() are properly balanced
> dfa0a574cbc4 sched/uclamg: Handle delayed dequeue
> b11 good abc158c82ae5 sched: Prepare generic code for delayed dequeue
> e8901061ca0c sched: Split DEQUEUE_SLEEP from deactivate_task()
>
> Where "bN" is just my assigned kernel name for each bisection step.
>
> In the linux-kernel email archives I found a thread that isolated these same commits.
> It was from late November / early December:
>
> https://lore.kernel.org/all/20240727105030.226163742@xxxxxxxxxxxxx/T/#m9aeb4d897e029cf7546513bb09499c320457c174
>
> An example of the turbostat manifestation of the issue:
>
> doug@s19:~$ sudo ~/kernel/linux/tools/power/x86/turbostat/turbostat --quiet --Summary --show Busy%,Bzy_MHz,IRQ,PkgWatt,PkgTmp,TSC_MHz --interval 1
> [sudo] password for doug:
> Busy% Bzy_MHz TSC_MHz IRQ PkgTmp PkgWatt
> 99.76 4800 4104 12304 73 80.08
> 99.76 4800 4104 12047 73 80.23
> 99.76 4800 879 12157 73 11.40
> 99.76 4800 26667 84214 72 557.23
> 99.76 4800 4104 12036 72 79.39
>
> Where TSC_MHz was reported as 879, there was a big gap in time.
> Like 4.7 seconds instead of 1.
> Where TSC_MHz was reported as 26667, there was not a big gap in time.
>
> It happens for about 5% of the samples, plus or minus a lot.
> It only happens when the workload is almost exactly 100%.
> More load, it doesn't occur.
> Less load, it doesn't occur. Although, I did get this once:
>
> Busy% Bzy_MHz TSC_MHz IRQ PkgTmp PkgWatt
> 91.46 4800 4104 11348 73 103.98
> 91.46 4800 4104 11353 73 103.89
> 91.50 4800 3903 11339 73 98.16
> 91.43 4800 4271 12001 73 108.52
> 91.45 4800 4148 11481 73 105.13
> 91.46 4800 4104 11341 73 103.96
> 91.46 4800 4104 11348 73 103.99
>
> So, it might just be much less probable and less severe.
>
> It happens over many different types of workload that I have tried.

In private email you've communicated it happens due to
sched_setaffinity() sometimes taking multiple seconds.
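Something like the following untested sketch could time those calls from userspace: `taskset -p` ends up invoking sched_setaffinity() on the target task, so a multi-second stall should show up as an outlier (assumes util-linux `taskset` and GNU `date`; the 10-iteration count is arbitrary):

```shell
# Untested sketch: time repeated sched_setaffinity() calls.
# `taskset -cp` sets the affinity of an existing task via
# sched_setaffinity(), so any multi-second stall shows up here.
mask="0-$(( $(nproc) - 1 ))"
for i in $(seq 1 10); do
    t0=$(date +%s%N)                     # ns timestamp before the call
    taskset -cp "$mask" $$ > /dev/null   # re-set affinity of this shell
    t1=$(date +%s%N)
    echo "call $i: $(( (t1 - t0) / 1000000 )) ms"
done
```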

I'm trying to reproduce by starting a bash 'while :; do :; done' spinner
for each CPU, but so far am not able to reproduce.

What is the easiest 100% load you're seeing this with?
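The per-CPU spinner load described above can be sketched as follows (the `timeout 2` bound is added only so this example terminates on its own; an actual reproduction run would omit it and leave the loops spinning while turbostat samples):

```shell
# One busy-loop spinner per CPU, as described above.
# `timeout 2` is only here so the sketch exits on its own;
# a real reproduction run would leave the loops running.
nr=$(nproc)
for (( i = 0; i < nr; i++ )); do
    timeout 2 bash -c 'while :; do :; done' &
done
wait    # returns once the bounded spinners expire
echo "ran $nr spinners"
```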