Re: [REGRESSION] sched/deadline: Hard lockup during CPU offline after commit 14a857056466

From: Thorsten Leemhuis

Date: Wed May 27 2026 - 04:25:05 EST


On 5/18/26 09:43, juri.lelli@xxxxxxxxxx wrote:
> On 16/05/26 03:07, batcain wrote:
>> [1.] One line summary of the problem: sched/deadline: Hard lockup
>> during CPU offline/migration due to frozen rq_clock loop in
>> update_dl_revised_wakeup()
>> [...]
>> However, under the stop_machine() noirq phase, the runqueue clock is
>> stale/frozen. Since the clock does not progress across iterations
>> within the enqueue loop, the mathematical state stalls. Consequently,
>> dl_entity_overflow() continuously evaluates to true, trapping the
>> processor core in an infinite loop inside the enqueue path, resulting
>> in a system-wide hard lockup.
>
> I cannot immediately see how this issue can affect dl-server(s), as they
> cannot migrate and are de-activated on CPUs going offline.
>
> [...]
>> [8.] Environment description (Hardware, distribution, etc.): Hardware:
>> Confirmed on both AMD Zen 2 (Renoir) and AMD Zen 4 (Phoenix)
>> platforms. Distribution: Arch Linux (using official
>> extra/linux-hardened kernel package).
>
> Also cannot reproduce at my end.

So how do we move on here?

Side note: linux-hardened uses hardening patches, which raises the
question of whether those are the problem. Batcain, did you do the
bisection with a vanilla kernel?

Another side note: a fix for the patch in question was posted; I
wonder if it might be related (reminder: not my area of expertise, so I
might be misleading everyone here by mentioning it):

sched/deadline: Use revised wakeup rule only for running dl_server
https://lore.kernel.org/lkml/20260522125833.264145-1-gmonaco@xxxxxxxxxx/

Ciao, Thorsten