Re: [PATCH] sched/deadline: Fix stale dl_defer_running in update_dl_entity() if-branch
From: Peter Zijlstra
Date: Sat Apr 04 2026 - 06:23:02 EST

On Sat, Apr 04, 2026 at 12:46:10AM +0200, Peter Zijlstra wrote:
> On Fri, Apr 03, 2026 at 12:31:19PM -0700, John Stultz wrote:
>
> > Using an 8 cpu VM with CONFIG_SCHED_PROXY_EXEC disabled:
> >
> > With commit 115135422562 ("sched/deadline: Fix 'stuck' dl_server")
> > reverted, I see the (expected, maybe) behavior where the starvation
> > lasts ~1second, then dl_server allows all the threads to spawn right
> > away, and then the test runs for 10 seconds.
> >
> > See perfetto chart:
> > https://ui.perfetto.dev/#!/?s=a729fd2dd4b224d6335c5b2e727dc1a1c302c11a
> > (click the Kernel-threads track and scroll down to see the test
> > threads named referee/defense/offense/crazy-fan)
> >
> > With commit 115135422562 ("sched/deadline: Fix 'stuck' dl_server")
> > applied, it seems the dl_server boosting the kthreadd spawning is much
> > more staggered. Again we spin up NR_CPU low priority threads, and
> > there's ~1second of starvation, then we spawn one of the mid threads,
> > and another second delay, then there's a two second delay before we
> > get the third running, then we get a small burst of 5 threads at once,
> > then it falls back to 1 second or more per thread as it spawns off the
> > rest. All in all it takes ~44 seconds just to spawn the threads before
> > running the test.
> >
> > Perfetto chart:
> > https://ui.perfetto.dev/#!/?s=ab8e487375d0c82ceea478ee4534a7189269c0d4
> >
> > With higher cpu counts (64), the test effectively prevents the system
> > from booting (trips the hung task watchdog).
> >
> > I haven't really diagnosed the issue, but it feels a little like the
> > dl_server is boosting until the fair rq is empty but then giving up
> > the rest of its time, so if a fair task runs repeatedly but for a very
> > short period of time, it won't get to run again until the next
> > dl_server period? Causing this rate-limiting one-task-per-second
> > effect for thread spawning? I still need to stare at the dl_server
> > logic some more.
>
> I'm getting a sense of deja-vu here. Didn't we cure this once before?
>
> I'll go stare at this somewhere next week I suppose -- we have a long
> weekend here.

Random brain wave...

Since the dl_server is LLF (deferred), it will pretty much always trip
the dl_entity_overflow() when interrupted, right? Does it make sense to
use the revised wake-up rule for it, when appropriate?
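
To make the arithmetic concrete, here is a rough user-space model of the
two rules. It is only a sketch: struct dl_model, model_overflow() and
model_revised_wakeup() are made-up names, and it uses plain 64-bit
nanosecond math where the kernel shifts by DL_SCALE and uses the
fixed-point dl_density.

```c
/* User-space model only; not kernel code. All names are hypothetical. */
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NSEC_PER_MSEC 1000000ULL

/* Trimmed-down stand-in for struct sched_dl_entity. */
struct dl_model {
	uint64_t dl_runtime;	/* configured runtime per period, ns */
	uint64_t dl_deadline;	/* configured relative deadline, ns */
	uint64_t runtime;	/* remaining runtime, ns */
	uint64_t deadline;	/* absolute deadline, ns */
};

/*
 * Wake-up check: true when
 *   runtime / (deadline - now) > dl_runtime / dl_deadline
 * written cross-multiplied to avoid the division.
 * The caller must ensure deadline >= now.
 */
static bool model_overflow(const struct dl_model *se, uint64_t now)
{
	uint64_t left = se->dl_deadline * se->runtime;
	uint64_t right = (se->deadline - now) * se->dl_runtime;

	return right < left;
}

/*
 * Revised wake-up rule: keep the current deadline and trim the
 * remaining runtime to (dl_runtime / dl_deadline) * laxity.
 */
static void model_revised_wakeup(struct dl_model *se, uint64_t now)
{
	uint64_t laxity;

	assert(se->deadline >= now);
	laxity = se->deadline - now;
	se->runtime = se->dl_runtime * laxity / se->dl_deadline;
}
```

Take a 50ms/1000ms server with its deadline at t=1000ms that started
running at its zero-laxity point (t=950ms), got interrupted, and is
woken again at t=970ms with 40ms of runtime left. The laxity is 30ms, so
model_overflow() is true (40/30 is far above 50/1000), and
model_revised_wakeup() trims the runtime to 1.5ms while keeping the
deadline. Because a deferred server only starts running when its
remaining runtime roughly equals the time left to the deadline, the
check is expected to trip on nearly every such wake-up.
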
---
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index d08b00429323..674de6a48551 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -1027,7 +1027,7 @@ static void update_dl_entity(struct sched_dl_entity *dl_se)
 	if (dl_time_before(dl_se->deadline, rq_clock(rq)) ||
 	    dl_entity_overflow(dl_se, rq_clock(rq))) {
 
-		if (unlikely(!dl_is_implicit(dl_se) &&
+		if (unlikely((!dl_is_implicit(dl_se) || dl_se->dl_defer) &&
 			     !dl_time_before(dl_se->deadline, rq_clock(rq)) &&
 			     !is_dl_boosted(dl_se))) {
 			update_dl_revised_wakeup(dl_se, rq);