Re: [PATCH] sched/deadline: Fix stale dl_defer_running in dl_server else-branch

From: John Stultz

Date: Thu Apr 02 2026 - 21:32:26 EST

On Thu, Apr 2, 2026 at 5:05 PM John Stultz <jstultz@xxxxxxxxxx> wrote:
> On Thu, Apr 2, 2026 at 6:30 AM <soolaugust@xxxxxxxxx> wrote:
> >
> > From: Zhidao Su <suzhidao@xxxxxxxxxx>
> >
> > Peter's fix (115135422562) cleared dl_defer_running in the if-branch of
> > update_dl_entity() (deadline expired/overflow). This ensures
> > replenish_dl_new_period() always arms the zero-laxity timer. However,
> > with PROXY_WAKING, re-activation hits the else-branch (same-period,
> > deadline not expired), where dl_defer_running from a prior starvation
> > episode can be stale.
> >
> > During PROXY_WAKING CPU return-migration, proxy_force_return() migrates
> > the task to a new CPU via deactivate_task()+attach_one_task(). The
> > enqueue path on the new CPU triggers enqueue_task_fair() which calls
> > dl_server_start() for the fair_server. Crucially, this re-activation
> > does NOT call dl_server_stop() first, so dl_defer_running retains its
> > prior value. If a prior starvation episode left dl_defer_running=1,
> > and the server is re-activated within the same period:
> >
> >   [4] D->A: dl_server_stop() clears flags but may be skipped when
> >             dl_server_active=0 (server was already stopped before
> >             return-migration triggered dl_server_start())
> >   [1] A->B: dl_server_start() -> enqueue_dl_entity(WAKEUP)
> >             -> update_dl_entity() enters else-branch
> >             -> 'if (!dl_defer_running)' guard fires, skips
> >                dl_defer_armed=1 / dl_throttled=1
> >             -> server enqueued into [D] state directly
> >             -> update_curr_dl_se() consumes runtime
> >             -> start_dl_timer() with dl_defer_armed=0 (slow path)
> >             -> boot time increases ~72%
> >
> > Fix: in the else-branch, unconditionally clear dl_defer_running and always
> > set dl_defer_armed=1 / dl_throttled=1. This ensures every same-period
> > re-activation properly re-arms the zero-laxity timer, regardless of whether
> > a prior starvation episode had set dl_defer_running.
> >
> > The if-branch (deadline expired) is left untouched:
> > replenish_dl_new_period() contains its own guard ('if (!dl_defer_running)')
> > that arms the zero-laxity timer only when dl_defer_running=0. With
> > PROXY_WAKING, dl_defer_running=1 in the deadline-expired path means a
> > genuine starvation episode is ongoing, so the server can skip the
> > zero-laxity wait and enter [D] directly. Clearing dl_defer_running here
> > (as Peter's fix did) forces every PROXY_WAKING deadline-expired
> > re-activation through the ~950ms zero-laxity wait.
> >
> > Measured boot time to first ksched_football event (4 CPUs, 4G):
> >   This fix:                             ~15-20s
> >   Without fix (stale dl_defer_running): ~43-62s (+72-200%)
> >
> > Note: Andrea Righi's v2 patch addresses the same symptom by clearing
> > dl_defer_running in dl_server_stop(). However, dl_server_stop() is not
> > called during PROXY_WAKING return-migration (proxy_force_return() calls
> > dl_server_start() directly without dl_server_stop()). This fix targets
> > the correct location: the else-branch of update_dl_entity().
> >
> > Signed-off-by: Zhidao Su <suzhidao@xxxxxxxxxx>
>
> Oh, this is perfect! I've noticed the performance regression
> previously and narrowed it down to commit 115135422562
> ("sched/deadline: Fix 'stuck' dl_server"), but I hadn't quite gotten
> my head around the issue. In testing, your patch seems to resolve the
> regression as well as the revert I was doing previously.

Oh drat, unfortunately I was testing without the ksched_football test
applied, and with it applied this change doesn't resolve the issue
(the ksched_football test seems to stop making progress on boot,
apparently hanging the system).

So this doesn't seem to be sufficient. I'll keep working to understand
the issue, and will try your hint about calling dl_server_stop(),
maybe in the return-migration path.
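
For anyone following along, here is a toy userspace model of the
else-branch flag handling as I read it from the patch description
above. This is only a sketch and not the kernel code: the struct and
function names (toy_dl_server, toy_else_branch_guarded,
toy_else_branch_unconditional) are made up for illustration, and the
only things taken from the description are the three flag names, the
'if (!dl_defer_running)' guard, and the proposed unconditional variant.

```c
#include <stdbool.h>

/*
 * Hypothetical stand-in for the three flags discussed in this thread.
 * This is NOT the kernel's struct sched_dl_entity.
 */
struct toy_dl_server {
	bool dl_defer_running;
	bool dl_defer_armed;
	bool dl_throttled;
};

/*
 * Same-period re-activation with the guard described in the patch
 * text: the zero-laxity timer is only re-armed when dl_defer_running
 * is clear, so a stale dl_defer_running=1 skips the arming.
 */
static void toy_else_branch_guarded(struct toy_dl_server *s)
{
	if (!s->dl_defer_running) {
		s->dl_defer_armed = true;
		s->dl_throttled = true;
	}
}

/*
 * Same-period re-activation as proposed by the patch: drop any stale
 * dl_defer_running and always arm/throttle.
 */
static void toy_else_branch_unconditional(struct toy_dl_server *s)
{
	s->dl_defer_running = false;
	s->dl_defer_armed = true;
	s->dl_throttled = true;
}
```

Starting from dl_defer_running=1 with the other two flags clear, the
guarded variant leaves the model unarmed and unthrottled, while the
unconditional variant ends with dl_defer_armed=1 / dl_throttled=1 and
dl_defer_running=0. It says nothing about why the hang still shows up
with ksched_football applied.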

thanks
-john