Re: [PATCH] sched/deadline: Fix stale dl_defer_running in dl_server else-branch
From: John Stultz
Date: Thu Apr 02 2026 - 20:06:32 EST

On Thu, Apr 2, 2026 at 6:30 AM <soolaugust@xxxxxxxxx> wrote:
>
> From: Zhidao Su <suzhidao@xxxxxxxxxx>
>
> Peter's fix (115135422562) cleared dl_defer_running in the if-branch of
> update_dl_entity() (deadline expired/overflow). This ensures
> replenish_dl_new_period() always arms the zero-laxity timer. However,
> with PROXY_WAKING, re-activation hits the else-branch (same-period,
> deadline not expired), where dl_defer_running from a prior starvation
> episode can be stale.
>
> During PROXY_WAKING CPU return-migration, proxy_force_return() migrates
> the task to a new CPU via deactivate_task()+attach_one_task(). The
> enqueue path on the new CPU triggers enqueue_task_fair() which calls
> dl_server_start() for the fair_server. Crucially, this re-activation
> does NOT call dl_server_stop() first, so dl_defer_running retains its
> prior value. If a prior starvation episode left dl_defer_running=1,
> and the server is re-activated within the same period:
>
>   [4] D->A: dl_server_stop() clears flags but may be skipped when
>             dl_server_active=0 (server was already stopped before
>             return-migration triggered dl_server_start())
>   [1] A->B: dl_server_start() -> enqueue_dl_entity(WAKEUP)
>             -> update_dl_entity() enters else-branch
>             -> 'if (!dl_defer_running)' guard fires, skips
>                dl_defer_armed=1 / dl_throttled=1
>             -> server enqueued into [D] state directly
>             -> update_curr_dl_se() consumes runtime
>             -> start_dl_timer() with dl_defer_armed=0 (slow path)
>             -> boot time increases ~72%
>
> Fix: in the else-branch, unconditionally clear dl_defer_running and always
> set dl_defer_armed=1 / dl_throttled=1. This ensures every same-period
> re-activation properly re-arms the zero-laxity timer, regardless of whether
> a prior starvation episode had set dl_defer_running.
>
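For anyone following along, the else-branch behaviour described above
can be modelled with a small userspace toy. This is only a sketch under
stated assumptions: the names toy_dl_se, toy_else_branch_old() and
toy_else_branch_fixed() are made up for illustration, and this is not
the actual update_dl_entity() code. It only mirrors the three flags as
the changelog describes them.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the three sched_dl_entity flags discussed. */
struct toy_dl_se {
	bool dl_defer_running;
	bool dl_defer_armed;
	bool dl_throttled;
};

/*
 * Else-branch as the changelog describes it before the fix: arming the
 * zero-laxity timer is guarded by !dl_defer_running, so a stale
 * dl_defer_running=1 skips the arming entirely.
 */
static void toy_else_branch_old(struct toy_dl_se *se)
{
	if (!se->dl_defer_running) {
		se->dl_defer_armed = true;
		se->dl_throttled = true;
	}
}

/*
 * Else-branch as the changelog describes it after the fix: clear the
 * possibly stale flag and always arm.
 */
static void toy_else_branch_fixed(struct toy_dl_se *se)
{
	se->dl_defer_running = false;
	se->dl_defer_armed = true;
	se->dl_throttled = true;
}

/* Walk through the stale-flag case with both variants. */
static void toy_self_check(void)
{
	struct toy_dl_se stale = { .dl_defer_running = true };

	toy_else_branch_old(&stale);
	assert(!stale.dl_defer_armed && !stale.dl_throttled);

	toy_else_branch_fixed(&stale);
	assert(!stale.dl_defer_running);
	assert(stale.dl_defer_armed && stale.dl_throttled);
}
```

Calling toy_self_check() from a main() exercises the stale-flag case: the
old variant leaves the timer unarmed, the fixed variant arms it. The
model says nothing about timing or runqueue state.
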
> The if-branch (deadline expired) is left untouched:
> replenish_dl_new_period() contains its own guard ('if (!dl_defer_running)')
> that arms the zero-laxity timer only when dl_defer_running=0. With
> PROXY_WAKING, dl_defer_running=1 in the deadline-expired path means a
> genuine starvation episode is ongoing, so the server can skip the
> zero-laxity wait and enter [D] directly. Clearing dl_defer_running here
> (as Peter's fix did) forces every PROXY_WAKING deadline-expired
> re-activation through the ~950ms zero-laxity wait.
>
> Measured boot time to first ksched_football event (4 CPUs, 4G):
>   This fix:                              ~15-20s
>   Without fix (stale dl_defer_running):  ~43-62s (+72-200%)
>
> Note: Andrea Righi's v2 patch addresses the same symptom by clearing
> dl_defer_running in dl_server_stop(). However, dl_server_stop() is not
> called during PROXY_WAKING return-migration (proxy_force_return() calls
> dl_server_start() directly without dl_server_stop()). This fix targets
> the correct location: the else-branch of update_dl_entity().
>
> Signed-off-by: Zhidao Su <suzhidao@xxxxxxxxxx>

Oh, this is perfect! I had noticed this performance regression earlier
and narrowed it down to commit 115135422562 ("sched/deadline: Fix
'stuck' dl_server"), but I hadn't quite gotten my head around the
issue. In testing, your patch seems to resolve the regression just as
well as the revert I was using previously.

I've included your patch in the series I'm hoping to send out soon.

Thanks so much!
-john