Re: [PATCH] sched: Fix psi_dequeue for Proxy Execution

From: Johannes Weiner

Date: Fri Nov 21 2025 - 06:54:49 EST


On Tue, Nov 18, 2025 at 05:52:23AM +0000, John Stultz wrote:
> Currently, if the sleep flag is set, psi_dequeue() doesn't
> change any of the psi_flags.
>
> This is because psi_task_switch() will clear TSK_ONCPU as well
> as other potential flags (TSK_RUNNING), and the assumption is
> that a voluntary sleep always consists of a task being dequeued
> followed shortly thereafter by a psi_sched_switch() call.
>
> Proxy Execution changes this expectation, as mutex-blocked tasks
> that would normally sleep stay on the runqueue. But in the case
> where the mutex-owning task goes to sleep, or the owner is on a
> remote cpu, we will then deactivate the blocked task shortly
> after.
>
> In that situation, the mutex-blocked task will have had its
> TSK_ONCPU cleared when it was switched off the cpu, but it will
> stay TSK_RUNNING. Then if we later dequeue it (as currently done
> if we hit a case find_proxy_task() can't yet handle, such as the
> case of the owner being on another rq or a sleeping owner)
> psi_dequeue() won't change any state (leaving it TSK_RUNNING),
> as it incorrectly expects a psi_task_switch() call to
> immediately follow.
>
> Later on when the task gets woken/re-enqueued, and psi_flags are
> set for TSK_RUNNING, we hit an error as the task is already
> TSK_RUNNING:
> psi: inconsistent task state! task=188:kworker/28:0 cpu=28 psi_flags=4 clear=0 set=4
>
> To resolve this, extend the logic in psi_dequeue() so that
> if the sleep flag is set, we also check if psi_flags have
> TSK_ONCPU set (meaning the psi_task_switch is imminent) before
> we do the shortcut return.
>
> If TSK_ONCPU is not set, that means we've already switched away,
> and this psi_dequeue call needs to clear the flags.
>
> Fixes: be41bde4c3a8 ("sched: Add an initial sketch of the find_proxy_task() function")
> Reported-by: K Prateek Nayak <kprateek.nayak@xxxxxxx>
> Closes: https://lore.kernel.org/lkml/20251117185550.365156-1-kprateek.nayak@xxxxxxx/
> Signed-off-by: John Stultz <jstultz@xxxxxxxxxx>
> Tested-by: K Prateek Nayak <kprateek.nayak@xxxxxxx>
> ---
> v13:
> * Reworked for collision
> v15:
> * Fixed commit message typo noticed by Todd Kjos
> v24:
> * Reworded commit message in response to K Prateek pointing
> out this issue can affect us earlier in the full proxy
> series than I had anticipated.
>
> Cc: Joel Fernandes <joelagnelf@xxxxxxxxxx>
> Cc: Qais Yousef <qyousef@xxxxxxxxxxx>
> Cc: Ingo Molnar <mingo@xxxxxxxxxx>
> Cc: Peter Zijlstra <peterz@xxxxxxxxxxxxx>
> Cc: Juri Lelli <juri.lelli@xxxxxxxxxx>
> Cc: Vincent Guittot <vincent.guittot@xxxxxxxxxx>
> Cc: Dietmar Eggemann <dietmar.eggemann@xxxxxxx>
> Cc: Valentin Schneider <vschneid@xxxxxxxxxx>
> Cc: Johannes Weiner <hannes@xxxxxxxxxxx>
> Cc: Suren Baghdasaryan <surenb@xxxxxxxxxx>
> Cc: Steven Rostedt <rostedt@xxxxxxxxxxx>
> Cc: Ben Segall <bsegall@xxxxxxxxxx>
> Cc: Zimuzo Ezeozue <zezeozue@xxxxxxxxxx>
> Cc: Mel Gorman <mgorman@xxxxxxx>
> Cc: Will Deacon <will@xxxxxxxxxx>
> Cc: Waiman Long <longman@xxxxxxxxxx>
> Cc: Boqun Feng <boqun.feng@xxxxxxxxx>
> Cc: "Paul E. McKenney" <paulmck@xxxxxxxxxx>
> Cc: Metin Kaya <Metin.Kaya@xxxxxxx>
> Cc: Xuewen Yan <xuewen.yan94@xxxxxxxxx>
> Cc: K Prateek Nayak <kprateek.nayak@xxxxxxx>
> Cc: Thomas Gleixner <tglx@xxxxxxxxxxxxx>
> Cc: Daniel Lezcano <daniel.lezcano@xxxxxxxxxx>
> Cc: Suleiman Souhlal <suleiman@xxxxxxxxxx>
> Cc: kuyo chang <kuyo.chang@xxxxxxxxxxxx>
> Cc: hupu <hupu.gm@xxxxxxxxx>
> Cc: kernel-team@xxxxxxxxxxx
> ---
> kernel/sched/stats.h | 6 +++++-
> 1 file changed, 5 insertions(+), 1 deletion(-)
>
> diff --git a/kernel/sched/stats.h b/kernel/sched/stats.h
> index 26f3fd4d34cea..a38459813b537 100644
> --- a/kernel/sched/stats.h
> +++ b/kernel/sched/stats.h
> @@ -180,8 +180,12 @@ static inline void psi_dequeue(struct task_struct *p, int flags)
> * avoid walking all ancestors twice, psi_task_switch() handles
> * TSK_RUNNING and TSK_IOWAIT for us when it moves TSK_ONCPU.
> * Do nothing here.

Could you add an empty comment line here to start a new paragraph?

> + * In the SCHED_PROXY_EXECUTION case we may do sleeping
> + * dequeues that are not followed by a task switch, so check
> + * TSK_ONCPU is set to ensure the task switch is imminent.
> + * Otherwise clear the flags as usual.
> */
> - if (flags & DEQUEUE_SLEEP)
> + if ((flags & DEQUEUE_SLEEP) && (p->psi_flags & TSK_ONCPU))
> return;

Otherwise, looks good to me. Thanks for the detailed explanation in
the changelog!
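
For anyone following along, the sequence described in the changelog can be replayed with a small userspace model. This is only a sketch, not kernel code: struct model_task, model_change(), model_dequeue() and replay() are hypothetical names, the flag values are assumed to mirror the kernel's (TSK_RUNNING == 4 matches psi_flags=4 in the report), and cgroup ancestor walking is left out entirely.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed to mirror the kernel's values; illustrative only. */
#define TSK_RUNNING   (1u << 2)
#define TSK_ONCPU     (1u << 3)
#define DEQUEUE_SLEEP 0x1

/* Hypothetical stand-in for the psi_flags field of task_struct. */
struct model_task {
	unsigned int psi_flags;
};

/*
 * Models the consistency check behind the "inconsistent task state"
 * warning: setting an already-set flag, or clearing an unset one,
 * is reported as a failure (returns false).
 */
static bool model_change(struct model_task *p, unsigned int clear,
			 unsigned int set)
{
	bool ok = !((p->psi_flags & set) || (p->psi_flags & clear) != clear);

	p->psi_flags &= ~clear;
	p->psi_flags |= set;
	return ok;
}

/* Models psi_dequeue() before (patched == false) and after the fix. */
static void model_dequeue(struct model_task *p, int flags, bool patched)
{
	if (flags & DEQUEUE_SLEEP) {
		/* Old code: always assume a task switch follows. */
		if (!patched)
			return;
		/* New code: only assume so while still on the CPU. */
		if (p->psi_flags & TSK_ONCPU)
			return;
	}
	model_change(p, p->psi_flags, 0);
}

/*
 * Replays the changelog scenario. Returns true if the wakeup enqueue
 * sees consistent flags, false if it would trigger the warning.
 */
static bool replay(bool patched)
{
	struct model_task p = { .psi_flags = TSK_RUNNING | TSK_ONCPU };

	/* Blocked on a mutex: switched off the CPU but kept on the rq. */
	model_change(&p, TSK_ONCPU, 0);
	assert(p.psi_flags == TSK_RUNNING);

	/* Later deactivated, e.g. the mutex owner is on another rq. */
	model_dequeue(&p, DEQUEUE_SLEEP, patched);

	/* Wakeup: enqueue tries to set TSK_RUNNING. */
	return model_change(&p, 0, TSK_RUNNING);
}
```

In this model, replay(false) returns false because the final enqueue runs with psi_flags=4 clear=0 set=4, as in the reported warning, while replay(true) returns true.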

Acked-by: Johannes Weiner <hannes@xxxxxxxxxxx>