Re: [PATCH 1/2] sched: proxy-exec: Close race causing workqueue work being delayed

From: John Stultz

Date: Tue Apr 28 2026 - 22:27:50 EST


On Tue, Apr 28, 2026 at 6:15 AM K Prateek Nayak <kprateek.nayak@xxxxxxx> wrote:
> On 4/28/2026 4:48 PM, Peter Zijlstra wrote:
> > On Tue, Apr 28, 2026 at 11:43:53AM +0200, Peter Zijlstra wrote:
> >> On Mon, Apr 27, 2026 at 06:38:40PM +0000, John Stultz wrote:
> >>
> >>> kernel/sched/core.c | 11 +++++++++++
> >>> 1 file changed, 11 insertions(+)
> >>>
> >>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> >>> index da20fb6ea25ae..5f684caefd8b2 100644
> >>> --- a/kernel/sched/core.c
> >>> +++ b/kernel/sched/core.c
> >>> @@ -7097,6 +7097,17 @@ static void __sched notrace __schedule(int sched_mode)
> >>> try_to_block_task(rq, prev, &prev_state,
> >>> !task_is_blocked(prev));
> >>> switch_count = &prev->nvcsw;
> >>> + } else if (preempt && prev->blocked_on) {
> >>> + /*
> >>> + * If we are SM_PREEMPT, we may have interrupted
> >>> + * after blocked_on was set, before schedule()
> >>> + * was run, preventing workques from running. So
> >>
> >> workqueues
> >>
> >>> + * clear blocked_on and mark task RUNNING so it
> >>> + * can be reselected to run and complete its
> >>> + * logic
> >>> + */
> >>> + WRITE_ONCE(prev->__state, TASK_RUNNING);
> >>> + clear_task_blocked_on(prev, NULL);
> >>> }
> >>>
> >>> pick_again:
> >>
> >> *groan*, this feels wrong. Preemption should never touch state. Let me
> >> try and wake up and make sense of this.
> >
> > So all non-special block states *SHOULD* be in a loop and handle
> > spurious wakeups -- I fixed a pile of offenders some many years ago, but
> > there really isn't anything in the kernel that validates this.
> >
> > [ I suppose someone could try and do a cocci test for this? ]
> >
> > Any wait for non-special states that is not a loop is fundamentally
> > broken, since many of the lock wake-up paths are explicitly racy in that
> > they can cause spurious wakeups (which is the safe side of the race,
> > since insufficient wakeups is bad etc.).
> >
> > OTOH special states, are special, esp. because they cannot handle
> > spurious wakeups.
> >
> > Eg, consider something like:
> >
> > set_current_state(TASK_FROZEN)
> > <PREEMPT>
> > current->__state = TASK_RUNNING
> > </PREEMPT/
> > schedule();
> >
> > is all sorts of broken. Now, obiously special states must never have
> > blocked_on set, so this can be fudged about. But still, touching __state
> > from schedule seems wrong.
> >
> > Anyway, the historical distinction between a blocked task and a
> > preempted task is that the blocked task is not on the runqueue, while
> > the preempted task is kept on the runqueue.
> >
> > Obviously PE wrecks this, and hence the problem. And yeah, amazing we
> > never hit this before.
> >
> > Something like so perhaps?
> >
> > ---
> > diff --git a/include/linux/sched.h b/include/linux/sched.h
> > index 368c7b4d7cb5..0bd5da8360f3 100644
> > --- a/include/linux/sched.h
> > +++ b/include/linux/sched.h
> > @@ -846,7 +846,11 @@ struct task_struct {
> > struct alloc_tag *alloc_tag;
> > #endif
> >
> > - int on_cpu;
> > + u8 on_cpu;
> > + u8 on_rq;
> > + u8 is_blocked;
> > + u8 __pad;
> > +
> > struct __call_single_node wake_entry;
> > unsigned int wakee_flips;
> > unsigned long wakee_flip_decay_ts;
> > @@ -861,7 +865,6 @@ struct task_struct {
> > */
> > int recent_used_cpu;
> > int wake_cpu;
> > - int on_rq;
> >
> > int prio;
> > int static_prio;
> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > index da20fb6ea25a..06817ae0cbd9 100644
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -615,6 +615,13 @@ EXPORT_SYMBOL(__trace_set_current_state);
> > * [ The astute reader will observe that it is possible for two tasks on one
> > * CPU to have ->on_cpu = 1 at the same time. ]
> > *
> > +* p->is_blocked <- { 0, 1 }:
> > +*
> > +* is set by block_task() and cleared by ttwu_do_activate() and indicates
> > +* this task is blocked, as opposed to runnable. Used to distinguish between
> > +* preempted and blocked tasks for proxy exec, which keeps everything on the
> > +* runqueue.
> > + *
> > * task_cpu(p): is changed by set_task_cpu(), the rules are:
> > *
> > * - Don't call set_task_cpu() on a blocked task:
> > @@ -2225,6 +2232,7 @@ void deactivate_task(struct rq *rq, struct task_struct *p, int flags)
> >
> > static void block_task(struct rq *rq, struct task_struct *p, int flags)
> > {
> > + p->is_blocked = 1;
>
> We never reach here with PROXY_EXEC. Instead we bail out in the caller
> try_to_block_task() ...
>
> > if (dequeue_task(rq, p, DEQUEUE_SLEEP | flags))
> > __block_task(rq, p);
> > }
> > @@ -3722,6 +3730,7 @@ ttwu_do_activate(struct rq *rq, struct task_struct *p, int wake_flags,
> > atomic_dec(&task_rq(p)->nr_iowait);
> > }
> >
> > + p->is_blocked = 0;
> > activate_task(rq, p, en_flags);
> > wakeup_preempt(rq, p, wake_flags);
> >
> > @@ -7107,7 +7116,7 @@ static void __sched notrace __schedule(int sched_mode)
> > struct task_struct *prev_donor = rq->donor;
> >
> > rq_set_donor(rq, next);
> > - if (unlikely(next->blocked_on)) {
> > + if (unlikely(next->is_blocked && next->blocked_on)) {
>
> ... so ->is_blocked here is always false for proxy tasks retained on
> the runqueue.
>
> I was trying something like below but I'm somewhere missing a
> clear_task_blocked_on() for PROXY_WAKING before going back into
> mutex_lock_common():
>
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 8ec3b6d7d718b..6ea74aecc5fbd 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -586,6 +586,7 @@ struct sched_entity {
> unsigned char sched_delayed;
> unsigned char rel_deadline;
> unsigned char custom_slice;
> + unsigned char sched_proxy;
> /* hole */

I feel like this is so tied to the blocked_on value, I suspect it
makes the most sense to have this flag be the low bit of that pointer?

Sort of a blocked_on latch, to signal its really in effect?

Plus it gets cleared automatically on set and clear, so it looks a
little cleaner.


> @@ -6535,8 +6536,13 @@ static bool try_to_block_task(struct rq *rq, struct task_struct *p,
> * blocked on a mutex, and we want to keep it on the runqueue
> * to be selectable for proxy-execution.
> */
> - if (!should_block)
> + if (!should_block) {
> + guard(raw_spinlock)(&p->blocked_lock);
> + /* Stable against race */
> + if (task_is_blocked(p))
> + WRITE_ONCE(p->se.sched_proxy, 1);
> return false;
> + }

So if we double check and find the task isn't blocked anymore, we
probably shouldn't return early here, no?

Let me take a stab at the bit flag approach and see how it goes.

thanks
-john