Re: [PATCH 1/2] sched: proxy-exec: Close race causing workqueue work being delayed
From: John Stultz
Date: Thu Apr 30 2026 - 16:40:41 EST
On Tue, Apr 28, 2026 at 4:18 AM Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote:
>
> On Tue, Apr 28, 2026 at 11:43:53AM +0200, Peter Zijlstra wrote:
> > On Mon, Apr 27, 2026 at 06:38:40PM +0000, John Stultz wrote:
> >
> > > kernel/sched/core.c | 11 +++++++++++
> > > 1 file changed, 11 insertions(+)
> > >
> > > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > > index da20fb6ea25ae..5f684caefd8b2 100644
> > > --- a/kernel/sched/core.c
> > > +++ b/kernel/sched/core.c
> > > @@ -7097,6 +7097,17 @@ static void __sched notrace __schedule(int sched_mode)
> > > try_to_block_task(rq, prev, &prev_state,
> > > !task_is_blocked(prev));
> > > switch_count = &prev->nvcsw;
> > > + } else if (preempt && prev->blocked_on) {
> > > + /*
> > > + * If we are SM_PREEMPT, we may have interrupted
> > > + * after blocked_on was set, before schedule()
> > > + * was run, preventing workques from running. So
> >
> > workqueues
> >
> > > + * clear blocked_on and mark task RUNNING so it
> > > + * can be reselected to run and complete its
> > > + * logic
> > > + */
> > > + WRITE_ONCE(prev->__state, TASK_RUNNING);
> > > + clear_task_blocked_on(prev, NULL);
> > > }
> > >
> > > pick_again:
> >
> > *groan*, this feels wrong. Preemption should never touch state. Let me
> > try and wake up and make sense of this.
>
> So all non-special block states *SHOULD* be in a loop and handle
> spurious wakeups -- I fixed a pile of offenders some many years ago, but
> there really isn't anything in the kernel that validates this.
>
> [ I suppose someone could try and do a cocci test for this? ]
>
> Any wait for non-special states that is not a loop is fundamentally
> broken, since many of the lock wake-up paths are explicitly racy in that
> they can cause spurious wakeups (which is the safe side of the race,
> since insufficient wakeups is bad etc.).
>
> OTOH special states, are special, esp. because they cannot handle
> spurious wakeups.
>
> Eg, consider something like:
>
> set_current_state(TASK_FROZEN)
> <PREEMPT>
> current->__state = TASK_RUNNING
> </PREEMPT/
> schedule();
>
> is all sorts of broken. Now, obiously special states must never have
> blocked_on set, so this can be fudged about. But still, touching __state
> from schedule seems wrong.
Hey Peter,
Thanks again for the background here and the alternative proposal
that K Prateek and I have iterated on.
I did want to clarify one case here just for my understanding. The
approach in this first proposal was sort of leaning on the similar
pattern in the signal_pending_state() case in try_to_block_task(). Is
that just a hard exception to the rule here?
thanks
-john