Re: [PATCH 2/2] workqueue: defer wake_up_process() outside pool->lock on hot paths

From: Breno Leitao

Date: Wed May 27 2026 - 05:57:10 EST

On Wed, May 27, 2026 at 05:23:40AM +0000, Hillf Danton wrote:
> On Tue, 26 May 2026 14:08:06 -0400 Breno Leitao wrote:
> > @@ -2423,6 +2424,15 @@ static void __queue_work(int cpu, struct workqueue_struct *wq,
> >
> > out:
> > raw_spin_unlock(&pool->lock);
> > + /*
> > + * Issue the wakeup after dropping pool->lock to shorten the
> > + * locked region on this hot enqueue path. kick_pool_pick() did all
> > + * of the work that required the lock (worker selection and
> > + * wake_cpu setup); the wake_up_process() itself only needs to
> > + * take the target rq->lock.
> > + */
> > + if (wake_p)
> > + wake_up_process(wake_p);
> > rcu_read_unlock();
> > }
> >
> I suspect this works without the pool lock held, go and see workers culled
> in idle_cull_fn() for example.

Ack, you're right that wake_up_process() itself works without pool->lock
held — idle_cull_fn() is a good precedent. But sashiko's point made me
look again, and I think there's a latent bug that should be fixed before
(or as part of) this patch.

When kick_pool() picks a worker and wakes it, the worker is flipped to
TASK_RUNNING by the scheduler, but from the workqueue's point of view it
is still WORKER_IDLE and still on pool->idle_list. The transition off
the idle list only happens later, when >the woken kthread actually
schedules in, reacquires pool->lock, and runs worker_leave_idle().

That matters because idle_cull_fn() never looks at the task's scheduler
state. It doesn't read task->__state or anything equivalent — its "is
this worker idle?" check is purely "is it on pool->idle_list, and is
last_active + IDLE_WORKER_TIMEOUT in the past?". So a worker that has
been woken but hasn't yet run worker_leave_idle() is, as far as cull is
concerned, a perfectly valid victim.

The race is then:
CPU A (kicker) CPU B (cull)
---------------- ----------------
raw_spin_lock(&pool->lock)
pick worker W (idle_list head)
wake_up_process(W->task)
raw_spin_unlock(&pool->lock)
raw_spin_lock(&pool->lock)
too_many_workers() == true
W is list_last_entry(idle_list)
set_worker_dying(W, cull_list)
raw_spin_unlock(&pool->lock)
reap_dying_workers()
kthread_stop_put(W->task)
(W finally schedules in)
worker_thread():
raw_spin_lock(&pool->lock)
WORKER_DIE set -> return 0

W exits without ever running worker_leave_idle() or touching
pool->worklist, and the work item we just enqueued is stranded until the
next pool event kicks somebody else.

This window is narrow in mainline today (bounded by scheduler latency
between the kicker dropping pool->lock and the wakee reacquiring it,
plus the 5 s IDLE_WORKER_TIMEOUT precondition), but my patch widens it
considerably — process_one_work()'s deferred-wake site re-enables IRQs
and is fully preemptible across the gap.

I am still unclear how to fix it, but I am thinking about it.

--
pw-bot: cr