Re: [PATCH 2/2] workqueue: defer wake_up_process() outside pool->lock on hot paths

From: Breno Leitao

Date: Wed May 27 2026 - 11:00:21 EST


On Wed, May 27, 2026 at 10:48:12AM +0100, Breno Leitao wrote:
> I am still unclear how to fix it, but I am thinking about it.

I would say the best approach is to take the worker from the idle list
before waking it up.

worker_leave_idle(worker);
wake_up_process(p);

Not later when the worker is scheduled (worker_thread()).


commit f0a524a156b7563b9b37242aa42c822f7a8256e2
Author: Breno Leitao <leitao@xxxxxxxxxx>
Date: Wed May 27 07:39:37 2026 -0700

workqueue: claim picked idle worker under pool->lock in kick_pool()

kick_pool() picks an idle worker via first_idle_worker(), optionally
adjusts the wake_cpu and repatriates it, then wakes it via
wake_up_process() -- but it does not remove the picked worker from
pool->idle_list and does not clear WORKER_IDLE. Those transitions
happen later in worker_thread:woke_up: when the woken kthread
reacquires pool->lock and calls worker_leave_idle().

idle_cull_fn() walks pool->idle_list from the tail and calls
set_worker_dying() on entries whose last_active + IDLE_WORKER_TIMEOUT
has elapsed. set_worker_dying() only checks WORKER_IDLE; it does
not inspect task->__state. A worker that has been kicked but has
not yet rerun worker_leave_idle() therefore remains a valid cull
victim. If cull races in between the kick and the wakee reacquiring
pool->lock, the worker gets WORKER_DIE set and is reaped by
kthread_stop_put(). On wakeup it sees WORKER_DIE in worker_thread()
and returns 0 without consuming pool->worklist, stranding the
just-enqueued work item until the next pool event kicks somebody
else.

CPU A (kicker) CPU B (cull)
---------------- ----------------
raw_spin_lock(&pool->lock)
pick worker W (idle_list head)
wake_up_process(W->task)
raw_spin_unlock(&pool->lock)
raw_spin_lock(&pool->lock)
too_many_workers() == true
W is list_last_entry(idle_list)
set_worker_dying(W, cull_list)
raw_spin_unlock(&pool->lock)
reap_dying_workers()
kthread_stop_put(W->task)
(W finally schedules in)
worker_thread():
raw_spin_lock(&pool->lock)
WORKER_DIE set -> return 0

Today the window is bounded by scheduler latency between the kicker
dropping pool->lock and the wakee reacquiring it, plus the 5 minute
IDLE_WORKER_TIMEOUT precondition. Deferring wake_up_process()
outside pool->lock on the hot enqueue / process_one_work() paths
widens it considerably (IRQs re-enabled, fully preemptible).

Close the window by claiming the picked worker under pool->lock:
call worker_leave_idle() in kick_pool() before wake_up_process().
That removes the worker from pool->idle_list and clears WORKER_IDLE
while we still hold the lock, so set_worker_dying()'s WORKER_IDLE
precondition is no longer satisfied and idle_cull_fn()'s tail walk
cannot see the worker. Gate the existing worker_leave_idle() call
in worker_thread:woke_up: behind WORKER_IDLE, so first-time wakeups
from create_worker() (which do not go through kick_pool()) still
transition correctly.

Signed-off-by: Breno Leitao <leitao@xxxxxxxxxx>

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 8df671066dd1..321df19f9fd8 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -1310,6 +1310,18 @@ static bool kick_pool(struct worker_pool *pool)
}
}
#endif
+ /*
+ * Claim @worker under pool->lock so it cannot race idle_cull_fn():
+ * once @worker is off pool->idle_list and no longer WORKER_IDLE,
+ * set_worker_dying() will skip it and the cull walk cannot reach
+ * it. Otherwise, the woken worker remains WORKER_IDLE on
+ * pool->idle_list until it actually schedules in and runs
+ * worker_leave_idle() in worker_thread:woke_up:, leaving a window
+ * where a concurrent idle_cull_fn() can flag it WORKER_DIE and
+ * kthread_stop_put() it before it consumes pool->worklist,
+ * stranding the just-enqueued work.
+ */
+ worker_leave_idle(worker);
wake_up_process(p);
return true;
}
@@ -3447,7 +3459,13 @@ static int worker_thread(void *__worker)
return 0;
}

- worker_leave_idle(worker);
+ /*
+ * Kicked workers have already been removed from pool->idle_list
+ * by kick_pool(); only first-time wakeups (via create_worker())
+ * still arrive with WORKER_IDLE set.
+ */
+ if (worker->flags & WORKER_IDLE)
+ worker_leave_idle(worker);
recheck:
/* no more worker necessary? */
if (!need_more_worker(pool))