Re: [PATCH v3 2/3] workqueue: defer the worker wakeup outside pool->lock in __queue_work()

From: Sebastian Andrzej Siewior

Date: Wed Jun 24 2026 - 04:48:25 EST


On 2026-06-16 06:33:32 [-0700], Breno Leitao wrote:
> __queue_work() is the enqueue hot path: it inserts the work item and
> calls kick_pool() while holding pool->lock. kick_pool() ends in a
> wakeup, which takes the target task's rq->lock, so rq->lock nests under
> pool->lock on every enqueue that wakes a worker on a contended unbound
> pool.
>
> Use kick_pool_pick() to select and claim the worker under pool->lock,
> queue it on an on-stack wake_q, and issue the wakeup with wake_up_q()
> right after dropping the lock via raw_spin_unlock_wake(). Worker
> selection, wake_cpu setup and claiming the worker off pool->idle_list
> still happen under the lock; only the rq->lock acquisition moves out.
>
> Signed-off-by: Breno Leitao <leitao@xxxxxxxxxx>

> --- a/kernel/workqueue.c
> +++ b/kernel/workqueue.c

> out:
> - raw_spin_unlock(&pool->lock);
> + raw_spin_unlock_wake(&pool->lock, &wakeq);
> rcu_read_unlock();
> }

It is not wrong but I am not sure if this is really needed here. The
pattern
preempt_disable();
raw_spin_unlock();
wake_up_q()
preempt_enable();

is used to prevent task preemption after the unlock operation. The
futex/ locking code needs to wake a task but before the unlock operation
the task priority might have been lowered as result of dropping the
lock. This means it might not be the task with the highest priority in
the system and the task with highest is not yet active and we schedule a
task in middle in the instead.
To form this easier: say we have Task A prio=1, B prio=2 and C prio=3
with higher number higher priority.
A owns a lock and is preempted by B. C gets on the CPU preempts B. C
wants A's lock so it passes its priority to A (PI-boost) and goes idle.
A gets on the CPU and unlocks.
Now: As part of the unlock operation (before the raw_spin_unlock()) A
goes back to its initial priority and C is not yet woken up meaning B is
the task with highest priority and the held raw_spinlock_t is the only
thing preventing scheduling. So we disable preemption, unlock and then
perform the wake-up so C becomes the next candidate (as it should be) to
occupy the CPU.
Otherwise it would be B which means per definition a task with lower
priority runs before a task with higher priority.

I don't think workqueue has this requirements here. Worst case something
else gets on the CPU and worker wakeup is delayed until the task is
scheduled again. It could be fine since the preemption could happen
before queue-work().
It probably does not lead to a huge performance
regression but who knows.

Your goal was to lower the contention on the pool lock during the wake
up which you achieved. The __queue_work() remains still not preemptible
until after the wake up which might be fine.

Sebastian