Re: [PATCH v3 0/3] workqueue: Shrink the lock time

From: Krishna Magar

Date: Thu Jun 25 2026 - 03:38:37 EST

Hello,

I tested v3 of this patch series on a 2-socket Intel Xeon Gold 6330
system (56 physical cores, 112 threads) using a buffered random write
fio workload on a 4-NVMe RAID0 configuration.
Each socket contains a 42 MB LLC shared by 28 physical cores, and the
tests were run with cache_shard_size=28 to align shards with the LLC
topology.

fio configuration:
ioengine=libaio
rw=randwrite
bs=4k
iodepth=256
numjobs=224
runtime=60s
time_based=1
direct=0 (buffered I/O)
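
For anyone who wants to rerun this, the options above can be assembled
into a fio command line roughly as sketched below. The job name and the
target directory are hypothetical placeholders, not values from my setup,
and options not listed above (for example the per-job file size) still
have to be supplied.

```python
import shlex


def fio_argv(directory, name="randwrite-buffered"):
    """Build a fio argv from the options listed in this report.

    `directory` and `name` are hypothetical placeholders; they are not
    taken from the original test setup.
    """
    return [
        "fio",
        f"--name={name}",
        f"--directory={directory}",
        "--ioengine=libaio",
        "--rw=randwrite",
        "--bs=4k",
        "--iodepth=256",
        "--numjobs=224",
        "--runtime=60s",
        "--time_based",
        "--direct=0",  # buffered I/O, goes through the page cache
    ]


if __name__ == "__main__":
    # "/mnt/raid0" is an assumed mount point, used only for illustration.
    print(shlex.join(fio_argv("/mnt/raid0")))
```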

I compared:
- Default cache affinity
- Unpatched cache_shard (size=28)
- Patched cache_shard (size=28)

Lock contention and throughput metrics were collected using perf lock
record across four independent runs of the workload, and the reported
values are averages of those runs.

Average results:

Configuration           Avg IOPS (k)   Avg Context Switches   Avg Lock Wait Time
---------------------   ------------   --------------------   ------------------
Default cache           191.5          1.54M                  1.30 h
Unpatched cache_shard   189.8          1.60M                  1.34 h
Patched cache_shard     191.8          1.51M                  1.29 h

Throughput was very similar across all configurations (189.8k to 191.8k
IOPS), while the patched configuration showed slightly lower lock wait
time (1.29 h vs. 1.34 h) and context-switch counts (1.51M vs. 1.60M)
than the unpatched cache_shard configuration.
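
To put "slightly" into numbers, the relative change between two
configurations can be computed from the averages reported above. This is
only a small sketch; the dictionary keys and function names are mine.

```python
# Averages from this report, per configuration:
# (IOPS in thousands, context switches in millions, lock wait time in hours)
RESULTS = {
    "cache": (191.5, 1.54, 1.30),
    "unpatched cache_shard": (189.8, 1.60, 1.34),
    "patched cache_shard": (191.8, 1.51, 1.29),
}
METRICS = ("iops", "context_switches", "lock_wait")


def pct_change(new, old):
    """Relative change of `new` versus `old`, in percent."""
    return (new - old) / old * 100.0


def compare(new_cfg, old_cfg):
    """Per-metric percentage change of one configuration versus another."""
    pairs = zip(METRICS, RESULTS[new_cfg], RESULTS[old_cfg])
    return {metric: pct_change(new, old) for metric, new, old in pairs}


if __name__ == "__main__":
    deltas = compare("patched cache_shard", "unpatched cache_shard")
    for metric, delta in deltas.items():
        print(f"{metric:>16}: {delta:+.2f}%")
```

With these averages, patched cache_shard comes out about 1.1% higher in
IOPS, about 5.6% lower in context switches, and about 3.7% lower in lock
wait time than unpatched cache_shard.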

For this workload, I did not observe a measurable throughput
improvement, but the patch appears to reduce some scheduler and
locking overhead.

Tested-by: Krishna Magar <kmagar@xxxxxxxxxx>

Thanks,
Krishna

On Tue, Jun 16, 2026 at 7:04 PM Breno Leitao <leitao@xxxxxxxxxx> wrote:
>
> The goal of this patchset is to decrease the time spent under the
> workqueue pool->lock.
>
> Currently the worker process is woken up inside pool->lock. The wakeup
> ends in wake_up_process(), which takes the target task's rq->lock, so
> rq->lock nests under pool->lock on the two hottest paths of a contended
> unbound workqueue (__queue_work() enqueue and process_one_work() chain
> kick). On some architectures the wakeup is even more expensive: on
> arm64 waking a CPU that is idle (in wfi) issues an IPI.
>
> Doing all of that while holding pool->lock lengthens the locked region
> and hurts throughput on contended unbound pools.
>
> This series shortens the locked region by selecting and claiming the
> worker to wake under pool->lock, but issuing the actual wakeup after the
> lock is dropped, using the wake_q machinery (wake_q_add() under the
> lock, wake_up_q() after).
>
> Because the win is a shorter pool->lock hold time, it shows up most
> clearly as lower enqueue latency under contention.
>
> Performance numbers (based on in-kernel workqueue microbenchmark)
>
> VMs and arm64 (Grace) are where this series is meant to pay off -- waking
> an idle CPU sitting in wfi costs an IPI (on arm64; VMs pay for a similar
> kind of operation), so doing it under pool->lock lengthens the critical
> section.
>
> The arm64 bare-metal numbers match what the x86-or-arm64 VM showed:
>
> affinity_scope   baseline    patched     tput     p95
>                  (items/s)   (items/s)   gain     drop
> --------------   ---------   ---------   ------   ------
> cpu              2,569,880   3,029,740   +17.9%   -13.6%
> smt              2,586,485   3,044,788   +17.7%   -14.0%
> cache_shard        572,055     797,621   +39.4%   -37.1%
> cache              538,132     724,997   +34.7%   -30.1%
> numa               528,673     658,215   +24.5%   -20.5%
> system             524,287     614,486   +17.2%   -21.1%
>
> (p95 drop = change in p95 enqueue latency; negative is better.)
> (tput gain = change in items enqueued per second; bigger is better.)
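
As a sanity check on how the gain column is derived, it can be
recomputed from the baseline and patched columns. A minimal sketch,
assuming the gain is (patched - baseline) / baseline; the names are
mine.

```python
# (baseline items/s, patched items/s, quoted tput gain in %) from the table
ROWS = {
    "cpu": (2_569_880, 3_029_740, 17.9),
    "smt": (2_586_485, 3_044_788, 17.7),
    "cache_shard": (572_055, 797_621, 39.4),
    "cache": (538_132, 724_997, 34.7),
    "numa": (528_673, 658_215, 24.5),
    "system": (524_287, 614_486, 17.2),
}


def tput_gain(baseline, patched):
    """Throughput gain of patched over baseline, in percent."""
    return (patched - baseline) / baseline * 100.0


if __name__ == "__main__":
    for scope, (baseline, patched, quoted) in ROWS.items():
        computed = tput_gain(baseline, patched)
        print(f"{scope:<12} computed {computed:+.1f}%  quoted {quoted:+.1f}%")
```

Under that assumption the recomputed values agree with the quoted gains
to one decimal place for all six affinity scopes.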
>
> Patch 1 is a pure refactor introducing kick_pool_pick().
> Patch 2 defers the wakeup on the enqueue path (__queue_work()).
> Patch 3 defers the wakeup on the per-work chain-kick path
> (process_one_work()).
>
> Changes in v3:
> - Drop the "park kicked worker on pool->kicked_list" patch (v2 1/4).
>   * That fix is independent of this series; if we want to revamp it,
>     it can be sent separately.
> - Link to v2: https://lore.kernel.org/r/20260603-fastwake-v2-0-2977512fe7fa@xxxxxxxxxx
>
> Changes in v2:
> - Close the idle_cull_fn() vs kicked-worker race by parking the kicked
>   worker on a new pool->kicked_list under pool->lock (new patch 1).
>   Reported by Hillf Danton.
> - Use the wake_q machinery (wake_q_add() / wake_up_q() via
>   raw_spin_unlock_wake()) instead of plumbing a task_struct out of the
>   helper by hand. Suggested by Sebastian Andrzej Siewior.
> - Link to v1: https://lore.kernel.org/r/20260526-fastwake-v1-0-e69ad86923e6@xxxxxxxxxx
>
> Signed-off-by: Breno Leitao <leitao@xxxxxxxxxx>
> ---
> To: Tejun Heo <tj@xxxxxxxxxx>
> To: Lai Jiangshan <jiangshanlai@xxxxxxxxx>
> Cc: linux-kernel@xxxxxxxxxxxxxxx
>
> ---
> Breno Leitao (3):
>       workqueue: split kick_pool() into kick_pool_pick() + wake_up_q()
>       workqueue: defer the worker wakeup outside pool->lock in __queue_work()
>       workqueue: defer the worker wakeup outside pool->lock in process_one_work()
>
>  kernel/workqueue.c | 42 +++++++++++++++++++++++++++++++++---------
>  1 file changed, 33 insertions(+), 9 deletions(-)
> ---
> base-commit: 8d6dbbbe3ba62de0a63e962ee004afb848c8e3ac
> change-id: 20260526-fastwake-02982fd66312
>
> Best regards,
> --
> Breno Leitao <leitao@xxxxxxxxxx>
>