[PATCH RFC 0/5] workqueue: add WQ_AFFN_CACHE_SHARD affinity scope

From: Breno Leitao

Date: Thu Mar 12 2026 - 12:19:39 EST


TL;DR: Some modern processors have many CPUs per LLC (L3 cache), and
unbound workqueues using the default affinity (WQ_AFFN_CACHE) collapse
to a single worker pool, causing heavy spinlock (pool->lock) contention.
Create a new affinity (WQ_AFFN_CACHE_SHARD) that caps each pool at
wq_cache_shard_size CPUs (default 8).

Problem
=======

Some modern systems have many CPUs sharing one LLC. Here are some examples I have
access to:

* NVIDIA Grace CPU: 72 real CPUs per LLC
* Intel(R) Xeon(R) Gold 6450C: 59 SMT threads per LLC
* Intel(R) Xeon(R) Platinum 8321HC: 51 SMT threads per LLC

On these systems, unbound workqueues use the default WQ_AFFN_CACHE
affinity, which collapses to a single worker pool for the whole system
when all CPUs share the same LLC, as on the machines above.

This causes contention on pool->lock, potentially affecting IO
performance (btrfs, writeback, etc.).

When profiling an IO-intensive usercache workload at Meta, I found
significant contention in __queue_work(), with pool->lock among the
top 5 contended locks.

Additionally, Chuck Lever recently reported this problem:

"For example, on a 12-core system with a single shared L3 cache running
NFS over RDMA with 12 fio jobs, perf shows approximately 39% of CPU
cycles spent in native_queued_spin_lock_slowpath, nearly all from
__queue_work() contending on the single pool lock.

On such systems WQ_AFFN_CACHE, WQ_AFFN_SMT, and WQ_AFFN_NUMA
scopes all collapse to a single pod."

Link: https://lore.kernel.org/all/20260203143744.16578-1-cel@xxxxxxxxxx/

Solution
========

Tejun suggested solving this problem by creating an intermediate
affinity level (aka cache_shard) that shards WQ_AFFN_CACHE using a
simple heuristic, so these affinity scopes no longer collapse into a
single pod.

This series implements that suggestion: it adds an intermediate sharded
cache affinity and makes it the default.

Micro benchmark
===============

To measure the benefit, I wrote a microbenchmark (part of this series)
that calls queue_work() in a loop and reports the per-item enqueue
latency.

Benchmark on NVIDIA Grace (72 CPUs, single LLC, 50k items/thread):

cpu 3248519 items/sec p50=10944 p90=11488 p95=11648 ns
smt 3362119 items/sec p50=10945 p90=11520 p95=11712 ns
cache_shard 3629098 items/sec p50=6080 p90=8896 p95=9728 ns (NEW) **
cache 708168 items/sec p50=44000 p90=47104 p95=47904 ns
numa 710559 items/sec p50=44096 p90=47265 p95=48064 ns
system 718370 items/sec p50=43104 p90=46432 p95=47264 ns

Same benchmark on the Intel Xeon Platinum 8321HC:

cpu 2831751 items/sec p50=3909 p90=9222 p95=11580 ns
smt 2810699 items/sec p50=2229 p90=4928 p95=5979 ns
cache_shard 1861028 items/sec p50=4874 p90=8423 p95=9415 ns (NEW)
cache 591001 items/sec p50=24901 p90=29865 p95=31169 ns
numa 590431 items/sec p50=24901 p90=29819 p95=31133 ns
system 591912 items/sec p50=25049 p90=29916 p95=31219 ns

(** It is still unclear why cache_shard beats SMT on Grace/ARM. The
result is consistently reproducible, though; I am still investigating.)

Block benchmark
===============

Host: Intel(R) Xeon(R) D-2191A CPU @ 1.60GHz (16 Cores - 32 SMT)

To stress the workqueue, I ran fio on a dm-crypt device.

1) Create a plain dm-crypt device on top of NVMe
* cryptsetup creates an encrypted block device (/dev/mapper/crypt_nvme) on top
of a raw NVMe drive. All I/O to this device goes through kcryptd — dm-crypt's
workqueue that handles AES encryption/decryption of every data block.

# cryptsetup open --type plain -c aes-xts-plain64 -s 256 /dev/nvme0n1 crypt_nvme -d -

2) Run fio
* fio hammers the encrypted device with 36 threads (one per CPU), each doing
128-deep 4K _buffered_ I/O for 10 seconds. This generates massive workqueue
pressure — every I/O completion triggers a kcryptd work item to encrypt or
decrypt data.

# fio --filename=/dev/mapper/crypt_nvme \
--name=crypt_bench --rw=randread \
--ioengine=io_uring --direct=0 \
--bs=4k --iodepth=128 \
--numjobs=$(nproc) --runtime=10 \
--time_based --group_reporting

(--rw is set to randread, randwrite, or randrw to match the workload
rows in the tables below.)

Running this for ~3 hours:

┌────────────┬────────────────────────┬────────────────────────┬───────────┬────────┬─────────────────┐
│ Workload │ Avg cache │ Avg cache_shard │ Avg delta │ Stddev │ 2-sigma range │
├────────────┼────────────────────────┼────────────────────────┼───────────┼────────┼─────────────────┤
│ randread │ 389 MiB/s (99.6k IOPS) │ 413 MiB/s (106k IOPS) │ +5.9% │ 3.3% │ -0.7% to +12.5% │
├────────────┼────────────────────────┼────────────────────────┼───────────┼────────┼─────────────────┤
│ randwrite │ 622 MiB/s (159k IOPS) │ 614 MiB/s (157k IOPS) │ -1.3% │ 0.9% │ -3.1% to +0.5% │
├────────────┼────────────────────────┼────────────────────────┼───────────┼────────┼─────────────────┤
│ randrw │ 240 MiB/s (61.4k IOPS) │ 250 MiB/s (64.1k IOPS) │ +4.3% │ 3.4% │ -2.5% to +11.1% │
└────────────┴────────────────────────┴────────────────────────┴───────────┴────────┴─────────────────┘

Same results for buffered IO:

┌───────────┬────────────────────────┬────────────────────────┬───────────┬────────┬────────────────┐
│ Workload │ Avg cache │ Avg cache_shard │ Avg delta │ Stddev │ 2-sigma range │
├───────────┼────────────────────────┼────────────────────────┼───────────┼────────┼────────────────┤
│ randread │ 559 MiB/s (143k IOPS) │ 577 MiB/s (148k IOPS) │ +3.1% │ 1.3% │ +0.5% to +5.7% │
├───────────┼────────────────────────┼────────────────────────┼───────────┼────────┼────────────────┤
│ randwrite │ 437 MiB/s (112k IOPS) │ 431 MiB/s (110k IOPS) │ -1.5% │ 1.0% │ -3.5% to +0.5% │
├───────────┼────────────────────────┼────────────────────────┼───────────┼────────┼────────────────┤
│ randrw │ 272 MiB/s (69.7k IOPS) │ 273 MiB/s (69.8k IOPS) │ +0.1% │ 1.5% │ -2.9% to +3.1% │
└───────────┴────────────────────────┴────────────────────────┴───────────┴────────┴────────────────┘

(The randwrite delta appears to be within the noise.)

Patchset organization
=====================

This series adds a new WQ_AFFN_CACHE_SHARD affinity scope that
subdivides each LLC into groups of at most wq_cache_shard_size CPUs
(default 8, tunable via boot parameter), providing an intermediate
option between per-LLC and per-SMT-core granularity.

Beyond the new scope itself, the series fixes a prefix-matching bug in
parse_affn_scope(), adds a workqueue benchmark module, and teaches
wq_dump.py about the new scope.

Finally, it makes the new cache_shard affinity the default.

On systems with 8 or fewer CPUs per LLC, CACHE_SHARD produces a single
shard covering the entire LLC, making it functionally identical to the
previous CACHE default. The sharding only activates when an LLC has more
than 8 CPUs.

---
Breno Leitao (5):
workqueue: fix parse_affn_scope() prefix matching bug
workqueue: add WQ_AFFN_CACHE_SHARD affinity scope
workqueue: set WQ_AFFN_CACHE_SHARD as the default affinity scope
workqueue: add test_workqueue benchmark module
tools/workqueue: add CACHE_SHARD support to wq_dump.py

include/linux/workqueue.h | 1 +
kernel/workqueue.c | 72 ++++++++++--
lib/Kconfig.debug | 10 ++
lib/Makefile | 1 +
lib/test_workqueue.c | 275 +++++++++++++++++++++++++++++++++++++++++++++
tools/workqueue/wq_dump.py | 3 +-
6 files changed, 352 insertions(+), 10 deletions(-)
---
base-commit: b29fb8829bff243512bb8c8908fd39406f9fd4c3
change-id: 20260309-workqueue_sharded-2327956e889b

Best regards,
--
Breno Leitao <leitao@xxxxxxxxxx>