Re: workqueue: introduce cache_shard_size
From: Bhithashri385
Date: Wed Apr 22 2026 - 14:49:53 EST
Hi,
I did a quick sanity check of cache_shard_size on a dual-socket system:
* 2x28C/56T (112 CPUs total), x86_64
* single local NVMe (XFS)
* upstream kernel with this change
Workload:
fio, 4k buffered writes with fsync=1
numjobs = 56 / 112 / 168
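For anyone wanting to run something similar, here is a rough sketch of how the fio command lines could be generated. Only the 4k block size, buffered I/O, fsync=1 and the numjobs values come from the run described above. The sequential write pattern, the psync engine, the size and runtime, and the /mnt/xfs directory are placeholder assumptions, not the exact job used.

```python
import shlex


def fio_argv(numjobs, directory="/mnt/xfs"):
    """Build a fio command line for 4k buffered writes with an fsync after every write."""
    return [
        "fio",
        "--name=bufwrite-fsync",          # hypothetical job name
        f"--directory={directory}",       # assumption: XFS mount point
        "--rw=write",                     # assumption: sequential writes
        "--bs=4k",                        # from the workload description
        "--direct=0",                     # buffered I/O
        "--fsync=1",                      # fsync after every write
        "--ioengine=psync",               # assumption
        "--size=1g",                      # assumption: per-job file size
        "--time_based", "--runtime=60",   # assumption: 60 s per run
        f"--numjobs={numjobs}",
        "--group_reporting",
    ]


# Print one command line per job count; nothing is executed here.
for n in (56, 112, 168):
    print(shlex.join(fio_argv(n)))
```

Each job count would be run once per cache_shard_size setting.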
Compared:
workqueue.cache_shard_size=1 vs 8
Results (IOPS / BW):
  jobs=56:
    shard=1: ~265k IOPS, ~1035 MiB/s
    shard=8: ~276k IOPS, ~1078 MiB/s
  jobs=112:
    shard=1: ~248k IOPS, ~968 MiB/s
    shard=8: ~241k IOPS, ~941 MiB/s
  jobs=168:
    shard=1: ~234k IOPS, ~912 MiB/s
    shard=8: ~233k IOPS, ~909 MiB/s
fsync latency (avg):
  jobs=56:
    shard=1: ~33 us
    shard=8: ~28 us
  jobs=112:
    shard=1: ~53 us
    shard=8: ~47 us
  jobs=168:
    shard=1: ~279 us
    shard=8: ~256 us
Observations:
* Small IOPS improvement (~4%) with shard=8 at moderate concurrency
(56 jobs).
* At 112 jobs shard=8 is ~3% lower, and at 168 jobs the two are within
~0.5%; both configs converge once the device is saturated
(~93–96% util).
* Average fsync latency is consistently lower with shard=8 (~8–15%).
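The relative differences can be recomputed from the rounded figures above with a few lines of Python. This is only arithmetic on the approximate numbers quoted in this mail, so the percentages carry the same rounding error.

```python
# Approximate figures copied from the results above: (shard=1, shard=8).
iops_k = {56: (265, 276), 112: (248, 241), 168: (234, 233)}
fsync_us = {56: (33, 28), 112: (53, 47), 168: (279, 256)}


def pct_change(base, new):
    """Relative change of `new` versus `base`, in percent."""
    return (new - base) / base * 100.0


# Positive IOPS change and negative latency change both favour shard=8.
for jobs in sorted(iops_k):
    d_iops = pct_change(*iops_k[jobs])
    d_lat = pct_change(*fsync_us[jobs])
    print(f"jobs={jobs:3d}: IOPS {d_iops:+.1f}%, fsync latency {d_lat:+.1f}%")
```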
Interpretation:
This workload appears to become device-bound quickly, so the benefit
from reduced workqueue contention is limited. The small gain at
moderate load and lower fsync latency suggest sharding is helping
somewhat before hitting the storage bottleneck.
I haven't tried multi-device or more metadata-heavy workloads yet,
which may stress workqueues more directly.
Thanks,
Hithashree