Re: workqueue: introduce cache_shard_size

From: Bhithashri385

Date: Wed Apr 22 2026 - 14:49:53 EST


Hi,

I did a quick sanity check of cache_shard_size on a dual-socket system:

* 2x28C/56T (112 CPUs total), x86_64
* single local NVMe (XFS)
* upstream kernel with this change

Workload:

fio, 4k buffered writes with fsync=1
numjobs = 56 / 112 / 168

Compared:
workqueue.cache_shard_size=1 vs 8

Results (IOPS / BW):

jobs=56:
shard=1: ~265k IOPS, ~1035 MiB/s
shard=8: ~276k IOPS, ~1078 MiB/s

jobs=112:
shard=1: ~248k IOPS, ~968 MiB/s
shard=8: ~241k IOPS, ~941 MiB/s

jobs=168:
shard=1: ~234k IOPS, ~912 MiB/s
shard=8: ~233k IOPS, ~909 MiB/s
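
As a quick cross-check, the bandwidth figures should be roughly IOPS times the block size. A minimal sketch, assuming "4k" means 4096 bytes and using only the rounded numbers from this mail (the variable names are mine, not from fio output):

```python
# Cross-check reported bandwidth against IOPS * block size.
# Assumption: "4k" means 4096-byte writes; figures are the rounded values above.
BLOCK_BYTES = 4096
MIB = 1024 * 1024

def iops_to_mib_s(iops):
    """Bandwidth in MiB/s implied by an IOPS figure at BLOCK_BYTES per I/O."""
    return iops * BLOCK_BYTES / MIB

# (jobs, shard) -> (reported IOPS, reported MiB/s)
reported = {
    (56, 1): (265_000, 1035),
    (56, 8): (276_000, 1078),
    (112, 1): (248_000, 968),
    (112, 8): (241_000, 941),
    (168, 1): (234_000, 912),
    (168, 8): (233_000, 909),
}

# Difference between derived and reported bandwidth for each run.
diffs = {key: iops_to_mib_s(iops) - bw for key, (iops, bw) in reported.items()}

for (jobs, shard), diff in diffs.items():
    print(f"jobs={jobs} shard={shard}: derived - reported = {diff:+.1f} MiB/s")
```

The derived and reported values agree to within a few MiB/s, which is what I would expect from rounding the IOPS to the nearest thousand.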

fsync latency (avg):

jobs=56:
shard=1: ~33 us
shard=8: ~28 us

jobs=112:
shard=1: ~53 us
shard=8: ~47 us

jobs=168:
shard=1: ~279 us
shard=8: ~256 us

Observations:

* Small improvement (~4%) at moderate concurrency (56 jobs).
* The gain disappears at higher concurrency: shard=8 is ~3% lower at
112 jobs and level with shard=1 at 168 jobs, as both configs
converge once the device is saturated (~93–96% util).
* Average fsync latency is consistently lower with shard=8
(roughly 8-15%).
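
For reference, the relative deltas can be recomputed from the figures above. This is only a sketch over the rounded numbers in this mail, not raw fio output, so the percentages are approximate:

```python
# Relative change of shard=8 versus shard=1, from the rounded figures in this mail.
# Each entry maps job count -> (shard=1 value, shard=8 value).
iops_k = {56: (265, 276), 112: (248, 241), 168: (234, 233)}
fsync_us = {56: (33, 28), 112: (53, 47), 168: (279, 256)}

def pct_change(base, new):
    """Percent change of new relative to base."""
    return (new - base) / base * 100.0

def summarize(table):
    """Per-job-count percent change, rounded to one decimal place."""
    return {jobs: round(pct_change(a, b), 1) for jobs, (a, b) in table.items()}

iops_delta = summarize(iops_k)
lat_delta = summarize(fsync_us)

for jobs in iops_k:
    print(f"jobs={jobs}: IOPS {iops_delta[jobs]:+.1f}%, "
          f"fsync avg {lat_delta[jobs]:+.1f}%")
```

Positive IOPS deltas and negative latency deltas both favour shard=8.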

Interpretation:

This workload appears to become device-bound quickly, so the benefit
from reduced workqueue contention is limited. The small gain at
moderate load and lower fsync latency suggest sharding is helping
somewhat before hitting the storage bottleneck.

I haven't tried multi-device or more metadata-heavy workloads yet,
which may stress workqueues more directly.

Thanks,
Hithashree