Re: [PATCHSET v1 wq/for-6.5] workqueue: Improve unbound workqueue execution locality

From: Tejun Heo
Date: Fri May 19 2023 - 19:03:15 EST


Oh, one addition.

Below saturation, latency and bandwidth are mostly two sides of the same
coin, but just to be sure, here are the latency results. The single-threaded
sync IO is run with a 1ms interval between IOs.

taskset 0x8 fio --filename=/dev/dm-0 --direct=1 --rw=randrw --bs=512 \
--ioengine=sync --iodepth=1 --runtime=60 --numjobs=1 --time_based \
--group_reporting --name=iops-test-job --verify=sha512 --thinktime=1ms

SYSTEM

read: IOPS=480, BW=240KiB/s (246kB/s)(14.1MiB/60001msec)
clat (usec): min=8, max=401, avg=30.96, stdev= 9.60
lat (usec): min=8, max=401, avg=31.01, stdev= 9.60
clat percentiles (usec):
| 1.00th=[ 11], 5.00th=[ 13], 10.00th=[ 25], 20.00th=[ 27],
| 30.00th=[ 28], 40.00th=[ 29], 50.00th=[ 29], 60.00th=[ 30],
| 70.00th=[ 32], 80.00th=[ 42], 90.00th=[ 44], 95.00th=[ 44],
| 99.00th=[ 46], 99.50th=[ 46], 99.90th=[ 56], 99.95th=[ 71],
| 99.99th=[ 253]
bw ( KiB/s): min= 214, max= 265, per=99.85%, avg=240.29, stdev=11.35, samples=119
iops : min= 428, max= 530, avg=480.59, stdev=22.70, samples=119

CPU_STRICT

read: IOPS=474, BW=237KiB/s (243kB/s)(385KiB/1624msec)
clat (usec): min=9, max=240, avg=28.00, stdev=11.20
lat (usec): min=9, max=240, avg=28.05, stdev=11.20
clat percentiles (usec):
| 1.00th=[ 12], 5.00th=[ 26], 10.00th=[ 26], 20.00th=[ 26],
| 30.00th=[ 27], 40.00th=[ 28], 50.00th=[ 28], 60.00th=[ 28],
| 70.00th=[ 29], 80.00th=[ 30], 90.00th=[ 31], 95.00th=[ 31],
| 99.00th=[ 32], 99.50th=[ 50], 99.90th=[ 241], 99.95th=[ 241],
| 99.99th=[ 241]

CACHE

read: IOPS=479, BW=240KiB/s (245kB/s)(14.0MiB/60002msec)
clat (nsec): min=7874, max=75922, avg=13342.34, stdev=6906.53
lat (nsec): min=7904, max=75952, avg=13386.08, stdev=6906.60
clat percentiles (nsec):
| 1.00th=[ 8384], 5.00th=[ 8896], 10.00th=[ 9152], 20.00th=[ 9408],
| 30.00th=[ 9536], 40.00th=[ 9920], 50.00th=[10432], 60.00th=[10688],
| 70.00th=[11072], 80.00th=[13632], 90.00th=[27264], 95.00th=[28288],
| 99.00th=[30592], 99.50th=[30848], 99.90th=[41216], 99.95th=[56064],
| 99.99th=[74240]
bw ( KiB/s): min= 204, max= 269, per=99.69%, avg=239.67, stdev=11.02, samples=119
iops : min= 408, max= 538, avg=479.34, stdev=22.04, samples=119


It's a bit confusing because fio switched to printing nsecs for CACHE, but
CPU_STRICT (per-cpu)'s average completion latency is, as expected, better
than SYSTEM - 28us vs. 31us - and CACHE's is way better at 13.3us.
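
Since the three runs don't share a unit, here is a rough sketch that puts the
average clat values on the same scale (usec) and shows each relative to
SYSTEM. The numbers are copied from the fio output above; the table layout
and the helper names (to_usec, normalized) are made up for illustration and
aren't part of fio or the patchset.

```python
# Average completion latency (clat) copied from the fio output above,
# together with the unit fio printed it in.
CLAT_AVG = {
    "SYSTEM": (30.96, "usec"),
    "CPU_STRICT": (28.00, "usec"),
    "CACHE": (13342.34, "nsec"),
}

# Multipliers that convert each unit to microseconds.
TO_USEC = {"nsec": 1e-3, "usec": 1.0, "msec": 1e3}


def to_usec(value, unit):
    """Convert a latency value in the given unit to microseconds."""
    return value * TO_USEC[unit]


def normalized(results):
    """Return {name: avg clat in usec} for every run in results."""
    return {name: to_usec(v, u) for name, (v, u) in results.items()}


if __name__ == "__main__":
    usec = normalized(CLAT_AVG)
    base = usec["SYSTEM"]
    for name, v in usec.items():
        # Print each average in usec and as a fraction of SYSTEM's average.
        print(f"{name:<10} {v:6.2f} usec  ({v / base:.2f}x of SYSTEM)")
```

This only compares averages; the percentile tables above carry the tail
behavior and would need the same unit conversion before being compared.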

Thanks.

--
tejun