Re: [PATCH v6 1/4] block: add task-context bio completion infrastructure
From: Christoph Hellwig
Date: Wed May 27 2026 - 09:13:21 EST
On Wed, May 27, 2026 at 11:42:28AM +0200, Jan Kara wrote:
> > I ran some experiments with fio on both XFS and a raw block device. Five
> > iterations each for 60s. Results below.
> >
> > TLDR: Removing the delay doesn't significantly decrease user-visible
> > latency or otherwise improve performance, but does significantly reduce
> > throughput and increase context switches in some workloads (e.g. C).
> > I think it makes sense to leave the delay as-is. Thoughts?
>
> Thanks for the test! One question below:
Thanks from me as well!
>
> > Results:
> >
> > Workloads (all `uncached=1`):
> > A: rw=write bs=128k iodepth=1 ioengine=pvsync2 # XFS
> > B: rw=write bs=128k iodepth=128 ioengine=io_uring # XFS
> > C: rw=randwrite bs=4k iodepth=32 ioengine=io_uring # XFS
> > D: rw=rw 50/50 bs=64k iodepth=32 ioengine=io_uring # XFS
> > E: rw=write bs=128k iodepth=128 ioengine=io_uring # raw /dev/nvmeXn1
> > F: rw=write bs=128k iodepth=128 numjobs=4
> > + vm.dirty_bytes=64MB, vm.dirty_background_bytes=32MB # XFS
> >
> > Mean ± stddev across 5 iterations:
> >
> >   metric          delay=1            delay=0            delta
> > --------------------------------------------------------------
> >
> > A seq 128k qd1
> >   BW (MB/s)       4333 ± 27          4374 ± 34          +0.9%
> >   p99 (us)        36.2 ± 0.8         35.8 ± 0.4         -1.1%
> >   p999 (us)       3260 ± 75          3228 ± 29          -1.0%
> >   ctx-switches    184 k ± 59 k       3.68 M ± 65 k      +1903%
> >   cs / io         0.09 ± 0.03        1.86 ± 0.03        +1888%
> >   avg bios/run    80.4 ± 0.6         1.1 ± 0.0          -98.7%
>
> So a 1 jiffy delay is (with default HZ=1000) 1ms. That means for this load
> the completion latency should be at least 1000us, but your results show a
> p99 latency of 36us. What am I missing?
Yes, this looks a bit odd. Unless there are multiple threads submitting
and the completions somehow get batched, this should complete one
bio at a time and be the worst case for the delay scheme.
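
As a rough sanity check on the arithmetic in question, here is a small
sketch that compares the length of one jiffy with the per-I/O time implied
by the reported bandwidth for workload A. It assumes fio's "MB/s" means
10**6 bytes per second, that bs=128k means 128 * 1024 bytes, and that the
delay is exactly one jiffy; the helper names are made up for illustration.

```python
# Back-of-the-envelope check for workload A (seq 128k, qd1).
# Assumptions (not stated in the thread): "MB/s" is 10**6 bytes/s,
# bs=128k is 128 * 1024 bytes, and the delay is exactly one jiffy.

BLOCK_SIZE = 128 * 1024       # bytes per write in workload A
MEASURED_BW_MB_S = 4333       # delay=1 bandwidth from the table


def jiffy_us(hz: int) -> float:
    """Length of one jiffy in microseconds for a given HZ value."""
    return 1_000_000 / hz


def mean_io_time_us(bw_mb_s: float, block_size: int) -> float:
    """Mean time per I/O at queue depth 1 implied by the bandwidth."""
    ios_per_sec = bw_mb_s * 1_000_000 / block_size
    return 1_000_000 / ios_per_sec


def max_bw_if_each_io_waits(delay_us: float, block_size: int) -> float:
    """Bandwidth ceiling in MB/s if every qd1 I/O waited delay_us."""
    return block_size / delay_us   # bytes per microsecond equals MB/s


if __name__ == "__main__":
    for hz in (100, 250, 1000):
        d = jiffy_us(hz)
        ceiling = max_bw_if_each_io_waits(d, BLOCK_SIZE)
        print(f"HZ={hz}: 1 jiffy = {d:.0f} us, qd1 ceiling = {ceiling:.1f} MB/s")
    per_io = mean_io_time_us(MEASURED_BW_MB_S, BLOCK_SIZE)
    print(f"measured: {per_io:.1f} us per I/O")
```

Under these assumptions the measured 4333 MB/s works out to roughly 30us
per write, which is in line with the reported p99, while a full 1ms wait
per write at HZ=1000 would cap qd1 throughput at about 131 MB/s. That
suggests the latency fio reports here is not waiting on the delayed
completion.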
> > C rand 4k qd32
> >   BW (MB/s)       66.2 ± 0.8         44.6 ± 7.4         -32.7%
> >   p99 (us)        8002 ± 174         17990 ± 6800       +124.8%
> >   p999 (us)       11390 ± 554        31890 ± 11076      +180.0%
> >   ctx-switches    3.67 M ± 45 k      3.59 M ± 106 k     -2.2%
> >   cs / io         3.78 ± 0.04        5.62 ± 0.83        +48.7%
> >   avg bios/run    32.3 ± 1.0         3.1 ± 0.3          -90.5%
>
> I'm somewhat surprised how much larger the completion latency is here
> without the delay. Is that due to contention on the local lock between the
> IO completion interrupt and the worker? Or why is the completion latency so
> big here when case B, with more IOs in flight and fewer bios per run, still
> had significantly lower latency in the delay=0 case?
Note that in the past we had major problems with workqueue scheduling
latency. At some point these got mitigated a lot, but if they are back
for this workload that might be one reason.
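
To put the workload C numbers in perspective, here is a minimal sketch
that applies Little's law (mean latency = I/Os in flight / throughput) to
the reported bandwidth. It assumes the queue stays full at iodepth=32,
that "MB/s" means 10**6 bytes per second and that bs=4k means 4096 bytes;
the function names are hypothetical.

```python
# Little's law estimate of mean latency for workload C (rand 4k, qd32).
# Assumptions (not stated in the thread): the queue is always full,
# "MB/s" is 10**6 bytes/s, and bs=4k is 4096 bytes.

BLOCK_SIZE = 4 * 1024   # bytes per write
IODEPTH = 32            # I/Os kept in flight


def iops(bw_mb_s: float, block_size: int) -> float:
    """I/Os per second implied by a bandwidth in MB/s."""
    return bw_mb_s * 1_000_000 / block_size


def mean_latency_us(bw_mb_s: float, block_size: int, depth: int) -> float:
    """Mean per-I/O latency in microseconds from Little's law."""
    return depth / iops(bw_mb_s, block_size) * 1_000_000


if __name__ == "__main__":
    for label, bw in (("delay=1", 66.2), ("delay=0", 44.6)):
        lat = mean_latency_us(bw, BLOCK_SIZE, IODEPTH)
        print(f"{label}: {iops(bw, BLOCK_SIZE):.0f} IOPS, mean latency ~{lat:.0f} us")
```

With these assumptions the implied mean latency goes from about 2.0ms
with the delay to about 2.9ms without it, roughly 1.5x, while p99 and
p999 more than double. That would be consistent with occasional long
stalls rather than a uniform slowdown, but it is only an estimate from
the summary numbers.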