Re: [PATCH v6 1/4] block: add task-context bio completion infrastructure

From: Jan Kara

Date: Wed May 27 2026 - 05:42:42 EST

On Tue 26-05-26 15:29:28, Tal Zussman wrote:
> On 5/25/26 1:17 AM, Christoph Hellwig wrote:
> > On Fri, May 22, 2026 at 06:47:43PM -0400, Tal Zussman wrote:
> >> > But this 1-jiffy delay also means we unconditionally increase
> >> > completion latency, which feels like a bad idea. Do you have any
> >> > measurements that show where it does benefit? Note that queuing work
> >> > already often has very measurable latency on its own. This also
> >> > directly contradicts the erofs experience that even went to a RT
> >> > thread to reduce the latency.
> >>
> >> I added this per Dave's feedback on v4, where he noted that XFS inodegc
> >> uses a delayed work item to avoid context switch storms. There's only a
> >> delay for the first bio in a batch to complete, as we only delay when the
> >> list is empty. I'll run some experiments and measure context switches,
> >> completion latency, etc. to see if this is necessary.
> >
> > The difference is that XFS inodegc is not latency bound. Most of the
> > time no one cares if it is delayed a bit; in the cases where someone
> > cares we explicitly flush the queues. I/O completion on the other hand
> > is something where users very much care about latency.
> >
>
> I ran some experiments with fio on both XFS and a raw block device. Five
> iterations each for 60s. Results below.
>
> TLDR: Removing the delay doesn't significantly decrease user-visible
> latency or otherwise improve performance, but does significantly reduce
> throughput and increase context switches in some workloads (e.g. C).
> I think it makes sense to leave the delay as-is. Thoughts?

Thanks for the tests! A couple of questions below:

> Results:
>
> Workloads (all `uncached=1`):
> A: rw=write bs=128k iodepth=1 ioengine=pvsync2 # XFS
> B: rw=write bs=128k iodepth=128 ioengine=io_uring # XFS
> C: rw=randwrite bs=4k iodepth=32 ioengine=io_uring # XFS
> D: rw=rw 50/50 bs=64k iodepth=32 ioengine=io_uring # XFS
> E: rw=write bs=128k iodepth=128 ioengine=io_uring # raw /dev/nvmeXn1
> F: rw=write bs=128k iodepth=128 numjobs=4
> + vm.dirty_bytes=64MB, vm.dirty_background_bytes=32MB # XFS
>
> Mean ± stddev across 5 iterations:
>
>   metric            delay=1           delay=0           delta
> --------------------------------------------------------------
>
> A seq 128k qd1
>   BW (MB/s)         4333 ± 27         4374 ± 34         +0.9%
>   p99 (us)          36.2 ± 0.8        35.8 ± 0.4        -1.1%
>   p999 (us)         3260 ± 75         3228 ± 29         -1.0%
>   ctx-switches      184 k ± 59 k      3.68 M ± 65 k     +1903%
>   cs / io           0.09 ± 0.03       1.86 ± 0.03       +1888%
>   avg bios/run      80.4 ± 0.6        1.1 ± 0.0         -98.7%

So a 1-jiffy delay is (with default HZ=1000) 1ms. That means for this load
the completion latency should be at least 1000us, but your results show a p99
latency of 36us. What am I missing?
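
To spell out the arithmetic behind that expectation, here is a minimal sketch.
The helper name `jiffy_delay_us` is made up for illustration (it is not a
kernel function), and the HZ values are just common config choices, not
something taken from the test setup. It computes only the nominal tick length
and compares it with the p99 reported for workload A with delay=1.

```python
def jiffy_delay_us(jiffies: int, hz: int) -> float:
    """Nominal duration of `jiffies` timer ticks in microseconds for a given HZ."""
    return jiffies * 1_000_000 / hz

# p99 for workload A with delay=1, copied from the quoted table.
P99_A_DELAY1_US = 36.2

# HZ values below are illustrative assumptions, not the tested configuration.
for hz in (100, 250, 1000):
    delay = jiffy_delay_us(1, hz)
    print(f"HZ={hz}: 1 jiffy = {delay:.0f} us, "
          f"{delay / P99_A_DELAY1_US:.0f}x the reported p99")
```

Even at HZ=1000 the nominal tick is well over an order of magnitude above the
reported p99, which is the discrepancy I am asking about.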

> B seq 128k qd128
>   BW (MB/s)         4393 ± 3.3        4311 ± 5.3        -1.9%
>   p99 (us)          8461 ± 73         8638 ± 105        +2.1%
>   p999 (us)         12465 ± 213       12386 ± 299       -0.6%
>   ctx-switches      6.90 M ± 186 k    9.72 M ± 184 k    +40.7%
>   cs / io           3.43 ± 0.10       4.92 ± 0.10       +43.4%
>   avg bios/run      51.9 ± 2.2        1.3 ± 0.0         -97.4%
>
> C rand 4k qd32
>   BW (MB/s)         66.2 ± 0.8        44.6 ± 7.4        -32.7%
>   p99 (us)          8002 ± 174        17990 ± 6800      +124.8%
>   p999 (us)         11390 ± 554       31890 ± 11076     +180.0%
>   ctx-switches      3.67 M ± 45 k     3.59 M ± 106 k    -2.2%
>   cs / io           3.78 ± 0.04       5.62 ± 0.83       +48.7%
>   avg bios/run      32.3 ± 1.0        3.1 ± 0.3         -90.5%

I'm somewhat surprised by how much larger the completion latency is here
without the delay. Is that due to contention on the local lock between the IO
completion interrupt and the worker? Or why is the completion latency so big
here when case B, with more IOs in flight and fewer bios per run, still had
significantly lower latency in the delay=0 case?
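
To put the two cases side by side, a small sketch of the comparison. It
assumes the delta column is the change of the delay=0 mean relative to the
delay=1 mean; the mail does not state the formula, but this reproduces the
reported p99 deltas. `relative_delta_pct` is a name made up for this example.

```python
def relative_delta_pct(delay1: float, delay0: float) -> float:
    """Percent change going from the delay=1 mean to the delay=0 mean."""
    return (delay0 - delay1) / delay1 * 100.0

# (delay=1 mean, delay=0 mean) p99 latencies in us, copied from the quoted tables.
p99_us = {
    "B seq 128k qd128": (8461, 8638),
    "C rand 4k qd32": (8002, 17990),
}

for name, (d1, d0) in p99_us.items():
    print(f"{name}: p99 {d1} -> {d0} us ({relative_delta_pct(d1, d0):+.1f}%)")
```

So B barely moves while C more than doubles, even though B batches fewer bios
per run in the delay=0 case.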

Honza

> D mixed 50/50 r/w 64k qd32
>   write BW (MB/s)   892.4 ± 20.9      925.3 ± 18.3      +3.7%
>   write p99 (us)    3562 ± 107        3601 ± 82         +1.1%
>   write p999 (us)   4673 ± 217        4647 ± 107        -0.6%
>   read BW (MB/s)    893.6 ± 20.8      926.6 ± 18.4      +3.7%
>   read p99 (us)     1003 ± 48         1035 ± 39         +3.2%
>   read p999 (us)    1545 ± 63         1476 ± 50         -4.5%
>   ctx-switches      5.15 M ± 75 k     5.79 M ± 230 k    +12.6%
>   cs / io           6.32 ± 0.15       6.85 ± 0.20       +8.5%
>   avg bios/run      23.9 ± 0.3        2.5 ± 0.0         -89.4%
>
> E raw 128k qd128
>   BW (MB/s)         1043 ± 1.0        1045 ± 0.5        +0.1%
>   p99 (us)          26922 ± 105       27027 ± 128       +0.4%
>   p999 (us)         37906 ± 4527      37408 ± 2464      -1.3%
>   ctx-switches      3.20 M ± 6 k      3.33 M ± 10 k     +3.8%
>   cs / io           6.71 ± 0.01       6.95 ± 0.02       +3.7%
>   avg bios/run      38.0 ± 0.1        32.0 ± 0.0        -15.6%
>
> F mem-pressure (dirty_bytes=64MB, 4 writers)
>   BW (MB/s)         4361 ± 24         4444 ± 40         +1.9%
>   p99 (us)          29439 ± 419       30173 ± 788       +2.5%
>   p999 (us)         35704 ± 1773      36648 ± 535       +2.6%
>   ctx-switches      20.8 M ± 1.6 M    27.1 M ± 1.4 M    +30.1%
>   cs / io           6.94 ± 0.49       8.87 ± 0.46       +27.8%
>   avg bios/run      23.6 ± 0.3        1.2 ± 0.0         -94.9%
--
Jan Kara <jack@xxxxxxxx>
SUSE Labs, CR