Re: [PATCH v6 1/4] block: add task-context bio completion infrastructure

From: Tal Zussman

Date: Fri May 29 2026 - 16:46:50 EST

On 5/27/26 9:00 AM, Christoph Hellwig wrote:
> On Wed, May 27, 2026 at 11:42:28AM +0200, Jan Kara wrote:
>> > I ran some experiments with fio on both XFS and a raw block device. Five
>> > iterations each for 60s. Results below.
>> >
>> > TLDR: Removing the delay doesn't significantly decrease user-visible
>> > latency or otherwise improve performance, but does significantly reduce
>> > throughput and increase context switches in some workloads (e.g. C).
>> > I think it makes sense to leave the delay as-is. Thoughts?
>>
>> Thanks for the test! One question below:
>
> Thanks from me as well!
>
>>
>> > Results:
>> >
>> > Workloads (all `uncached=1`):
>> > A: rw=write      bs=128k iodepth=1   ioengine=pvsync2   # XFS
>> > B: rw=write      bs=128k iodepth=128 ioengine=io_uring  # XFS
>> > C: rw=randwrite  bs=4k   iodepth=32  ioengine=io_uring  # XFS
>> > D: rw=rw 50/50   bs=64k  iodepth=32  ioengine=io_uring  # XFS
>> > E: rw=write      bs=128k iodepth=128 ioengine=io_uring  # raw /dev/nvmeXn1
>> > F: rw=write      bs=128k iodepth=128 numjobs=4
>> >    + vm.dirty_bytes=64MB, vm.dirty_background_bytes=32MB  # XFS
>> >
>> > Mean ± stddev across 5 iterations:
>> >
>> >   metric          delay=1          delay=0          delta
>> > --------------------------------------------------------------
>> >
>> > A seq 128k qd1
>> >   BW (MB/s)       4333 ± 27        4374 ± 34        +0.9%
>> >   p99 (us)        36.2 ± 0.8       35.8 ± 0.4       -1.1%
>> >   p999 (us)       3260 ± 75        3228 ± 29        -1.0%
>> >   ctx-switches    184 k ± 59 k     3.68 M ± 65 k    +1903%
>> >   cs / io         0.09 ± 0.03      1.86 ± 0.03      +1888%
>> >   avg bios/run    80.4 ± 0.6       1.1 ± 0.0        -98.7%
>>
>> So 1 jiffie delay is (with default HZ=1000) 1ms. That means for this load
>> the completion latency should be at least 1000us but your results show p99
>> latency of 36. What am I missing?
>
> Yes, this looks a bit odd. Unless there's multiple threads submitting
> and somehow the completions get batched this should complete one
> bio at a time and be the worst case for the delay scheme.

Sorry, I should've clarified: the latency here is the userspace-visible
I/O completion latency (i.e. fio's clat value).

I reran the tests with tracing to measure the actual time from
__bio_complete_in_task() to the ->bi_end_io() call ("bio cb" below). The
results now match the 1 jiffy delay:

  metric            delay=1     delay=0

A seq 128k qd1
  fio clat p99      38us        36us
  bio cb p50        1.23ms      2.5us
  bio cb p99        4.13ms      1.44ms
  bio cb p999       5.01ms      2.63ms

B seq 128k qd128
  fio clat p99      8.74ms      8.85ms
  bio cb p50        1.27ms      3.1us
  bio cb p99        4.05ms      2.27ms
  bio cb p999       4.91ms      2.77ms

C rand 4k qd32
  fio clat p99      8.16ms      8.11ms
  bio cb p50        1.09ms      97.7us
  bio cb p99        3.73ms      2.06ms
  bio cb p999       11.87ms     3.79ms

D mixed 64k qd32
  fio clat p99      981us       1.03ms
  bio cb p50        1.14ms      39.5us
  bio cb p99        2.83ms      275us
  bio cb p999       3.06ms      595us

E raw 128k qd128
  fio clat p99      26.97ms     27.34ms
  bio cb p50        1.58ms      41.5us
  bio cb p99        2.98ms      325us
  bio cb p999       3.02ms      575us

F mem-pressure
  fio clat p99      29.75ms     30.43ms
  bio cb p50        1.32ms      2.5us
  bio cb p99        3.73ms      2.48ms
  bio cb p999       4.62ms      2.83ms
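
As a quick sanity check on the table above, the sketch below converts the
"bio cb p50" values to microseconds and expresses them in jiffies. It assumes
HZ=1000 (one jiffy = 1000us), as in the question quoted earlier; the helper
names to_us() and p50_vs_jiffy() and the dictionary layout are illustrative
only and are not part of any tool used for the measurements.

```python
# "bio cb p50" values copied from the table above: (delay=1, delay=0).
P50 = {
    "A": ("1.23ms", "2.5us"),
    "B": ("1.27ms", "3.1us"),
    "C": ("1.09ms", "97.7us"),
    "D": ("1.14ms", "39.5us"),
    "E": ("1.58ms", "41.5us"),
    "F": ("1.32ms", "2.5us"),
}

HZ = 1000                   # assumption: HZ=1000, as discussed in the thread
JIFFY_US = 1_000_000 / HZ   # length of one jiffy in microseconds


def to_us(value: str) -> float:
    """Convert a string such as '1.23ms' or '97.7us' to microseconds."""
    if value.endswith("ms"):
        return float(value[:-2]) * 1000.0
    if value.endswith("us"):
        return float(value[:-2])
    raise ValueError(f"unknown unit in {value!r}")


def p50_vs_jiffy(table):
    """Return {workload: (delay=1 p50 in jiffies, delay=0 p50 in jiffies)}."""
    return {
        name: (to_us(d1) / JIFFY_US, to_us(d0) / JIFFY_US)
        for name, (d1, d0) in table.items()
    }


if __name__ == "__main__":
    for name, (d1, d0) in p50_vs_jiffy(P50).items():
        print(f"{name}: delay=1 p50 = {d1:.2f} jiffies, "
              f"delay=0 p50 = {d0:.4f} jiffies")
```

Under that HZ assumption, every delay=1 median lands between 1 and 2 jiffies,
while every delay=0 median is below a tenth of a jiffy.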

Note that the workload C degradation didn't reproduce as strongly in the run
above. Bandwidth still drops from 64.5 MB/s with delay=1 to 54.9 MB/s with
delay=0 (about -15%), but that is much smaller than the -32.7% in the original
results. I ran it several more times and ran into the degradation in ~20% of
runs. Without batching, the completion kworker fires for nearly every bio,
leading to heavier preemption when a writer is placed on a CPU that receives
many completion IRQs. The degradation seems to occur when the writers are
migrated less often, leading to more preemption, but I haven't dug into why
the scheduler chooses to migrate more in some runs than in others. When
pinning to 16 cores, the difference between delay=0 and delay=1 goes away.
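
To illustrate the batching effect described above, here is a toy model. It is
not the patch's code and makes strong assumptions: the worker is armed when
the first bio is queued, runs exactly delay_us later, and drains everything
queued by then. The function name simulate_batches() and the evenly spaced
arrival times are hypothetical.

```python
def simulate_batches(arrivals_us, delay_us):
    """Return the number of bios handled by each worker run.

    arrivals_us: sorted completion timestamps in microseconds.
    delay_us:    time between arming the worker and the worker running.
    """
    batches = []
    i = 0
    n = len(arrivals_us)
    while i < n:
        # The first pending bio arms the worker.
        fire_at = arrivals_us[i] + delay_us
        # Everything queued up to the moment the worker runs joins the batch.
        j = i
        while j < n and arrivals_us[j] <= fire_at:
            j += 1
        batches.append(j - i)
        i = j
    return batches


if __name__ == "__main__":
    # Hypothetical load: one completion every 100us for 10ms.
    arrivals = [i * 100 for i in range(100)]
    for delay in (1000, 0):
        runs = simulate_batches(arrivals, delay)
        print(f"delay={delay}us: {len(runs)} worker runs, "
              f"{sum(runs) / len(runs):.1f} bios/run")
```

In this model, a 1000us delay handles the 100 completions in 10 worker runs,
while a zero delay needs one run per completion. Real batch sizes also depend
on scheduling and IRQ placement, which the model ignores.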

Workload C also seems to suffer more than the others because it does random
writes to a sparse file: each bio goes through the IOMAP_IOEND_UNWRITTEN path,
so the completion path is heavier and steals more CPU time from the writing
threads.

>> > C rand 4k qd32
>> >   BW (MB/s)       66.2 ± 0.8       44.6 ± 7.4       -32.7%
>> >   p99 (us)        8002 ± 174       17990 ± 6800     +124.8%
>> >   p999 (us)       11390 ± 554      31890 ± 11076    +180.0%
>> >   ctx-switches    3.67 M ± 45 k    3.59 M ± 106 k   -2.2%
>> >   cs / io         3.78 ± 0.04      5.62 ± 0.83      +48.7%
>> >   avg bios/run    32.3 ± 1.0       3.1 ± 0.3        -90.5%
>>
>> I'm somewhat surprised how much larger the completion latency is here without
>> the delay. Is that due to a contention on local lock between the IO completion
>> interrupt and the worker? Or why is the completion latency so big here when
>> the case B with more IOs in flight, less bios per run, still had significantly
>> lower latency in the delay=0 case?
>
> Note that in the past we had major problems with workqueue scheduling
> latency. At some point these got mitigated a lot, but if they are back
> for this workload that might be one reason.
>