Re: [PATCH v6 1/4] block: add task-context bio completion infrastructure

From: Tal Zussman

Date: Mon Jun 22 2026 - 13:05:27 EST

On 6/18/26 10:26 AM, Jan Kara wrote:
> On Mon 01-06-26 13:04:41, Jan Kara wrote:
>> On Fri 29-05-26 16:46:15, Tal Zussman wrote:
>> > On 5/27/26 9:00 AM, Christoph Hellwig wrote:
>> > > On Wed, May 27, 2026 at 11:42:28AM +0200, Jan Kara wrote:
>> > >> > I ran some experiments with fio on both XFS and a raw block device. Five
>> > >> > iterations each for 60s. Results below.
>> > >> >
>> > >> > TLDR: Removing the delay doesn't significantly decrease user-visible
>> > >> > latency or otherwise improve performance, but does significantly reduce
>> > >> > throughput and increase context switches in some workloads (e.g. C).
>> > >> > I think it makes sense to leave the delay as-is. Thoughts?
>> > >>
>> > >> Thanks for the test! One question below:
>> > >
>> > > Thanks from me as well!
>> > >
>> > >>
>> > >> > Results:
>> > >> >
>> > >> > Workloads (all `uncached=1`):
>> > >> > A: rw=write bs=128k iodepth=1 ioengine=pvsync2 # XFS
>> > >> > B: rw=write bs=128k iodepth=128 ioengine=io_uring # XFS
>> > >> > C: rw=randwrite bs=4k iodepth=32 ioengine=io_uring # XFS
>> > >> > D: rw=rw 50/50 bs=64k iodepth=32 ioengine=io_uring # XFS
>> > >> > E: rw=write bs=128k iodepth=128 ioengine=io_uring # raw /dev/nvmeXn1
>> > >> > F: rw=write bs=128k iodepth=128 numjobs=4
>> > >> > + vm.dirty_bytes=64MB, vm.dirty_background_bytes=32MB # XFS
>> > >> >
>> > >> > Mean ± stddev across 5 iterations:
>> > >> >
>> > >> > metric          delay=1          delay=0          delta
>> > >> > --------------------------------------------------------------
>> > >> >
>> > >> > A seq 128k qd1
>> > >> >   BW (MB/s)     4333 ± 27        4374 ± 34        +0.9%
>> > >> >   p99 (us)      36.2 ± 0.8       35.8 ± 0.4       -1.1%
>> > >> >   p999 (us)     3260 ± 75        3228 ± 29        -1.0%
>> > >> >   ctx-switches  184 k ± 59 k     3.68 M ± 65 k    +1903%
>> > >> >   cs / io       0.09 ± 0.03      1.86 ± 0.03      +1888%
>> > >> >   avg bios/run  80.4 ± 0.6       1.1 ± 0.0        -98.7%
>> > >>
>> > >> So a 1 jiffy delay is (with default HZ=1000) 1ms. That means for this
>> > >> load the completion latency should be at least 1000us but your results
>> > >> show a p99 latency of 36us. What am I missing?
>> > >
>> > > Yes, this looks a bit odd. Unless there are multiple threads submitting
>> > > and somehow the completions get batched, this should complete one
>> > > bio at a time and be the worst case for the delay scheme.
>> >
>> > Sorry, I should've clarified - the latency here is the userspace-visible
>> > I/O completion latency (i.e. fio's clat value).
>> >
>> > I ran again and traced to get the actual time from __bio_complete_in_task()
>> > to calling ->bi_end_io(). The results match the 1 jiffy delay now:
>> >
>> > metric            delay=1   delay=0
>> >
>> > A seq 128k qd1
>> >   fio clat p99    38us      36us
>> >   bio cb p50      1.23ms    2.5us
>> >   bio cb p99      4.13ms    1.44ms
>> >   bio cb p999     5.01ms    2.63ms
>>
>> So I'm clearly missing something fundamental as I don't see how the
>> fio-reported IO completion time can be lower than the end_io callback
>> latency... Ahh, it is the strange meaning of clat in fio in combination
>> with the sync engine, where clat means: "how long after the syscall has
>> returned the data is ready". Which for the sync engine is immediately, so
>> the clat number is meaningless. I think reporting 'lat' numbers from fio
>> would make more sense but whatever.
>>
>> The bio cb latency indeed looks like what I'd roughly expect now. And
>> notice how the median latency of IO completion is 1.23ms in the delay=1
>> case and your throughput isn't abysmal only because writes end up
>> accumulating in the page cache and the writeback infrastructure ends up
>> submitting a lot of writeback IOs in parallel (you have ~80 bios to
>> complete per run, which amortizes the latency to a decent level).
>>
>> However, if you had IO using the BIO_COMPLETE_IN_TASK infrastructure
>> without so many IOs in flight (like direct IO with lower queue depth
>> which has to do extent conversion on completion), you would very much
>> see the latency hit on your throughput as well. In the extreme case of
>> qd=1 direct IO you'd reduce the throughput to ~4MB/s.
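
To make the arithmetic behind the ~4MB/s figure concrete, here is a rough
back-of-the-envelope sketch. It assumes HZ=1000, a 1 jiffy delay before every
completion, and 4 KiB I/Os (the block size is not stated above; 4 KiB is my
guess because it matches the quoted number). The function name is made up for
illustration, and device latency is ignored, so this is only an upper bound:

```python
def max_throughput_mb_s(qd, bs_bytes, hz=1000, delay_jiffies=1):
    """Upper bound on throughput in MB/s when every completion waits
    delay_jiffies before ->bi_end_io() runs and at most qd I/Os are in
    flight. Device and software latency are ignored (assumption)."""
    # Each in-flight slot can complete at most hz / delay_jiffies I/Os per second.
    ios_per_sec = qd * hz / delay_jiffies
    return ios_per_sec * bs_bytes / 1e6


if __name__ == "__main__":
    for bs in (4096, 128 * 1024):
        print(f"qd=1 bs={bs}: <= {max_throughput_mb_s(1, bs):.3f} MB/s")
```

With these assumptions qd=1 is capped at 1000 completions per second, so 4 KiB
I/Os top out around 4 MB/s and 128 KiB I/Os around 131 MB/s.
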
>>
>> Now I'm not saying the delay is bad - it is a tradeoff with clear wins in
>> CPU overhead, as your benchmarks show. I just wanted to point out there's
>> also the cost side, which your benchmarks don't show very clearly.
>> So we might need to keep some stats showing how many IO completions we are
>> offloading per second on each CPU and switch to delaying the work only once
>> it crosses a threshold like 1000000/HZ per second or so (so we at most
>> double the IO latency by delaying the end_io callback).
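
A toy userspace model of the threshold idea above, just to pin down the
decision logic. This is not code from the patchset: the class name, the
one-second accounting window, and using the previous window's count as the
rate estimate are all assumptions for illustration. Only the 1000000/HZ
threshold comes from the suggestion above.

```python
class OffloadRateTracker:
    """Toy model of one CPU's offloaded-completion rate (hypothetical)."""

    def __init__(self, hz=1000):
        # Threshold from the suggestion above: 1000000/HZ completions per second.
        self.threshold = 1_000_000 // hz
        self.window_start = 0.0  # start of the current one-second window
        self.count = 0           # completions seen in the current window
        self.last_rate = 0       # completions in the last full window

    def record(self, now):
        """Account one offloaded completion at time `now` (in seconds) and
        return the delay, in jiffies, to apply before running it."""
        elapsed = now - self.window_start
        if elapsed >= 1.0:
            # Close the window. If two or more windows went by, the most
            # recent full window saw no completions, so the rate is zero.
            self.last_rate = self.count if elapsed < 2.0 else 0
            self.window_start += int(elapsed)
            self.count = 0
        self.count += 1
        # Delay (and so batch) only when this CPU has been busy enough.
        return 1 if self.last_rate >= self.threshold else 0
```

In this model a CPU that offloaded 2000 completions in the previous second
gets the 1 jiffy delay, while a CPU seeing a handful per second runs its
completions without delay.
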
>
> Any progress here? The patchset looks really promising so I'd love to have
> it completed :)
>
Sorry for the delay - got caught up with some other work and had to set this
aside for a couple of weeks, but I haven't forgotten about it. Planning to
pick it back up some time this week.

Thanks,
Tal