Re: [PATCH v6 1/4] block: add task-context bio completion infrastructure

From: Tal Zussman

Date: Tue May 26 2026 - 15:29:54 EST

On 5/25/26 1:17 AM, Christoph Hellwig wrote:
> On Fri, May 22, 2026 at 06:47:43PM -0400, Tal Zussman wrote:
>> > But this 1-jiffy delay also means we unconditionally increase
>> > completion latency, which feels like a bad idea. Do you have any
>> > measurements that show where it does benefit? Note that queueing work
>> > already often has very measurable latency on its own. This also
>> > directly contradicts the erofs experience that even went to an RT
>> > thread to reduce the latency.
>>
>> I added this per Dave's feedback on v4, where he noted that XFS inodegc
>> uses a delayed work item to avoid context switch storms. There's only a
>> delay for the first bio in a batch to complete, as we only delay when the
>> list is empty. I'll run some experiments and measure context switches,
>> completion latency, etc. to see if this is necessary.
>
> The difference is that XFS inodegc is not latency bound. Most of the
> time no one cares if it is delayed a bit; in the cases where someone
> cares we explicitly flush the queues. I/O completion, on the other hand,
> is something where users very much care about latency.
>

I ran some experiments with fio on both XFS and a raw block device, with
five 60s iterations per workload. Results below.

TL;DR: Removing the delay doesn't significantly decrease user-visible
latency or otherwise improve performance (any gains are within a few
percent), but it does significantly reduce throughput and raise tail
latency in workload C, and it increases context switches in most
workloads (most notably A, B, and F). I think it makes sense to leave
the delay as-is. Thoughts?

Results:

Workloads (all `uncached=1`):
A: rw=write bs=128k iodepth=1 ioengine=pvsync2 # XFS
B: rw=write bs=128k iodepth=128 ioengine=io_uring # XFS
C: rw=randwrite bs=4k iodepth=32 ioengine=io_uring # XFS
D: rw=rw 50/50 bs=64k iodepth=32 ioengine=io_uring # XFS
E: rw=write bs=128k iodepth=128 ioengine=io_uring # raw /dev/nvmeXn1
F: rw=write bs=128k iodepth=128 numjobs=4
   + vm.dirty_bytes=64MB, vm.dirty_background_bytes=32MB # XFS

Mean ± stddev across 5 iterations:

  metric          delay=1           delay=0           delta
--------------------------------------------------------------

A seq 128k qd1
  BW (MB/s)       4333 ± 27         4374 ± 34         +0.9%
  p99 (us)        36.2 ± 0.8        35.8 ± 0.4        -1.1%
  p999 (us)       3260 ± 75         3228 ± 29         -1.0%
  ctx-switches    184 k ± 59 k      3.68 M ± 65 k     +1903%
  cs / io         0.09 ± 0.03       1.86 ± 0.03       +1888%
  avg bios/run    80.4 ± 0.6        1.1 ± 0.0         -98.7%

B seq 128k qd128
  BW (MB/s)       4393 ± 3.3        4311 ± 5.3        -1.9%
  p99 (us)        8461 ± 73         8638 ± 105        +2.1%
  p999 (us)       12465 ± 213       12386 ± 299       -0.6%
  ctx-switches    6.90 M ± 186 k    9.72 M ± 184 k    +40.7%
  cs / io         3.43 ± 0.10       4.92 ± 0.10       +43.4%
  avg bios/run    51.9 ± 2.2        1.3 ± 0.0         -97.4%

C rand 4k qd32
  BW (MB/s)       66.2 ± 0.8        44.6 ± 7.4        -32.7%
  p99 (us)        8002 ± 174        17990 ± 6800      +124.8%
  p999 (us)       11390 ± 554       31890 ± 11076     +180.0%
  ctx-switches    3.67 M ± 45 k     3.59 M ± 106 k    -2.2%
  cs / io         3.78 ± 0.04       5.62 ± 0.83       +48.7%
  avg bios/run    32.3 ± 1.0        3.1 ± 0.3         -90.5%

D mixed 50/50 r/w 64k qd32
  write BW (MB/s) 892.4 ± 20.9      925.3 ± 18.3      +3.7%
  write p99 (us)  3562 ± 107        3601 ± 82         +1.1%
  write p999 (us) 4673 ± 217        4647 ± 107        -0.6%
  read BW (MB/s)  893.6 ± 20.8      926.6 ± 18.4      +3.7%
  read p99 (us)   1003 ± 48         1035 ± 39         +3.2%
  read p999 (us)  1545 ± 63         1476 ± 50         -4.5%
  ctx-switches    5.15 M ± 75 k     5.79 M ± 230 k    +12.6%
  cs / io         6.32 ± 0.15       6.85 ± 0.20       +8.5%
  avg bios/run    23.9 ± 0.3        2.5 ± 0.0         -89.4%

E raw 128k qd128
  BW (MB/s)       1043 ± 1.0        1045 ± 0.5        +0.1%
  p99 (us)        26922 ± 105       27027 ± 128       +0.4%
  p999 (us)       37906 ± 4527      37408 ± 2464      -1.3%
  ctx-switches    3.20 M ± 6 k      3.33 M ± 10 k     +3.8%
  cs / io         6.71 ± 0.01       6.95 ± 0.02       +3.7%
  avg bios/run    38.0 ± 0.1        32.0 ± 0.0        -15.6%

F mem-pressure (dirty_bytes=64MB, 4 writers)
  BW (MB/s)       4361 ± 24         4444 ± 40         +1.9%
  p99 (us)        29439 ± 419       30173 ± 788       +2.5%
  p999 (us)       35704 ± 1773      36648 ± 535       +2.6%
  ctx-switches    20.8 M ± 1.6 M    27.1 M ± 1.4 M    +30.1%
  cs / io         6.94 ± 0.49       8.87 ± 0.46       +27.8%
  avg bios/run    23.6 ± 0.3        1.2 ± 0.0         -94.9%