The way the on-stack plugging currently works, each nesting level
flushes its own list of I/Os. This can be less than optimal (read
awful) for certain workloads. For example, consider an application
that issues asynchronous O_DIRECT I/Os. It can send down a bunch of
I/Os together in a single io_submit call, only to have each of them
dispatched individually down in the bowells of the dirct I/O code.
The reason is that there are blk_plug's instantiated both at the upper
call site in do_io_submit and down in do_direct_IO. The latter will
submit as little as 1 I/O at a time (if you have a small enough I/O
size) instead of performing the batching that the plugging
infrastructure is supposed to provide.
Now, for the case where there is an elevator involved, this doesn't
really matter too much. The elevator will keep the I/O around long
enough for it to be merged. However, in cases where there is no
elevator (like blk-mq), I/Os are simply dispatched immediately.
Try this, for example (note I'm using a virtio-blk device, so it's
using the blk-mq single queue path, though I've also reproduced this
with the micron p320h):
fio --rw=read --bs=4k --iodepth=128 --iodepth_batch=16 --iodepth_batch_complete=16 --runtime=10s --direct=1 --filename=/dev/vdd --name=job1 --ioengine=libaio --time_based
If you run that on a current kernel, you will get zero merges. Zero!
After this patch, you will get many merges (the actual number depends
on how fast your storage is, obviously), and much better throughput.
Here are results from my test rig:
Unpatched kernel:
Read B/W: 283,638 KB/s
Read Merges: 0
Patched kernel:
Read B/W: 873,224 KB/s
Read Merges: 2,046K
I considered several approaches to solving the problem:
1) get rid of the inner-most plugs
2) handle nesting by using only one on-stack plug
2a) #2, except use a per-cpu blk_plug struct, which may clean up the
code a bit at the expense of memory footprint
Option 1 will be tricky or impossible to do, since inner most plug
lists are sometimes the only plug lists, depending on the call path.
Option 2 is what this patch implements. Option 2a is perhaps a better
idea, but since I already implemented option 2, I figured I'd post it
for comments and opinions before rewriting it.
Much of the patch involves modifying call sites to blk_start_plug,
since its signature is changed. The meat of the patch is actually
pretty simple and constrained to block/blk-core.c and
include/linux/blkdev.h. The only tricky bits were places where plugs
were finished and then restarted to flush out I/O. There, I went
ahead and exported blk_flush_plug_list and called that directly.
Comments would be greatly appreciated.