Re: [PATCH 2/2] blk-plug: don't flush nested plug lists

From: Jens Axboe
Date: Wed Apr 08 2015 - 13:38:37 EST


On 04/06/2015 01:14 PM, Jeff Moyer wrote:
The way the on-stack plugging currently works, each nesting level
flushes its own list of I/Os. This can be less than optimal (read:
awful) for certain workloads. For example, consider an application
that issues asynchronous O_DIRECT I/Os. It can send down a bunch of
I/Os together in a single io_submit call, only to have each of them
dispatched individually down in the bowels of the direct I/O code.
The reason is that there are blk_plug's instantiated both at the upper
call site in do_io_submit and down in do_direct_IO. The latter will
submit as few as one I/O at a time (if you have a small enough I/O
size) instead of performing the batching that the plugging
infrastructure is supposed to provide.
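
To make the shape of the problem concrete, here's a simplified sketch
of that nesting (not the actual call chain, which runs through
io_submit_one and the direct I/O machinery; variable names are mine):

	struct blk_plug outer_plug;

	blk_start_plug(&outer_plug);		/* plug in do_io_submit */
	for (i = 0; i < nr; i++) {
		struct blk_plug inner_plug;

		blk_start_plug(&inner_plug);	/* plug down in do_direct_IO */
		/* ... map and submit the bio(s) for iocb i ... */
		blk_finish_plug(&inner_plug);	/* each level flushes its own
						   list, before a batch forms */
	}
	blk_finish_plug(&outer_plug);		/* little or nothing left here */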

Now, for the case where there is an elevator involved, this doesn't
really matter too much. The elevator will keep the I/O around long
enough for it to be merged. However, in cases where there is no
elevator (like blk-mq), I/Os are simply dispatched immediately.

Try this, for example (note I'm using a virtio-blk device, so it's
using the blk-mq single queue path, though I've also reproduced this
with the micron p320h):

fio --rw=read --bs=4k --iodepth=128 --iodepth_batch=16 --iodepth_batch_complete=16 --runtime=10s --direct=1 --filename=/dev/vdd --name=job1 --ioengine=libaio --time_based

If you run that on a current kernel, you will get zero merges. Zero!
After this patch, you will get many merges (the actual number depends
on how fast your storage is, obviously), and much better throughput.
Here are results from my test rig:

Unpatched kernel:
Read B/W: 283,638 KB/s
Read Merges: 0

Patched kernel:
Read B/W: 873,224 KB/s
Read Merges: 2,046K

I considered several approaches to solving the problem:
1) get rid of the innermost plugs
2) handle nesting by using only one on-stack plug
2a) #2, except use a per-cpu blk_plug struct, which may clean up the
code a bit at the expense of memory footprint

Option 1 will be tricky or impossible to do, since the innermost plug
lists are sometimes the only plug lists, depending on the call path.
Option 2 is what this patch implements. Option 2a is perhaps a better
idea, but since I already implemented option 2, I figured I'd post it
for comments and opinions before rewriting it.
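
To illustrate the idea behind option 2, the nesting-aware logic would
look roughly like this (just a sketch of the approach, not the patch
itself):

	/* on start: a nested plug just defers to the existing one */
	if (!current->plug)
		current->plug = plug;	/* outermost plug owns the list */

	/* on finish: only the outermost plug actually flushes */
	if (plug == current->plug) {
		blk_flush_plug_list(plug, false);
		current->plug = NULL;
	}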

Much of the patch involves modifying the callers of blk_start_plug,
since its signature has changed. The meat of the patch is actually
pretty simple and constrained to block/blk-core.c and
include/linux/blkdev.h. The only tricky bits were places where plugs
were finished and then restarted to flush out I/O. There, I went
ahead and exported blk_flush_plug_list and called that directly.
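
For instance, a call site that previously finished and immediately
restarted its plug just to push pending I/O out, i.e. something like

	blk_finish_plug(&plug);
	blk_start_plug(&plug);

now calls the exported flush directly instead (a sketch; the exact
arguments vary per call site):

	blk_flush_plug_list(&plug, false);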

Comments would be greatly appreciated.

It's hard to argue with the increased merging for your case. The task plugs originally did work the way you've changed them to, not flushing until the outermost plug was finished. Unfortunately, I don't quite remember why I changed that; I'll have to do a bit of digging to refresh my memory.

For cases where we don't do any merging (like nvme), we always want to flush. Well, almost: if we start to utilize batched submission, then the plug would still potentially help (just for reasons other than merging).

And I agree with Ming, this can be cleaned up substantially. I'd also like to see some test results from the other end of the spectrum. Your posted case is clearly the best case (we missed tons of merging before, now we don't); I'd like to see a normal case and a worst case result as well, so we have an idea of what this would do to latencies.

--
Jens Axboe
