Re: [PATCH v1 5/9] block: loop: convert to blk-mq

From: Maxim Patlasov
Date: Fri Aug 29 2014 - 07:14:57 EST


On 08/28/2014 06:06 AM, Ming Lei wrote:
> On 8/28/14, Maxim Patlasov <mpatlasov@xxxxxxxxxxxxx> wrote:
>> On 08/21/2014 09:44 AM, Ming Lei wrote:
>>> On Wed, Aug 20, 2014 at 4:50 AM, Jens Axboe <axboe@xxxxxxxxx> wrote:
>>>>
>>>> Reworked a bit more:
>>>>
>>>> http://git.kernel.dk/?p=linux-block.git;a=commit;h=a323185a761b9a54dc340d383695b4205ea258b6
>>>
>>> One big problem of the commit is that it is basically a serialized
>>> workqueue because of the single &hctx->run_work, and a per-request
>>> work_struct has to be used for a concurrent implementation. So it
>>> looks like the approach isn't flexible enough compared with doing
>>> that in the driver; any idea about how to fix that?

>> I'm interested in the overall price of handling requests in a
>> separate thread. I used the following fio script:
>>
>> [global]
>> direct=1
>> bsrange=512-512
>> timeout=10
>> numjobs=1
>> ioengine=sync
>>
>> filename=/dev/loop0 # or /dev/nullb0
>>
>> [f1]
>> rw=randwrite
>>
>> to compare the performance of:
>>
>> 1) /dev/loop0 of 3.17.0-rc1 with Ming's patches applied -- 11K iops
> If you enable BLK_MQ_F_WQ_CONTEXT, it isn't surprising to see this
> result, since blk-mq implements a serialized workqueue.

BLK_MQ_F_WQ_CONTEXT is not in 3.17.0-rc1, so I couldn't enable it.
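
For what it's worth, the per-request work_struct you mention above could
look roughly like the sketch below. Untested, only to illustrate the idea,
and assuming the per-request driver data is allocated via tag_set->cmd_size
(blk_mq_end_io() is the completion helper as named in 3.17):

#include <linux/blk-mq.h>
#include <linux/blkdev.h>
#include <linux/workqueue.h>

struct loop_cmd {                       /* per-request driver data (cmd_size) */
        struct work_struct work;
        struct request *rq;
};

static void loop_cmd_work(struct work_struct *work)
{
        struct loop_cmd *cmd = container_of(work, struct loop_cmd, work);

        /* ... do the backing-file I/O for cmd->rq here ... */
        blk_mq_end_io(cmd->rq, 0);
}

static int loop_queue_rq(struct blk_mq_hw_ctx *hctx, struct request *rq)
{
        struct loop_cmd *cmd = blk_mq_rq_to_pdu(rq);

        INIT_WORK(&cmd->work, loop_cmd_work);
        cmd->rq = rq;
        queue_work(system_unbound_wq, &cmd->work);  /* per-request, not serialized */
        return BLK_MQ_RQ_QUEUE_OK;
}

That keeps requests concurrent, but every request still pays for a context
switch, which is exactly the overhead measured below.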


>> 2) the same as above, but call loop_queue_work() directly from
>> loop_queue_rq() -- 270K iops
>> 3) /dev/nullb0 of 3.17.0-rc1 -- 380K iops
> In my recent investigation and discussion with Jens, using a workqueue
> may introduce some regression for cases like loop over null_blk or tmpfs.
>
> And 270K vs. 380K is similar to my result, and it was observed that
> context switches increased by more than 50% after introducing the workqueue.

The figures are similar, but the comparison is not: both 270K and 380K refer to configurations where no extra context switch is involved.
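
Roughly, the change in 2) was along these lines (a sketch, not the actual
patch; loop_handle_rq() stands in for the body of loop_queue_work(), and
blocking in ->queue_rq() is not generally safe, but it was good enough for
the measurement):

#include <linux/blk-mq.h>
#include <linux/blkdev.h>

/* Stand-in for the body of loop_queue_work(): read/write the backing file. */
static void loop_handle_rq(struct request *rq)
{
        /* ... backing-file I/O ... */
}

static int loop_queue_rq(struct blk_mq_hw_ctx *hctx, struct request *rq)
{
        loop_handle_rq(rq);             /* synchronous: no work item, no switch */
        blk_mq_end_io(rq, 0);
        return BLK_MQ_RQ_QUEUE_OK;
}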


> I will post V3, which will use the previous kthread together with blk-mq
> and kernel aio; it should make full use of blk-mq and kernel aio, and
> won't introduce a regression for cases like the above.

That would be great!
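
Just to be sure I read the plan correctly, I assume the model is roughly
the one below: ->queue_rq() only queues the request and wakes the kthread,
which then submits the backing-file I/O (kernel aio in your series). This
is an illustrative sketch only; the names are mine and the setup/teardown
(kthread_run(), INIT_LIST_HEAD(), ...) is omitted:

#include <linux/blk-mq.h>
#include <linux/kthread.h>
#include <linux/list.h>
#include <linux/spinlock.h>
#include <linux/wait.h>

struct loop_dev {
        spinlock_t              lo_lock;
        struct list_head        lo_list;        /* pending requests */
        wait_queue_head_t       lo_waitq;
        struct task_struct      *lo_thread;
};

static int loop_queue_rq(struct blk_mq_hw_ctx *hctx, struct request *rq)
{
        struct loop_dev *lo = hctx->queue->queuedata;   /* set at device init */

        spin_lock_irq(&lo->lo_lock);
        list_add_tail(&rq->queuelist, &lo->lo_list);
        spin_unlock_irq(&lo->lo_lock);

        wake_up(&lo->lo_waitq);                 /* hand off to the kthread */
        return BLK_MQ_RQ_QUEUE_OK;
}

static int loop_thread(void *data)
{
        struct loop_dev *lo = data;

        while (!kthread_should_stop()) {
                struct request *rq = NULL;

                wait_event_interruptible(lo->lo_waitq,
                                !list_empty(&lo->lo_list) ||
                                kthread_should_stop());

                spin_lock_irq(&lo->lo_lock);
                if (!list_empty(&lo->lo_list)) {
                        rq = list_first_entry(&lo->lo_list,
                                              struct request, queuelist);
                        list_del_init(&rq->queuelist);
                }
                spin_unlock_irq(&lo->lo_lock);

                if (rq) {
                        /* complete immediately here; the real driver would
                         * first submit the backing-file I/O (kernel aio) */
                        blk_mq_end_io(rq, 0);
                }
        }
        return 0;
}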


>> Taking into account such a big difference (11K vs. 270K), would it be
>> worthwhile to implement a purely non-blocking version of aio_kernel_submit(),
>> returning an error if blocking is needed? Then the loop driver (or any other
>> in-kernel user)
> The kernel aio submit is very similar to the user-space implementation,
> except for the block plug/unplug usage in the user-space aio submit path.
>
> If it blocks in aio_kernel_submit(), you should observe a similar thing
> with io_submit() too.

Yes, I agree. My point was that there is room for optimization, as my experiments demonstrate. The question is whether it's worth complicating kernel aio (and fs-specific code too) for the sake of that optimization.
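
For illustration, the loop driver side of such an optimization could be as
small as the following. Purely hypothetical -- neither helper exists in
your series; the point is only the -EAGAIN fast path:

#include <linux/blkdev.h>
#include <linux/errno.h>

struct loop_dev;        /* per-device state, details omitted */

/* Hypothetical: issue the backing-file aio; -EAGAIN if it would sleep. */
static int loop_aio_submit_nonblock(struct loop_dev *lo, struct request *rq)
{
        /* ... would call a non-blocking aio_kernel_submit() variant ... */
        return -EAGAIN;
}

/* Slow path: hand the request over to the per-device kthread. */
static void loop_defer_to_thread(struct loop_dev *lo, struct request *rq)
{
        /* ... list_add_tail() + wake_up(), as in the kthread model ... */
}

static void loop_submit_rq(struct loop_dev *lo, struct request *rq)
{
        /* Fast path: submit from the caller's context, no context switch. */
        if (loop_aio_submit_nonblock(lo, rq) != -EAGAIN)
                return;

        /* The submission would block, so punt to the kthread. */
        loop_defer_to_thread(lo, rq);
}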

In fact, in a simple case like a block fs on top of a loopback device on top of a file on another block fs, what kernel aio does for the loopback driver is a subtle way of converting incoming bios to outgoing bios. If you know where the image file is placed (e.g. via fiemap), such a conversion can be done with zero overhead, and anything that makes the overhead noticeable is suspicious. And it is easy to imagine other use cases where that extra context switch is avoidable.
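
For example, given a cached extent map built from fiemap data (and pinned
so the file layout cannot change underneath us), the conversion is little
more than retargeting the bio, dm-linear style. Sketch only; a real
implementation would also have to split bios crossing extent boundaries
and handle holes and unwritten extents:

#include <linux/bio.h>
#include <linux/blkdev.h>

struct loop_extent {                    /* one cached mapping of the image file */
        sector_t start;                 /* sector offset within the loop device */
        sector_t disk_start;            /* matching sector on the lower device */
        struct block_device *bdev;      /* device backing the image file's fs */
};

static void loop_remap_bio(struct bio *bio, struct loop_extent *ext)
{
        /* Retarget the bio at the lower device: no data copy, no kthread. */
        bio->bi_iter.bi_sector = ext->disk_start +
                                 (bio->bi_iter.bi_sector - ext->start);
        bio->bi_bdev = ext->bdev;
        generic_make_request(bio);      /* resubmit to the lower queue */
}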

Thanks,
Maxim