Re: [Xen-devel] [PATCH RFC v2 0/5] Multi-queue support for xen-blkfront and xen-blkback

From: Jens Axboe
Date: Tue Aug 11 2015 - 13:33:11 EST

On 08/11/2015 03:45 AM, Rafal Mielniczuk wrote:
On 11/08/15 07:08, Bob Liu wrote:
On 08/10/2015 11:52 PM, Jens Axboe wrote:
On 08/10/2015 05:03 AM, Rafal Mielniczuk wrote:
On 01/07/15 04:03, Jens Axboe wrote:
On 06/30/2015 08:21 AM, Marcus Granado wrote:

Our measurements for the multiqueue patch indicate a clear improvement
in iops when more queues are used.

The measurements were obtained under the following conditions:

- using blkback as the dom0 backend with the multiqueue patch applied to
a dom0 kernel 4.0 on 8 vcpus.

- using a recent Ubuntu 15.04 kernel 3.19 with multiqueue frontend
applied to be used as a guest on 4 vcpus

- using a micron RealSSD P320h as the underlying local storage on a Dell
PowerEdge R720 with 2 Xeon E5-2643 v2 cpus.

- fio 2.2.7-22-g36870 as the generator of synthetic loads in the guest.
We used direct_io to skip caching in the guest and ran fio for 60s
reading a number of block sizes ranging from 512 bytes to 4MiB. Queue
depth of 32 for each queue was used to saturate individual vcpus in the
guest.

We were interested in observing storage iops for different values of
block sizes. Our expectation was that iops would improve when increasing
the number of queues, because both the guest and dom0 would be able to
make use of more vcpus to handle these requests.

These are the results (as aggregate iops for all the fio threads) that
we got for the conditions above with sequential reads:

fio_threads  io_depth  block_size  1-queue_iops  8-queue_iops
8            32        512         158K          264K
8            32        1K          157K          260K
8            32        2K          157K          258K
8            32        4K          148K          257K
8            32        8K          124K          207K
8            32        16K         84K           105K
8            32        32K         50K           54K
8            32        64K         24K           27K
8            32        128K        11K           13K
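
To make the size of the improvement easier to read off, the ratio of 8-queue to 1-queue iops can be computed per block size. This is only a small sketch over the numbers in the table above; the names `results` and `speedup` are made up for illustration.

```python
# IOPS values (in thousands) copied from the table above:
# block size -> (1-queue iops, 8-queue iops)
results = {
    "512": (158, 264),
    "1K": (157, 260),
    "2K": (157, 258),
    "4K": (148, 257),
    "8K": (124, 207),
    "16K": (84, 105),
    "32K": (50, 54),
    "64K": (24, 27),
    "128K": (11, 13),
}


def speedup(one_queue, eight_queue):
    """Return the 8-queue / 1-queue IOPS ratio."""
    return eight_queue / one_queue


speedups = {bs: speedup(*iops) for bs, iops in results.items()}

for bs, ratio in speedups.items():
    print(f"{bs:>5}: {ratio:.2f}x")
```

With these numbers the ratio is roughly 1.7x for block sizes up to 8K and falls to between about 1.1x and 1.25x from 16K upward.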

8-queue iops was better than single queue iops for all the block sizes.
There were very good improvements as well for sequential writes with
block size 4K (from 80K iops with single queue to 230K iops with 8
queues), and no regressions were visible in any measurement performed.

Great results! And I don't know why this code has lingered for so long,
so thanks for helping get some attention to this again.

Personally I'd be really interested in the results for the same set of
tests, but without the blk-mq patches. Do you have them, or could you
potentially run them?


We reran the tests for sequential reads with identical settings, but with Bob Liu's multiqueue patches reverted from the dom0 and guest kernels.
The results we obtained were *better* than the results we got with the multiqueue patches applied:

fio_threads  io_depth  block_size  1-queue_iops  8-queue_iops  *no-mq-patches_iops*
8            32        512         158K          264K          321K
8            32        1K          157K          260K          328K
8            32        2K          157K          258K          336K
8            32        4K          148K          257K          308K
8            32        8K          124K          207K          188K
8            32        16K         84K           105K          82K
8            32        32K         50K           54K           36K
8            32        64K         24K           27K           16K
8            32        128K        11K           13K           11K
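
A quick way to see where the unpatched kernels come out ahead is to compare the last two columns row by row. The sketch below only restates the table above; `rows` and `relative_change` are illustrative names, not part of any tool used in the tests.

```python
# (block size, 1-queue, 8-queue, no-mq-patches) in thousands of IOPS,
# copied from the table above.
rows = [
    ("512", 158, 264, 321),
    ("1K", 157, 260, 328),
    ("2K", 157, 258, 336),
    ("4K", 148, 257, 308),
    ("8K", 124, 207, 188),
    ("16K", 84, 105, 82),
    ("32K", 50, 54, 36),
    ("64K", 24, 27, 16),
    ("128K", 11, 13, 11),
]


def relative_change(eight_queue, no_mq):
    """Percent change of the no-mq result relative to the 8-queue result."""
    return (no_mq - eight_queue) / eight_queue * 100.0


# Block sizes where the kernels without the mq patches were faster.
no_mq_wins = [bs for bs, _, q8, no_mq in rows if no_mq > q8]

for bs, _, q8, no_mq in rows:
    print(f"{bs:>5}: {relative_change(q8, no_mq):+.1f}% vs 8 queues")
```

On these figures the unpatched kernels are ahead for 512 bytes through 4K, while the 8-queue configuration is ahead from 8K upward.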

We noticed that the requests are not merged by the guest when the multiqueue patches are applied,
which results in a regression for small block sizes (RealSSD P320h's optimal block size is around 32-64KB).

We observed a similar regression for the Dell MZ-5EA1000-0D3 100 GB 2.5" Internal SSD.

As I understand it, the blk-mq layer bypasses the I/O scheduler, which also effectively disables merges.
Could you explain why it is difficult to enable merging in the blk-mq layer?
That could help close the performance gap we observed.

Otherwise, the tests show that the multiqueue patches do not improve performance,
at least when it comes to sequential read/write operations.

blk-mq still provides merging; there should be no difference there. Do the xen patches set BLK_MQ_F_SHOULD_MERGE?

Is it possible that the xen-blkfront driver dequeues requests too fast once we have multiple hardware queues?
In that case new requests would have no chance to merge with old requests that were already dequeued and issued.

For some reason we don't see merges even when we set multiqueue to 1.
Below are some stats from the guest system when doing sequential 4KB reads:

$ fio --name=test --ioengine=libaio --direct=1 --rw=read --numjobs=8 \
      --iodepth=32 --time_based=1 --runtime=300 --bs=4KB

$ iostat -xt 5 /dev/xvdb
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.50    0.00    2.73   85.14    2.00    9.63

Device:  rrqm/s  wrqm/s        r/s   w/s      rkB/s  wkB/s
xvdb       0.00    0.00  156926.00  0.00  627704.00   0.00

         avgrq-sz  avgqu-sz  await  r_await  w_await  svctm   %util
xvdb         8.00     30.06   0.19     0.19     0.00   0.01  100.48
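
The absence of merging can be cross-checked from the iostat figures themselves. The sketch below just redoes the arithmetic on the values above; the only outside fact it relies on is that iostat reports avgrq-sz in 512-byte sectors.

```python
SECTOR_BYTES = 512  # iostat reports avgrq-sz in 512-byte sectors

# Values copied from the iostat output above for xvdb.
rrqm_per_s = 0.00          # read requests merged per second
r_per_s = 156926.00        # read requests issued per second
rkb_per_s = 627704.00      # kilobytes read per second
avgrq_sz_sectors = 8.00    # average request size in sectors

# Average size of each read request as seen by the device.
kb_per_request = rkb_per_s / r_per_s
avgrq_bytes = avgrq_sz_sectors * SECTOR_BYTES

print(f"{kb_per_request:.2f} KB per request, avgrq-sz = {avgrq_bytes:.0f} bytes")
```

Both figures come out at 4 KB per request, which is exactly the fio block size, so together with rrqm/s = 0 it looks like every 4KB read reaches the device unmerged.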

$ cat /sys/block/xvdb/queue/scheduler

$ cat /sys/block/xvdb/queue/nomerges
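
For repeated runs it may be handy to read the same two sysfs files from a script. This is a minimal sketch: `read_queue_settings` and its `sysfs_root` parameter are hypothetical names, and it assumes the usual file formats (the active scheduler shown in square brackets, or a bare value such as "none", and an integer in nomerges where 0 means all merges are enabled, 1 means only simple one-hit merges, and 2 means no merging).

```python
from pathlib import Path


def read_queue_settings(device, sysfs_root="/sys/block"):
    """Return the active I/O scheduler and nomerges value for a block device."""
    queue = Path(sysfs_root) / device / "queue"
    scheduler_line = (queue / "scheduler").read_text().strip()
    nomerges = int((queue / "nomerges").read_text().strip())

    # The active scheduler is the bracketed entry, e.g. "noop deadline [cfq]".
    active = None
    for token in scheduler_line.split():
        if token.startswith("[") and token.endswith("]"):
            active = token[1:-1]
    if active is None:
        # No brackets: report the line as-is (e.g. "none").
        active = scheduler_line

    return {"scheduler": active, "nomerges": nomerges}
```

In the guest this would be called as `read_queue_settings("xvdb")`.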

Relevant bits from the xenstore configuration on the dom0:

/local/domain/0/backend/vbd/2/51728/dev = "xvdb"
/local/domain/0/backend/vbd/2/51728/backend-kind = "vbd"
/local/domain/0/backend/vbd/2/51728/type = "phy"
/local/domain/0/backend/vbd/2/51728/multi-queue-max-queues = "1"

/local/domain/2/device/vbd/51728/multi-queue-num-queues = "1"
/local/domain/2/device/vbd/51728/ring-ref = "9"
/local/domain/2/device/vbd/51728/event-channel = "60"

What happens if you add --iodepth-batch=16 to that fio command line? Both mq and non-mq rely on plugging to get batching in the use case above; otherwise IO is dispatched immediately. O_DIRECT is immediate.

I'd be more interested in seeing a test case with buffered IO on a file system on top of the xvdb device. If we're missing merging for that case, then that's a much bigger issue.
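
To illustrate why batching matters for merging, here is a toy model, not the kernel's actual plugging or merge logic. It assumes requests are (offset, length) pairs and that only requests held together in one batch can be merged; `merge_contiguous`, `dispatch` and the `batch` parameter are invented for this sketch, and real limits such as the maximum request size are ignored.

```python
def merge_contiguous(requests):
    """Merge adjacent (offset, length) requests that are back to back."""
    merged = []
    for offset, length in requests:
        if merged and merged[-1][0] + merged[-1][1] == offset:
            prev_offset, prev_length = merged[-1]
            merged[-1] = (prev_offset, prev_length + length)
        else:
            merged.append((offset, length))
    return merged


def dispatch(requests, batch):
    """Issue requests in groups of `batch`, merging within each group only."""
    issued = []
    for i in range(0, len(requests), batch):
        issued.extend(merge_contiguous(requests[i:i + batch]))
    return issued


BLOCK = 4096
# 32 sequential 4KB reads, as a single fio job would submit them.
reads = [(i * BLOCK, BLOCK) for i in range(32)]

immediate = dispatch(reads, batch=1)   # each request issued on its own
batched = dispatch(reads, batch=16)    # 16 requests held back together

print(len(immediate), "requests when dispatched immediately")
print(len(batched), "requests when batched by 16")
```

In this model immediate dispatch issues all 32 requests at 4KB each, while batches of 16 collapse into two 64KB requests.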

Jens Axboe
