Re: ublk-qcow2: ublk-qcow2 is available

From: Denis V. Lunev
Date: Mon Oct 03 2022 - 19:58:11 EST

On 10/3/22 21:53, Stefan Hajnoczi wrote:
On Fri, Sep 30, 2022 at 05:24:11PM +0800, Ming Lei wrote:
ublk-qcow2 is available now.
Cool, thanks for sharing!

So far it provides basic read/write functionality; compression and snapshots
aren't supported yet. The target/backend implementation is completely
based on io_uring, and shares the same io_uring with the ublk IO command
handler, just like ublk-loop does.

The main motivations of ublk-qcow2 are as follows:

- building one complicated target from scratch helps libublksrv APIs/functions
become mature/stable more quickly, since qcow2 is complicated and places more
requirements on libublksrv than the simpler targets (loop, null)

- there have been several attempts at implementing a qcow2 driver in the kernel,
such as ``qloop`` [2], ``dm-qcow2`` [3] and ``in kernel qcow2(ro)`` [4], so
ublk-qcow2 might be useful for covering requirements in this field
There is one important thing to keep in mind about all partly-userspace
implementations though:
* any allocation made in the context of the
   userspace daemon can, through try_to_free_pages()
   in the kernel, trigger an operation that requires
   action from the userspace daemon, which is itself
   inside the kernel at that moment.
* the probability of this is higher in the overcommitted

This was the main motivation of us in favor for the in-kernel

- performance comparison with qemu-nbd; evaluating the performance of a
ublk/io_uring backend by writing ublk-qcow2 has been my first thought since
ublksrv was started

- help to abstract common building block or design pattern for writing new ublk

So far it basically passes the xfstests (XFS) tests using a ublk-qcow2 block
device as TEST_DEV, and a kernel building workload has been verified too. Also,
a soft update approach is applied to metadata flushing, and metadata
integrity is guaranteed. 'make test T=qcow2/040' covers this kind of
test, and only cluster leaks are reported during it.

The performance data looks much better compared with qemu-nbd, see
details in commit log[1], README[5] and STATUS[6]. And the test covers both
empty image and pre-allocated image, for example of pre-allocated qcow2

- qemu-nbd (make test T=qcow2/002)
Single queue?

randwrite(4k): jobs 1, iops 24605
randread(4k): jobs 1, iops 30938
randrw(4k): jobs 1, iops read 13981 write 14001
rw(512k): jobs 1, iops read 724 write 728
Please try qemu-storage-daemon's VDUSE export type as well. The
command-line should be similar to this:

# modprobe virtio_vdpa # attaches vDPA devices to host kernel
# modprobe vduse
# qemu-storage-daemon \
--blockdev file,filename=test.qcow2,cache.direct=on|off,aio=native,node-name=file \
--blockdev qcow2,file=file,node-name=qcow2 \
--object iothread,id=iothread0 \
--export vduse-blk,id=vduse0,name=vduse0,num-queues=$(nproc),node-name=qcow2,writable=on,iothread=iothread0
# vdpa dev add name vduse0 mgmtdev vduse

A virtio-blk device should appear and xfstests can be run on it
(typically /dev/vda unless you already have other virtio-blk devices).

Afterwards you can destroy the device using:

# vdpa dev del vduse0
But I believe this would be limited anyway by the single thread doing AIO in
qemu-storage-daemon.

- ublk-qcow2 (make test T=qcow2/022)
There are a lot of other factors not directly related to NBD vs ublk. In
order to get an apples-to-apples comparison with qemu-* a ublk export
type is needed in qemu-storage-daemon. That way the only difference is
the ublk interface and the rest of the code path is identical, making it
possible to compare NBD, VDUSE, ublk, etc more precisely.

I think that comparison is interesting before comparing different qcow2
implementations because qcow2 sits on top of too much other code. It's
hard to know what should be accounted to configuration differences,
implementation differences, or fundamental differences that cannot be
overcome (this is the interesting part!).

randwrite(4k): jobs 1, iops 104481
randread(4k): jobs 1, iops 114937
randrw(4k): jobs 1, iops read 53630 write 53577
rw(512k): jobs 1, iops read 1412 write 1423
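
For a quick sense of scale, the ratios between the two result sets can be
computed with a small shell helper. This is only a sketch: the `speedup`
function name is made up for illustration, and the inputs are the single-job
figures quoted in this mail (qemu-nbd from qcow2/002, ublk-qcow2 from
qcow2/022).

```shell
#!/bin/sh
# Hypothetical helper: print new/base as a ratio with two decimals.
speedup() {
    awk -v base="$1" -v new="$2" 'BEGIN { printf "%.2fx\n", new / base }'
}

# Arguments: qemu-nbd iops, then ublk-qcow2 iops, as quoted in this thread.
speedup 24605 104481   # randwrite(4k)
speedup 30938 114937   # randread(4k)
speedup 13981 53630    # randrw(4k), read side
speedup 14001 53577    # randrw(4k), write side
speedup 724 1412       # rw(512k), read side
speedup 728 1423       # rw(512k), write side
```

That works out to roughly 4x on the 4k random workloads and about 2x on
512k rw, with the caveat raised in this thread that the two setups differ in
more than the export interface.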

Also, ublk-qcow2 aligns the queue's chunk_sectors limit with qcow2's cluster size,
which is 64KB by default. This simplifies backend IO handling, but the limit
could be increased to 512K or another more suitable size to improve sequential
IO performance; that just requires one coroutine to handle more than one IO.
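
To illustrate what the chunk_sectors limit means for the backend, here is a
minimal shell sketch of how a request is cut at chunk boundaries so that each
piece stays within one qcow2 cluster. The `split_at_chunks` helper is
hypothetical and only models the arithmetic; it is not code from ublksrv or
the kernel. Values are in 512-byte sectors, so a 64KB cluster is 128 sectors.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print "start length" for each piece of a request,
# cutting it so that no piece crosses a chunk boundary.
# Usage: split_at_chunks <start_sector> <nr_sectors> <chunk_sectors>
split_at_chunks() {
    local start=$1 len=$2 chunk=$3 room n
    while [ "$len" -gt 0 ]; do
        # Sectors left before the next chunk boundary.
        room=$(( chunk - start % chunk ))
        # Take whichever is smaller: the rest of the request or the room left.
        n=$(( len < room ? len : room ))
        echo "$start $n"
        start=$(( start + n ))
        len=$(( len - n ))
    done
}

# A 512KB read (1024 sectors) starting 4KB (8 sectors) into a cluster,
# with chunk_sectors equal to the default 64KB cluster (128 sectors).
split_at_chunks 8 1024 128
```

The 512KB read above is cut into nine pieces. With chunk_sectors raised to
512K (1024 sectors) the same read would be cut into two pieces, each of which
spans several clusters, which is the case where one coroutine has to handle
more than one IO.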