Re: [RFC] vhost: introduce mdev based hardware vhost backend

From: Tiwei Bie
Date: Tue Apr 10 2018 - 00:59:36 EST


On Tue, Apr 10, 2018 at 10:52:52AM +0800, Jason Wang wrote:
> On 2018/04/02 23:23, Tiwei Bie wrote:
> > This patch introduces an mdev (mediated device) based hardware
> > vhost backend. This backend is an abstraction of the various
> > hardware vhost accelerators (potentially, any device that uses
> > a virtio ring can be used as a vhost accelerator). Some generic
> > mdev parent ops are provided for accelerator drivers to support
> > generating mdev instances.
> >
> > What's this
> > ===========
> >
> > The idea is that we can set up a virtio ring compatible device
> > with the messages available at the vhost backend. Originally,
> > these messages are used to implement a software vhost backend,
> > but now we will use them to set up a virtio ring compatible
> > hardware device. Then the hardware device will be able to work
> > with the guest virtio driver in the VM just like what the
> > software backend does. That is to say, we can implement a
> > hardware based vhost backend in QEMU, and any virtio ring
> > compatible device can potentially be used with this backend.
> > (We also call it vDPA -- vhost Data Path Acceleration.)
> >
> > One problem is that different virtio ring compatible devices
> > may have different device interfaces. That is to say, we would
> > need different drivers in QEMU. It could be troublesome. And
> > that's what this patch is trying to fix. The idea behind this
> > patch is very simple: mdev is a standard way to emulate devices
> > in the kernel.
>
> So you just move the abstraction layer from qemu to the kernel, and you
> still need different drivers in the kernel for the different device
> interfaces of the accelerators. This looks even more complex than leaving
> it in qemu. As you said, another idea is to implement a userspace vhost
> backend for accelerators, which seems easier and could co-work with other
> parts of qemu without inventing a new type of messages.

I'm not quite sure. Do you think it's acceptable to
add various vendor-specific hardware drivers in QEMU?

>
> Need careful thought here to seek the best solution.

Yeah, definitely! :)
And your opinions would be very helpful!

>
> > So we defined a standard device based on mdev, which
> > is able to accept vhost messages. When the mdev emulation code
> > (i.e. the generic mdev parent ops provided by this patch) gets
> > vhost messages, it will parse them and deliver them to accelerator
> > drivers. Drivers can use these messages to set up the accelerators.
> >
> > That is to say, the generic mdev parent ops (e.g. read()/write()/
> > ioctl()/...) will be provided for accelerator drivers to register
> > accelerators as mdev parent devices. And each accelerator device
> > will support generating standard mdev instance(s).
> >
> > With this standard device interface, we will be able to just
> > develop one userspace driver to implement the hardware based
> > vhost backend in QEMU.
> >
> > Difference between vDPA and PCI passthru
> > ========================================
> >
> > The key difference between vDPA and PCI passthru is that, in
> > vDPA, only the data path of the device (e.g. DMA ring, notify
> > region and queue interrupt) is passed through to the VM, while
> > the device control path (e.g. PCI configuration space and MMIO
> > regions) is still defined and emulated by QEMU.
> >
> > The benefits of keeping virtio device emulation in QEMU compared
> > with virtio device PCI passthru include (but are not limited to):
> >
> > - consistent device interface for the guest OS in the VM;
> > - maximum flexibility in the hardware design; in particular, the
> >   accelerator for each vhost backend doesn't have to be a full
> >   PCI device;
> > - leveraging the existing virtio live-migration framework;
> >
> > The interface of this mdev based device
> > =======================================
> >
> > 1. BAR0
> >
> > The MMIO region described by BAR0 is the main control
> > interface. Messages will be written to or read from
> > this region.
> >
> > The message type is determined by the `request` field
> > in the message header. The message size is encoded in the
> > message header too. The message format looks like this:
> >
> > struct vhost_vfio_op {
> >         __u64 request;
> >         __u32 flags;
> >         /* Flag values: */
> > #define VHOST_VFIO_NEED_REPLY 0x1 /* Whether a reply is needed */
> >         __u32 size;
> >         union {
> >                 __u64 u64;
> >                 struct vhost_vring_state state;
> >                 struct vhost_vring_addr addr;
> >                 struct vhost_memory memory;
> >         } payload;
> > };
> >
> > The existing vhost-kernel ioctl cmds are reused as
> > the message requests in the above structure.
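> >
> > For example, a few of the reused requests and the payload field
> > each one carries (these names are the existing ioctl definitions
> > from <linux/vhost.h>, listed here only for illustration):
> >
> > VHOST_GET_FEATURES / VHOST_SET_FEATURES   -> payload.u64
> > VHOST_SET_VRING_NUM / VHOST_SET_VRING_BASE -> payload.state
> > VHOST_SET_VRING_ADDR                       -> payload.addr
> > VHOST_SET_MEM_TABLE                        -> payload.memory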
> >
> > Each message will be written to or read from this
> > region at offset 0:
> >
> > int vhost_vfio_write(struct vhost_dev *dev, struct vhost_vfio_op *op)
> > {
> >         int count = VHOST_VFIO_OP_HDR_SIZE + op->size;
> >         struct vhost_vfio *vfio = dev->opaque;
> >         int ret;
> >
> >         ret = pwrite64(vfio->device_fd, op, count, vfio->bar0_offset);
> >         if (ret != count)
> >                 return -1;
> >
> >         return 0;
> > }
> >
> > int vhost_vfio_read(struct vhost_dev *dev, struct vhost_vfio_op *op)
> > {
> >         int count = VHOST_VFIO_OP_HDR_SIZE + op->size;
> >         struct vhost_vfio *vfio = dev->opaque;
> >         uint64_t request = op->request;
> >         int ret;
> >
> >         ret = pread64(vfio->device_fd, op, count, vfio->bar0_offset);
> >         if (ret != count || request != op->request)
> >                 return -1;
> >
> >         return 0;
> > }
> >
> > Setting things on the device is quite straightforward:
> > just write the message to the device directly:
> >
> > int vhost_vfio_set_features(struct vhost_dev *dev, uint64_t features)
> > {
> >         struct vhost_vfio_op op;
> >
> >         op.request = VHOST_SET_FEATURES;
> >         op.flags = 0;
> >         op.size = sizeof(features);
> >         op.payload.u64 = features;
> >
> >         return vhost_vfio_write(dev, &op);
> > }
> >
> > To get things from the device, two steps are needed.
> > Take VHOST_GET_FEATURES as an example:
> >
> > int vhost_vfio_get_features(struct vhost_dev *dev, uint64_t *features)
> > {
> >         struct vhost_vfio_op op;
> >         int ret;
> >
> >         op.request = VHOST_GET_FEATURES;
> >         op.flags = VHOST_VFIO_NEED_REPLY;
> >         op.size = 0;
> >
> >         /* Just need to write the header */
> >         ret = vhost_vfio_write(dev, &op);
> >         if (ret != 0)
> >                 goto out;
> >
> >         /* `op` wasn't changed during write */
> >         op.flags = 0;
> >         op.size = sizeof(*features);
> >
> >         ret = vhost_vfio_read(dev, &op);
> >         if (ret != 0)
> >                 goto out;
> >
> >         *features = op.payload.u64;
> > out:
> >         return ret;
> > }
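> >
> > On the kernel side, the generic mdev emulation code is expected
> > to stash the reply when it sees VHOST_VFIO_NEED_REPLY, so that
> > the following read() on BAR0 can return it. A rough sketch of
> > that handling (simplified; it relies on the struct vdpa_dev
> > fields described later in this mail, and locking/error handling
> > are omitted):
> >
> > static int vdpa_handle_msg(struct vdpa_dev *vdpa,
> >                            struct vhost_vfio_op *op)
> > {
> >         switch (op->request) {
> >         case VHOST_GET_FEATURES:
> >                 /* Ask the accelerator driver what it supports and
> >                  * stash the reply for the next read() on BAR0. */
> >                 vdpa->pending.request = op->request;
> >                 vdpa->pending.size = sizeof(__u64);
> >                 vdpa->pending.payload.u64 =
> >                         vdpa->ops->supported_features(vdpa);
> >                 vdpa->pending_reply = true;
> >                 break;
> >         case VHOST_SET_FEATURES:
> >                 vdpa->features = op->payload.u64;
> >                 break;
> >         default:
> >                 return -EOPNOTSUPP;
> >         }
> >
> >         return 0;
> > }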
> >
> > 2. BAR1 (mmap-able)
> >
> > The MMIO region described by BAR1 will be used to notify the
> > device.
> >
> > Each queue has a page for notification, and it can be mapped
> > into the VM (if the hardware also supports this), so the virtio
> > driver in the VM will be able to notify the device directly.
> >
> > The MMIO region described by BAR1 is also writable. If the
> > accelerator's notification register(s) cannot be mapped into the
> > VM, write() can also be used to notify the device. Something
> > like this:
> >
> > void notify_relay(void *opaque)
> > {
> >         ......
> >         offset = 0x1000 * queue_idx; /* XXX assume page size is 4K here. */
> >
> >         ret = pwrite64(vfio->device_fd, &queue_idx, sizeof(queue_idx),
> >                        vfio->bar1_offset + offset);
> >         ......
> > }
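> >
> > If the hardware does support it, the per-queue notify page can
> > instead be mmap()-ed through the device fd and handed to the VM,
> > so the relay above isn't needed. A rough sketch of the mapping
> > step (hypothetical helper name; it assumes BAR1 is reported as
> > mmap-able, i.e. VFIO_REGION_INFO_FLAG_MMAP is set, and error
> > handling is omitted):
> >
> > void *map_notify_page(struct vhost_vfio *vfio, int queue_idx)
> > {
> >         /* XXX assume page size is 4K here, as above. */
> >         off_t offset = vfio->bar1_offset + 0x1000 * queue_idx;
> >
> >         /* The returned pointer can then be registered with QEMU's
> >          * memory API, so guest notifications go straight to the
> >          * hardware. */
> >         return mmap(NULL, 0x1000, PROT_READ | PROT_WRITE,
> >                     MAP_SHARED, vfio->device_fd, offset);
> > }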
> >
> > Other BARs are reserved.
> >
> > 3. VFIO interrupt ioctl API
> >
> > The VFIO interrupt ioctl API is used to set up device interrupts.
> > IRQ bypass will also be supported.
> >
> > Currently, only VFIO_PCI_MSIX_IRQ_INDEX is supported.
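> >
> > For example, binding an eventfd to a queue's MSI-X vector would
> > use the standard VFIO_DEVICE_SET_IRQS ioctl, roughly like this
> > (hypothetical helper name, error handling omitted):
> >
> > int vhost_vfio_set_vring_call(struct vhost_vfio *vfio, int vector, int fd)
> > {
> >         struct vfio_irq_set *irq_set;
> >         size_t argsz = sizeof(*irq_set) + sizeof(int);
> >         int ret;
> >
> >         irq_set = malloc(argsz);
> >         irq_set->argsz = argsz;
> >         irq_set->flags = VFIO_IRQ_SET_DATA_EVENTFD |
> >                          VFIO_IRQ_SET_ACTION_TRIGGER;
> >         irq_set->index = VFIO_PCI_MSIX_IRQ_INDEX;
> >         irq_set->start = vector;
> >         irq_set->count = 1;
> >         memcpy(irq_set->data, &fd, sizeof(int));
> >
> >         ret = ioctl(vfio->device_fd, VFIO_DEVICE_SET_IRQS, irq_set);
> >         free(irq_set);
> >
> >         return ret;
> > }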
> >
> > The API for drivers to provide mdev instances
> > =============================================
> >
> > The read()/write()/ioctl()/mmap()/open()/release() mdev
> > parent ops have been provided for accelerator drivers
> > to expose mdev instances.
> >
> > ssize_t vdpa_read(struct mdev_device *mdev, char __user *buf,
> >                   size_t count, loff_t *ppos);
> > ssize_t vdpa_write(struct mdev_device *mdev, const char __user *buf,
> >                    size_t count, loff_t *ppos);
> > long vdpa_ioctl(struct mdev_device *mdev, unsigned int cmd, unsigned long arg);
> > int vdpa_mmap(struct mdev_device *mdev, struct vm_area_struct *vma);
> > int vdpa_open(struct mdev_device *mdev);
> > void vdpa_close(struct mdev_device *mdev);
> >
> > Each accelerator driver just needs to implement its own
> > create()/remove() ops and provide its vdpa device ops,
> > which will be called by the generic mdev emulation code
> > (a rough driver sketch follows the definitions below).
> >
> > Currently, the vdpa device ops are defined as:
> >
> > typedef int (*vdpa_start_device_t)(struct vdpa_dev *vdpa);
> > typedef int (*vdpa_stop_device_t)(struct vdpa_dev *vdpa);
> > typedef int (*vdpa_dma_map_t)(struct vdpa_dev *vdpa);
> > typedef int (*vdpa_dma_unmap_t)(struct vdpa_dev *vdpa);
> > typedef int (*vdpa_set_eventfd_t)(struct vdpa_dev *vdpa, int vector, int fd);
> > typedef u64 (*vdpa_supported_features_t)(struct vdpa_dev *vdpa);
> > typedef void (*vdpa_notify_device_t)(struct vdpa_dev *vdpa, int qid);
> > typedef u64 (*vdpa_get_notify_addr_t)(struct vdpa_dev *vdpa, int qid);
> >
> > struct vdpa_device_ops {
> >         vdpa_start_device_t       start;
> >         vdpa_stop_device_t        stop;
> >         vdpa_dma_map_t            dma_map;
> >         vdpa_dma_unmap_t          dma_unmap;
> >         vdpa_set_eventfd_t        set_eventfd;
> >         vdpa_supported_features_t supported_features;
> >         vdpa_notify_device_t      notify;
> >         vdpa_get_notify_addr_t    get_notify_addr;
> > };
> >
> > struct vdpa_dev {
> >         struct mdev_device *mdev;
> >         struct mutex ops_lock;
> >         u8 vconfig[VDPA_CONFIG_SIZE];
> >         int nr_vring;
> >         u64 features;
> >         u64 state;
> >         struct vhost_memory *mem_table;
> >         bool pending_reply;
> >         struct vhost_vfio_op pending;
> >         const struct vdpa_device_ops *ops;
> >         void *private;
> >         int max_vrings;
> >         struct vdpa_vring_info vring_info[0];
> > };
> >
> > struct vdpa_dev *vdpa_alloc(struct mdev_device *mdev, void *private,
> >                             int max_vrings);
> > void vdpa_free(struct vdpa_dev *vdpa);
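> >
> > For illustration only, a minimal accelerator driver (all foo_*
> > names are made up, and the exact wiring of the ops pointer is
> > still open) might hook things up in its mdev create() op roughly
> > like this:
> >
> > static const struct vdpa_device_ops foo_vdpa_ops = {
> >         .start              = foo_start,
> >         .stop               = foo_stop,
> >         .dma_map            = foo_dma_map,
> >         .dma_unmap          = foo_dma_unmap,
> >         .set_eventfd        = foo_set_eventfd,
> >         .supported_features = foo_supported_features,
> >         .notify             = foo_notify,
> >         .get_notify_addr    = foo_get_notify_addr,
> > };
> >
> > static int foo_mdev_create(struct kobject *kobj, struct mdev_device *mdev)
> > {
> >         struct foo_adapter *adapter = foo_get_adapter(mdev);
> >         struct vdpa_dev *vdpa;
> >
> >         vdpa = vdpa_alloc(mdev, adapter, FOO_MAX_QUEUES);
> >         if (!vdpa)
> >                 return -ENOMEM;
> >
> >         vdpa->ops = &foo_vdpa_ops;
> >         return 0;
> > }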
> >
> > A simple example
> > ================
> >
> > # Query the number of available mdev instances
> > $ cat /sys/class/mdev_bus/0000:06:00.2/mdev_supported_types/ifcvf_vdpa-vdpa_virtio/available_instances
> >
> > # Create an mdev instance
> > $ echo $UUID > /sys/class/mdev_bus/0000:06:00.2/mdev_supported_types/ifcvf_vdpa-vdpa_virtio/create
> >
> > # Launch QEMU with a virtio-net device
> > $ qemu \
> >     ...... \
> >     -netdev type=vhost-vfio,sysfsdev=/sys/bus/mdev/devices/$UUID,id=$ID \
> >     -device virtio-net-pci,netdev=$ID
> >
> > -------- END --------
> >
> > Most of the above text will be refined and moved to a doc in
> > the formal patch. In this RFC, all introductions and code
> > are gathered in this patch; the idea is to make it easier
> > to find all the relevant information. Anyone who wants to
> > comment can use inline comments and just keep the relevant
> > parts. Sorry for the big RFC patch..
> >
> > This patch is just an RFC for now, and some things are still
> > missing or need to be refined. But it's never too early
> > to hear the thoughts from the community. So any comments
> > would be appreciated! Thanks! :-)
>
> I don't see vhost_vfio_write() and the other above functions in the patch.
> Looks like some part of the patch is missing; it would be better to post a
> complete series with an example driver (vDPA) to get a full picture.

No problem. We will send out the QEMU changes soon!

Thanks!

>
> Thanks
>
[...]