RE: [RFC PATCH 6/6] drm/i915/gvt: support QEMU getting the dmabuf

From: Chen, Xiaoguang
Date: Wed May 17 2017 - 21:52:02 EST


Hi Alex,

>-----Original Message-----
>From: Alex Williamson [mailto:alex.williamson@xxxxxxxxxx]
>Sent: Thursday, May 18, 2017 5:44 AM
>To: Chen, Xiaoguang <xiaoguang.chen@xxxxxxxxx>
>Cc: Gerd Hoffmann <kraxel@xxxxxxxxxx>; Tian, Kevin <kevin.tian@xxxxxxxxx>;
>linux-kernel@xxxxxxxxxxxxxxx; zhenyuw@xxxxxxxxxxxxxxx; Lv, Zhiyuan
><zhiyuan.lv@xxxxxxxxx>; intel-gvt-dev@xxxxxxxxxxxxxxxxxxxxx; Wang, Zhi A
><zhi.a.wang@xxxxxxxxx>
>Subject: Re: [RFC PATCH 6/6] drm/i915/gvt: support QEMU getting the dmabuf
>
>On Tue, 16 May 2017 10:16:28 +0000
>"Chen, Xiaoguang" <xiaoguang.chen@xxxxxxxxx> wrote:
>
>> Hi Alex,
>>
>> >-----Original Message-----
>> >From: Alex Williamson [mailto:alex.williamson@xxxxxxxxxx]
>> >Sent: Tuesday, May 16, 2017 1:44 AM
>> >To: Chen, Xiaoguang <xiaoguang.chen@xxxxxxxxx>
>> >Cc: Gerd Hoffmann <kraxel@xxxxxxxxxx>; Tian, Kevin
>> ><kevin.tian@xxxxxxxxx>; intel-gfx@xxxxxxxxxxxxxxxxxxxxx;
>> >linux-kernel@xxxxxxxxxxxxxxx; zhenyuw@xxxxxxxxxxxxxxx; Lv, Zhiyuan
>> ><zhiyuan.lv@xxxxxxxxx>; intel-gvt-dev@xxxxxxxxxxxxxxxxxxxxx; Wang,
>> >Zhi A <zhi.a.wang@xxxxxxxxx>
>> >Subject: Re: [RFC PATCH 6/6] drm/i915/gvt: support QEMU getting the
>> >dmabuf
>> >
>> >On Mon, 15 May 2017 03:36:50 +0000
>> >"Chen, Xiaoguang" <xiaoguang.chen@xxxxxxxxx> wrote:
>> >
>> >> Hi Alex and Gerd,
>> >>
>> >> >-----Original Message-----
>> >> >From: Alex Williamson [mailto:alex.williamson@xxxxxxxxxx]
>> >> >Sent: Saturday, May 13, 2017 12:38 AM
>> >> >To: Gerd Hoffmann <kraxel@xxxxxxxxxx>
>> >> >Cc: Chen, Xiaoguang <xiaoguang.chen@xxxxxxxxx>; Tian, Kevin
>> >> ><kevin.tian@xxxxxxxxx>; intel-gfx@xxxxxxxxxxxxxxxxxxxxx; linux-
>> >> >kernel@xxxxxxxxxxxxxxx; zhenyuw@xxxxxxxxxxxxxxx; Lv, Zhiyuan
>> >> ><zhiyuan.lv@xxxxxxxxx>; intel-gvt-dev@xxxxxxxxxxxxxxxxxxxxx; Wang,
>> >> >Zhi A <zhi.a.wang@xxxxxxxxx>
>> >> >Subject: Re: [RFC PATCH 6/6] drm/i915/gvt: support QEMU getting
>> >> >the dmabuf
>> >> >
>> >> >On Fri, 12 May 2017 11:12:05 +0200 Gerd Hoffmann
>> >> ><kraxel@xxxxxxxxxx> wrote:
>> >> >
>> >> >> Hi,
>> >> >>
>> >> >> > If the contents of the framebuffer change or if the parameters
>> >> >> > of the framebuffer change? I can't imagine that creating a new
>> >> >> > dmabuf fd for every visual change within the framebuffer would
>> >> >> > be efficient, but I don't have any concept of what a dmabuf actually
>> >> >> > does.
>> >> >>
>> >> >> Ok, some background:
>> >> >>
>> >> >> The drm subsystem has the concept of planes. The most important
>> >> >> plane is the primary framebuffer (i.e. what gets scanned out to
>> >> >> the physical display). The cursor is a plane too, and there can
>> >> >> be additional overlay planes for stuff like video playback.
>> >> >>
>> >> >> Typically there are multiple planes in a system and only one of
>> >> >> them gets scanned out to the crtc, i.e. the fbdev emulation
>> >> >> creates one plane for the framebuffer console. The X-Server
>> >> >> creates a plane too, and when you switch between X-Server and
>> >> >> framebuffer console via ctrl-alt-fn the intel driver just
>> >> >> reprograms the encoder to scan out the one or the other plane to the crtc.
>> >> >>
>> >> >> The dma-buf handed out by gvt is a reference to a plane. I
>> >> >> think on the host side gvt can see only the active plane (from
>> >> >> encoder/crtc register
>> >> >> programming) not the inactive ones.
>> >> >>
>> >> >> The dma-buf can be imported as an opengl texture and then be used
>> >> >> to render the guest display to a host window. I think it is
>> >> >> even possible to use the dma-buf as a plane in the host drm driver
>> >> >> and scan it out directly to a physical display. The actual
>> >> >> framebuffer content stays in gpu memory all the time, the cpu never has
>> >> >> to touch it.
>> >> >>
>> >> >> It is possible to cache the dma-buf handles, i.e. when the guest
>> >> >> boots you'll get the first for the fbcon plane, when the
>> >> >> x-server starts the second for the x-server framebuffer, and
>> >> >> when the user switches to the text console via ctrl-alt-fn you
>> >> >> can re-use the fbcon dma-buf you already have.
>> >> >>
>> >> >> The caching becomes more important for good performance when the
>> >> >> guest uses pageflipping (wayland does): define two planes,
>> >> >> render into one while displaying the other, then flip the two
>> >> >> for an atomic display update.
>> >> >>
>> >> >> The caching also makes it a bit difficult to create a good interface.
>> >> >> So, the current patch set creates:
>> >> >>
>> >> >> (a) A way to query the active planes (ioctl
>> >> >> INTEL_VGPU_QUERY_DMABUF added by patch 5/6 of this series).
>> >> >> (b) A way to create a dma-buf for the active plane (ioctl
>> >> >> INTEL_VGPU_GENERATE_DMABUF).
>> >> >>
>> >> >> Typical userspace workflow is to first query the plane, then
>> >> >> check if it already has a dma-buf for it, and if not create one.
>> >> >
>> >> >Thank you! This is immensely helpful!
>> >> >
>> >> >> > What changes to the framebuffer require a new dmabuf fd?
>> >> >> > Shouldn't the user query the parameters of the framebuffer
>> >> >> > through a dmabuf fd and shouldn't the dmabuf fd have some
>> >> >> > signaling mechanism to the user (eventfd perhaps) to notify
>> >> >> > the user to re-evaluate the parameters?
>> >> >>
>> >> >> dma-bufs don't support that, they are really just a handle to a
>> >> >> piece of memory; all metadata (format, size) must be communicated by
>> >> >> other means.
>> >> >>
>> >> >> > Otherwise are you imagining that the user polls the vfio region?
>> >> >>
>> >> >> Hmm, notification support would probably be a good reason to have a
>> >> >> separate file handle to manage the dma-bufs (instead of using
>> >> >> driver-specific ioctls on the vfio fd), because the driver could
>> >> >> also use the management fd for notifications then.
>> >> >
>> >> >I like this idea of a separate control fd for dmabufs, it provides
>> >> >not only a central management point, but also a nice abstraction
>> >> >for the vfio device specific interface. We potentially only need
>> >> >a single
>> >> >VFIO_DEVICE_GET_DMABUF_MGR_FD() ioctl to get a dmabuf management
>> >> >fd (perhaps with a type parameter, ex. GFX) where maybe we could
>> >> >have vfio-core incorporate this reference into the group
>> >> >lifecycle, so the vendor driver only needs to fdget/put this
>> >> >manager fd for the various plane dmabuf fds spawned in order to get
>> >> >core-level reference counting.
>> >> Following is my understanding of the management fd idea:
>> >> 1) QEMU will call the VFIO_DEVICE_GET_DMABUF_MGR_FD() ioctl to create a
>> >> fd and save the fd in the vfio group while initializing vfio.
>> >
>> >Ideally there'd be kernel work here too if we want vfio-core to
>> >incorporate lifecycle of this fd into the device/group/container
>> >lifecycle. Maybe we even want to generalize it further to something
>> >like VFIO_DEVICE_GET_FD which takes a parameter of what type of FD to
>> >get, GFX_DMABUF_MGR_FD in this case. vfio-core would probably
>> >allocate the fd, tap into the release hook for reference counting and
>> >pass it to the vfio_device_ops (mdev vendor driver in this case) to attach
>> >further.
>> I tried to implement this today and it now works functionally.
>> I am still a little confused about how to tap the fd into the release hook of the
>> device/group/container.
>> I tried to create the fd in vfio core but found it is difficult to get the file
>> operations for the fd there; the file operations should be supplied by the vendor drivers.
>
>I'm not fully convinced there's benefit to having vfio-core attempt to do this, I
>was just hoping to avoid each vendor driver needing to implement their own
>reference counting. I don't think vfio-core wants to get into tracking each ioctl
>for each vendor specific fd type to know which create new references and which
>don't. Perhaps we'll come up with easier ways to do this as we go.
Got it.

>
>> So the fd is created in kvmgt.c for now.
>> Below is part of the codes:
>>
>> diff --git a/drivers/gpu/drm/i915/gvt/kvmgt.c
>> b/drivers/gpu/drm/i915/gvt/kvmgt.c
>> index 389f072..d0649ba 100644
>> --- a/drivers/gpu/drm/i915/gvt/kvmgt.c
>> +++ b/drivers/gpu/drm/i915/gvt/kvmgt.c
>> @@ -41,6 +41,7 @@
>> #include <linux/kvm_host.h>
>> #include <linux/vfio.h>
>> #include <linux/mdev.h>
>> +#include <linux/anon_inodes.h>
>>
>> #include "i915_drv.h"
>> #include "gvt.h"
>> @@ -524,6 +525,63 @@ static int intel_vgpu_reg_init_opregion(struct intel_vgpu *vgpu)
>> return ret;
>> }
>>
>> +static int intel_vgpu_dmabuf_mgr_fd_mmap(struct file *file, struct
>> + struct vm_area_struct *vma)
>> +{
>> + WARN_ON(1);
>
>A user can abuse this, simply return error.
OK.

>
>> +
>> + return 0;
>> +}
>> +
>> +static int intel_vgpu_dmabuf_mgr_fd_release(struct inode *inode,
>> + struct file *filp)
>> +{
>> + struct intel_vgpu *vgpu = filp->private_data;
>> +
>> + if (vgpu->vdev.vfio_device != NULL)
>> + vfio_device_put(vgpu->vdev.vfio_device);
>
>When does the case occur where we don't have a vfio_device? This looks a bit
>like a warning flag that reference counting isn't handled properly.
This situation happens only when the anonymous fd is created successfully but an error occurs while trying to get the vfio_device.
In that case we should return an error when user space tries to create the management fd, and print an error message when the kernel releases this fd.
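Something like the following is what I have in mind (just a rough, untested sketch, using the same names as in the patch below): take the vfio_device reference before creating the anonymous fd, so the management fd is never handed to user space without a valid reference, and drop the reference again if the fd creation fails.

	/* grab the vfio_device reference first */
	device = vfio_device_get_from_dev(mdev_dev(mdev));
	if (device == NULL)
		return -ENODEV;

	fd = anon_inode_getfd("vfio_dmabuf_mgr_fd",
			      &intel_vgpu_dmabuf_mgr_fd_ops,
			      vgpu, O_RDWR | O_CLOEXEC);
	if (fd < 0) {
		/* fd creation failed: undo the reference and report the error */
		vfio_device_put(device);
		return fd;
	}

	vgpu->vdev.vfio_device = device;
	vfio_fd.fd = fd;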

>
>> +
>> + return 0;
>> +}
>> +
>> +static long intel_vgpu_dmabuf_mgr_fd_ioctl(struct file *filp,
>> + unsigned int ioctl, unsigned long arg)
>> +{
>> + struct intel_vgpu *vgpu = filp->private_data;
>> + int minsz;
>> + struct intel_vgpu_dmabuf dmabuf;
>> + int ret;
>> + struct fd f;
>> + f = fdget(dmabuf.fd);
>> + minsz = offsetofend(struct intel_vgpu_dmabuf, tiled);
>> + if (copy_from_user(&dmabuf, (void __user *)arg, minsz))
>> + return -EFAULT;
>> + if (ioctl == INTEL_VGPU_QUERY_DMABUF)
>> + ret = intel_gvt_ops->vgpu_query_dmabuf(vgpu, &dmabuf);
>> + else if (ioctl == INTEL_VGPU_GENERATE_DMABUF)
>> + ret = intel_gvt_ops->vgpu_generate_dmabuf(vgpu, &dmabuf);
>
>Why do we need vendor specific ioctls here? Aren't querying the current plane
>and getting an fd for that plane very generic concepts?
>Is the resulting dmabuf Intel specific?
No, not Intel specific. Like Gerd said, "Typical userspace workflow is to first query the plane, then
check if it already has a dma-buf for it, and if not create one".
We first query the plane info (WITHOUT creating a fd).
User space then needs to check whether there is already a dmabuf for the plane (user space usually caches two or three dmabufs to handle double-buffer or triple-buffer situations); only if there is no dmabuf for the plane do we create one for it (another ioctl), as in the sketch below.
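Roughly, the user-space flow would look like the sketch below (the cache helpers are hypothetical and only illustrate the idea; the two ioctls are the ones issued on the dmabuf management fd, mgr_fd):

	struct intel_vgpu_dmabuf plane;
	int dmabuf_fd;

	/* 1) query the currently active plane; no fd is created here */
	memset(&plane, 0, sizeof(plane));
	if (ioctl(mgr_fd, INTEL_VGPU_QUERY_DMABUF, &plane) < 0)
		return -1;

	/* 2) look the plane up in the small user-space dmabuf cache
	 *    (double/triple buffering means only two or three entries) */
	dmabuf_fd = dmabuf_cache_lookup(&plane);	/* hypothetical helper */

	/* 3) only on a cache miss create a new dmabuf for the plane */
	if (dmabuf_fd < 0) {
		if (ioctl(mgr_fd, INTEL_VGPU_GENERATE_DMABUF, &plane) < 0)
			return -1;
		dmabuf_fd = plane.fd;
		dmabuf_cache_insert(&plane, dmabuf_fd);	/* hypothetical helper */
	}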

>
>> + else {
>> + fdput(f);
>> + gvt_vgpu_err("unsupported dmabuf operation\n");
>> + return -EINVAL;
>> + }
>> +
>> + if (ret != 0) {
>> + fdput(f);
>> + gvt_vgpu_err("gvt-g get dmabuf failed:%d\n", ret);
>> + return -EINVAL;
>> + }
>> + fdput(f);
>> +
>> + return copy_to_user((void __user *)arg, &dmabuf, minsz) ? -EFAULT : 0;
>> +}
>> +
>> +static const struct file_operations intel_vgpu_dmabuf_mgr_fd_ops = {
>> + .release = intel_vgpu_dmabuf_mgr_fd_release,
>> + .unlocked_ioctl = intel_vgpu_dmabuf_mgr_fd_ioctl,
>> + .mmap = intel_vgpu_dmabuf_mgr_fd_mmap,
>> + .llseek = noop_llseek,
>> +};
>> +
>> static int intel_vgpu_create(struct kobject *kobj, struct mdev_device *mdev)
>> {
>> struct intel_vgpu *vgpu = NULL;
>> @@ -1259,6 +1317,31 @@ static long intel_vgpu_ioctl(struct mdev_device *mdev, unsigned int cmd,
>> } else if (cmd == VFIO_DEVICE_RESET) {
>> intel_gvt_ops->vgpu_reset(vgpu);
>> return 0;
>> + } else if (cmd == VFIO_DEVICE_GET_FD) {
>> + struct vfio_fd vfio_fd;
>> + int fd;
>> + struct vfio_device *device;
>> +
>> + minsz = offsetofend(struct vfio_fd, fd);
>> + if (copy_from_user(&vfio_fd, (void __user *)arg, minsz))
>> + return -EINVAL;
>> +
>> + if (vfio_fd.argsz < minsz)
>> + return -EINVAL;
>> +
>> + fd = anon_inode_getfd("vfio_dmabuf_mgr_fd", &intel_vgpu_dmabuf_mgr_fd_ops,
>> + vgpu, O_RDWR | O_CLOEXEC);
>> + if (fd < 0)
>> + return -EINVAL;
>> +
>> + vfio_fd.fd = fd;
>> + device = vfio_device_get_from_dev(mdev_dev(mdev));
>> + if (device == NULL)
>> + gvt_vgpu_err("kvmgt: vfio device is null\n");
>> + else
>> + vgpu->vdev.vfio_device = device;
>> +
>> + return copy_to_user((void __user *)arg, &vfio_fd,
>> + minsz) ? -EFAULT : 0;
>> }
>>
>> return 0;
>> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
>> index 519eff3..98be2e0 100644
>> --- a/include/uapi/linux/vfio.h
>> +++ b/include/uapi/linux/vfio.h
>> @@ -484,6 +485,20 @@ struct vfio_pci_hot_reset {
>>
>> #define VFIO_DEVICE_PCI_HOT_RESET _IO(VFIO_TYPE, VFIO_BASE + 13)
>>
>> +/**
>> + * VFIO_DEVICE_GET_FD - _IOW(VFIO_TYPE, VFIO_BASE + 21, struct vfio_fd)
>> + *
>> + * Create a fd for a vfio device.
>> + * This fd can be used for various purposes.
>> + */
>> +struct vfio_fd {
>> + __u32 argsz;
>> + __u32 flags;
>> + /* out */
>> + __u32 fd;
>> +};
>> +#define VFIO_DEVICE_GET_FD _IO(VFIO_TYPE, VFIO_BASE + 14)
>
>
>The idea was that we pass some sort of type to VFIO_DEVICE_GET_FD, for
>instance we might ask for a DEVICE_FD_GRAPHICS_DMABUF and the vfio bus
>driver (mdev vendor driver) would test whether it supports that type of thing and
>either return an fd or an error. We can return the fd the same way we do for
>VFIO_GROUP_GET_DEVICE_FD. For instance the user should do something like:
>
>dmabuf_fd = ioctl(device_fd, VFIO_DEVICE_GET_FD, DEVICE_FD_GRAPHICS_DMABUF);
>if (dmabuf_fd < 0)
>	/* not supported... */
>else
>	/* do stuff */
OK. Got it.


>
>Thanks,
>Alex