RE: [RFC PATCH 6/6] drm/i915/gvt: support QEMU getting the dmabuf

From: Chen, Xiaoguang
Date: Fri May 19 2017 - 02:24:43 EST




>-----Original Message-----
>From: Alex Williamson [mailto:alex.williamson@xxxxxxxxxx]
>Sent: Thursday, May 18, 2017 10:56 PM
>To: Chen, Xiaoguang <xiaoguang.chen@xxxxxxxxx>
>Cc: Gerd Hoffmann <kraxel@xxxxxxxxxx>; Tian, Kevin <kevin.tian@xxxxxxxxx>;
>linux-kernel@xxxxxxxxxxxxxxx; zhenyuw@xxxxxxxxxxxxxxx; Lv, Zhiyuan
><zhiyuan.lv@xxxxxxxxx>; intel-gvt-dev@xxxxxxxxxxxxxxxxxxxxx; Wang, Zhi A
><zhi.a.wang@xxxxxxxxx>
>Subject: Re: [RFC PATCH 6/6] drm/i915/gvt: support QEMU getting the dmabuf
>
>On Thu, 18 May 2017 01:51:38 +0000
>"Chen, Xiaoguang" <xiaoguang.chen@xxxxxxxxx> wrote:
>
>> Hi Alex,
>>
>> >-----Original Message-----
>> >From: Alex Williamson [mailto:alex.williamson@xxxxxxxxxx]
>> >Sent: Thursday, May 18, 2017 5:44 AM
>> >To: Chen, Xiaoguang <xiaoguang.chen@xxxxxxxxx>
>> >Cc: Gerd Hoffmann <kraxel@xxxxxxxxxx>; Tian, Kevin
>> ><kevin.tian@xxxxxxxxx>; linux-kernel@xxxxxxxxxxxxxxx;
>> >zhenyuw@xxxxxxxxxxxxxxx; Lv, Zhiyuan <zhiyuan.lv@xxxxxxxxx>;
>> >intel-gvt-dev@xxxxxxxxxxxxxxxxxxxxx; Wang, Zhi A
>> ><zhi.a.wang@xxxxxxxxx>
>> >Subject: Re: [RFC PATCH 6/6] drm/i915/gvt: support QEMU getting the
>> >dmabuf
>> >
>> >On Tue, 16 May 2017 10:16:28 +0000
>> >"Chen, Xiaoguang" <xiaoguang.chen@xxxxxxxxx> wrote:
>> >
>> >> Hi Alex,
>> >>
>> >> >-----Original Message-----
>> >> >From: Alex Williamson [mailto:alex.williamson@xxxxxxxxxx]
>> >> >Sent: Tuesday, May 16, 2017 1:44 AM
>> >> >To: Chen, Xiaoguang <xiaoguang.chen@xxxxxxxxx>
>> >> >Cc: Gerd Hoffmann <kraxel@xxxxxxxxxx>; Tian, Kevin
>> >> ><kevin.tian@xxxxxxxxx>; intel-gfx@xxxxxxxxxxxxxxxxxxxxx;
>> >> >linux-kernel@xxxxxxxxxxxxxxx; zhenyuw@xxxxxxxxxxxxxxx; Lv, Zhiyuan
>> >> ><zhiyuan.lv@xxxxxxxxx>; intel-gvt- dev@xxxxxxxxxxxxxxxxxxxxx;
>> >> >Wang, Zhi A <zhi.a.wang@xxxxxxxxx>
>> >> >Subject: Re: [RFC PATCH 6/6] drm/i915/gvt: support QEMU getting
>> >> >the dmabuf
>> >> >
>> >> >On Mon, 15 May 2017 03:36:50 +0000 "Chen, Xiaoguang"
>> >> ><xiaoguang.chen@xxxxxxxxx> wrote:
>> >> >
>> >> >> Hi Alex and Gerd,
>> >> >>
>> >> >> >-----Original Message-----
>> >> >> >From: Alex Williamson [mailto:alex.williamson@xxxxxxxxxx]
>> >> >> >Sent: Saturday, May 13, 2017 12:38 AM
>> >> >> >To: Gerd Hoffmann <kraxel@xxxxxxxxxx>
>> >> >> >Cc: Chen, Xiaoguang <xiaoguang.chen@xxxxxxxxx>; Tian, Kevin
>> >> >> ><kevin.tian@xxxxxxxxx>; intel-gfx@xxxxxxxxxxxxxxxxxxxxx; linux-
>> >> >> >kernel@xxxxxxxxxxxxxxx; zhenyuw@xxxxxxxxxxxxxxx; Lv, Zhiyuan
>> >> >> ><zhiyuan.lv@xxxxxxxxx>; intel-gvt-dev@xxxxxxxxxxxxxxxxxxxxx;
>> >> >> >Wang, Zhi A <zhi.a.wang@xxxxxxxxx>
>> >> >> >Subject: Re: [RFC PATCH 6/6] drm/i915/gvt: support QEMU getting
>> >> >> >the dmabuf
>> >> >> >
>> >> >> >On Fri, 12 May 2017 11:12:05 +0200 Gerd Hoffmann
>> >> >> ><kraxel@xxxxxxxxxx> wrote:
>> >> >> >
>> >> >> >> Hi,
>> >> >> >>
>> >> >> >> > If the contents of the framebuffer change or if the
>> >> >> >> > parameters of the framebuffer change? I can't imagine that
>> >> >> >> > creating a new dmabuf fd for every visual change within the
>> >> >> >> > framebuffer would be efficient, but I don't have any
>> >> >> >> > concept of what a dmabuf actually
>> >does.
>> >> >> >>
>> >> >> >> Ok, some background:
>> >> >> >>
>> >> >> >> The drm subsystem has the concept of planes. The most
>> >> >> >> important plane is the primary framebuffer (i.e. what gets
>> >> >> >> scanned out to the physical display). The cursor is a plane
>> >> >> >> too, and there can be additional overlay planes for stuff like video
>playback.
>> >> >> >>
>> >> >> >> Typically there are multiple planes in a system and only one
>> >> >> >> of them gets scanned out to the crtc, i.e. the fbdev
>> >> >> >> emulation creates one plane for the framebuffer console. The
>> >> >> >> X-Server creates a plane too, and when you switch between
>> >> >> >> X-Server and framebuffer console via ctrl-alt-fn the intel
>> >> >> >> driver just reprograms the encoder to scan out the one or the other
>plane to the crtc.
>> >> >> >>
>> >> >> >> The dma-buf handed out by gvt is a reference to a plane. I
>> >> >> >> think on the host side gvt can see only the active plane
>> >> >> >> (from encoder/crtc register
>> >> >> >> programming) not the inactive ones.
>> >> >> >>
>> >> >> >> The dma-buf can be imported as opengl texture and then be
>> >> >> >> used to render the guest display to a host window. I think
>> >> >> >> it is even possible to use the dma-buf as plane in the host
>> >> >> >> drm driver and scan it out directly to a physical display.
>> >> >> >> The actual framebuffer content stays in gpu memory all the
>> >> >> >> time, the cpu never has
>> >to touch it.
>> >> >> >>
>> >> >> >> It is possible to cache the dma-buf handles, i.e. when the
>> >> >> >> guest boots you'll get the first for the fbcon plane, when
>> >> >> >> the x-server starts the second for the x-server framebuffer,
>> >> >> >> and when the user switches to the text console via
>> >> >> >> ctrl-alt-fn you can re-use the fbcon dma-buf you already have.
>> >> >> >>
>> >> >> >> The caching becomes more important for good performance when
>> >> >> >> the guest uses pageflipping (wayland does): define two
>> >> >> >> planes, render into one while displaying the other, then flip
>> >> >> >> the two for an atomic display update.
>> >> >> >>
>> >> >> >> The caching also makes it a bit difficult to create a good interface.
>> >> >> >> So, the current patch set creates:
>> >> >> >>
>> >> >> >> (a) A way to query the active planes (ioctl
>> >> >> >> INTEL_VGPU_QUERY_DMABUF added by patch 5/6 of this series).
>> >> >> >> (b) A way to create a dma-buf for the active plane (ioctl
>> >> >> >> INTEL_VGPU_GENERATE_DMABUF).
>> >> >> >>
>> >> >> >> Typical userspace workflow is to first query the plane, then
>> >> >> >> check if it already has a dma-buf for it, and if not create one.
>> >> >> >
>> >> >> >Thank you! This is immensely helpful!
>> >> >> >
>> >> >> >> > What changes to the framebuffer require a new dmabuf fd?
>> >> >> >> > Shouldn't the user query the parameters of the framebuffer
>> >> >> >> > through a dmabuf fd and shouldn't the dmabuf fd have some
>> >> >> >> > signaling mechanism to the user (eventfd perhaps) to notify
>> >> >> >> > the user to re-
>> >> >evaluate the parameters?
>> >> >> >>
>> >> >> >> dma-bufs don't support that, they are really just a handle to
>> >> >> >> a piece of memory, all metadata (format, size) must be
>> >> >> >> communicated by
>> >> >other means.
>> >> >> >>
>> >> >> >> > Otherwise are you imagining that the user polls the vfio region?
>> >> >> >>
>> >> >> >> Hmm, notification support would probably be a good reason to
>> >> >> >> have a separate file handle to manage the dma-bufs (instead
>> >> >> >> of using driver-specific ioctls on the vfio fd), because the
>> >> >> >> driver could also use the management fd for notifications then.
>> >> >> >
>> >> >> >I like this idea of a separate control fd for dmabufs, it
>> >> >> >provides not only a central management point, but also a nice
>> >> >> >abstraction for the vfio device specific interface. We
>> >> >> >potentially only need a single
>> >> >> >VFIO_DEVICE_GET_DMABUF_MGR_FD() ioctl to get a dmabuf
>> >> >> >management fd (perhaps with a type parameter, ex. GFX) where
>> >> >> >maybe we could have vfio-core incorporate this reference into
>> >> >> >the group lifecycle, so the vendor driver only needs to
>> >> >> >fdget/put this manager fd for the various plane dmabuf fds
>> >> >> >spawned in order to get core-
>> >level reference counting.
>> >> >> Following is my understanding of the management fd idea:
>> >> >> 1) QEMU will call the VFIO_DEVICE_GET_DMABUF_MGR_FD() ioctl to
>> >> >> create an fd and save the fd in the vfio group while initializing
>> >> >> vfio.
>> >> >
>> >> >Ideally there'd be kernel work here too if we want vfio-core to
>> >> >incorporate lifecycle of this fd into the device/group/container
>> >> >lifecycle. Maybe we even want to generalize it further to
>> >> >something like VFIO_DEVICE_GET_FD which takes a parameter of what
>> >> >type of FD to get, GFX_DMABUF_MGR_FD in this case. vfio- core
>> >> >would probably allocate the fd, tap into the release hook for
>> >> >reference counting and pass it to the vfio_device_ops (mdev vendor
>> >> >driver in this case) to attach
>> >further.
>> >> I tried to implement this today and it now works functionally.
>> >> I am still a little confused about how to tap the fd into the release
>> >> hook of the device/group/container.
>> >> I tried to create the fd in vfio core but found it is difficult to
>> >> get the file operations for the fd; the file operations should be
>> >> supplied by vendor drivers.
>> >
>> >I'm not fully convinced there's benefit to having vfio-core attempt
>> >to do this, I was just hoping to avoid each vendor driver needing to
>> >implement their own reference counting. I don't think vfio-core
>> >wants to get into tracking each ioctl for each vendor specific fd
>> >type to know which create new references and which don't. Perhaps we'll
>come up with easier ways to do this as we go.
>> Got it.
>>
>> >
>> >> So the fd is created in kvmgt.c for now.
>> >> Below is part of the codes:
>> >>
>> >> diff --git a/drivers/gpu/drm/i915/gvt/kvmgt.c
>> >> b/drivers/gpu/drm/i915/gvt/kvmgt.c
>> >> index 389f072..d0649ba 100644
>> >> --- a/drivers/gpu/drm/i915/gvt/kvmgt.c
>> >> +++ b/drivers/gpu/drm/i915/gvt/kvmgt.c
>> >> @@ -41,6 +41,7 @@
>> >> #include <linux/kvm_host.h>
>> >> #include <linux/vfio.h>
>> >> #include <linux/mdev.h>
>> >> +#include <linux/anon_inodes.h>
>> >>
>> >> #include "i915_drv.h"
>> >> #include "gvt.h"
>> >> @@ -524,6 +525,63 @@ static int intel_vgpu_reg_init_opregion(struct
>> >intel_vgpu *vgpu)
>> >> return ret;
>> >> }
>> >>
>> >> +static int intel_vgpu_dmabuf_mgr_fd_mmap(struct file *file, struct
>> >> +vm_area_struct *vma) {
>> >> + WARN_ON(1);
>> >
>> >A user can abuse this, simply return error.
>> OK.
>>
>> >
>> >> +
>> >> + return 0;
>> >> +}
>> >> +
>> >> +static int intel_vgpu_dmabuf_mgr_fd_release(struct inode *inode,
>> >> +struct file *filp) {
>> >> + struct intel_vgpu *vgpu = filp->private_data;
>> >> +
>> >> + if (vgpu->vdev.vfio_device != NULL)
>> >> + vfio_device_put(vgpu->vdev.vfio_device);
>> >
>> >When does the case occur where we don't have a vfio_device? This
>> >looks a bit like a warning flag that reference counting isn't handled properly.
>> This situation happens only when the anonymous fd was created successfully
>> but an error occurred while trying to get the vfio_device.
>> We should return an error when user space tries to create the management fd,
>> and print an error message when the kernel releases this fd.
>
>I think the code should be re-ordered so the anon-fd is created last, then we don't
>need to worry about this invalid release callback.
OK. I will try to get the vfio device first and create the fd only when the vfio device has been obtained successfully.
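
Just to make the re-ordering concrete, I am thinking of something roughly like the following for the VFIO_DEVICE_GET_FD path (error codes and cleanup details are only a sketch, not the final patch; minsz is declared earlier in the function as in the snippet below):

} else if (cmd == VFIO_DEVICE_GET_FD) {
	struct vfio_fd vfio_fd;
	struct vfio_device *device;
	int fd;

	minsz = offsetofend(struct vfio_fd, fd);
	if (copy_from_user(&vfio_fd, (void __user *)arg, minsz))
		return -EFAULT;

	if (vfio_fd.argsz < minsz)
		return -EINVAL;

	/* Take the vfio_device reference first ... */
	device = vfio_device_get_from_dev(mdev_dev(mdev));
	if (device == NULL)
		return -ENODEV;

	/* ... and create the anon fd last, so the release callback can
	 * always assume vdev.vfio_device is valid. */
	fd = anon_inode_getfd("vfio_dmabuf_mgr_fd",
			      &intel_vgpu_dmabuf_mgr_fd_ops,
			      vgpu, O_RDWR | O_CLOEXEC);
	if (fd < 0) {
		vfio_device_put(device);
		return fd;
	}

	vgpu->vdev.vfio_device = device;
	vfio_fd.fd = fd;

	return copy_to_user((void __user *)arg, &vfio_fd, minsz) ?
		-EFAULT : 0;
}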

>
>> >> +
>> >> + return 0;
>> >> +}
>> >> +
>> >> +static long intel_vgpu_dmabuf_mgr_fd_ioctl(struct file *filp,
>> >> + unsigned int ioctl, unsigned long arg) {
>> >> + struct intel_vgpu *vgpu = filp->private_data;
>> >> + int minsz;
>> >> + struct intel_vgpu_dmabuf dmabuf;
>> >> + int ret;
>> >> + struct fd f;
>> >> + f = fdget(dmabuf.fd);
>> >> + minsz = offsetofend(struct intel_vgpu_dmabuf, tiled);
>> >> + if (copy_from_user(&dmabuf, (void __user *)arg, minsz))
>> >> + return -EFAULT;
>> >> + if (ioctl == INTEL_VGPU_QUERY_DMABUF)
>> >> + ret = intel_gvt_ops->vgpu_query_dmabuf(vgpu, &dmabuf);
>> >> + else if (ioctl == INTEL_VGPU_GENERATE_DMABUF)
>> >> + ret = intel_gvt_ops->vgpu_generate_dmabuf(vgpu,
>> >> +&dmabuf);
>> >
>> >Why do we need vendor specific ioctls here? Aren't querying the
>> >current plane and getting an fd for that plane very generic concepts?
>> >Is the resulting dmabuf Intel specific?
>> No, not Intel specific. As Gerd said, "Typical userspace workflow is
>> to first query the plane, then check if it already has a dma-buf for it,
>> and if not create one".
>> We first query the plane info (WITHOUT creating an fd).
>> User space then needs to check whether there is already a dmabuf for the
>> plane (user space usually caches two or three dmabufs to handle
>> double-buffering or triple-buffering); only if there is no dmabuf for the
>> plane do we create one via another ioctl.
>
>If our ioctls are "Query current plane" and "Give me a dmabuf for current plane",
>isn't that racey?
Yes, indeed there is such a potential problem. So when the user requests "give me a dmabuf for the current plane", a dmabuf is created for the "current" plane, which may not be the one that was queried before.
In the implementation I do not reuse the plane info obtained by the query but decode the current plane again to get up-to-date plane information. This may be less efficient, but it narrows the race window.
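
To make the flow concrete, the user-space side follows what Gerd described: query the plane first, then create a dmabuf only if none is cached. A rough sketch (lookup_cached_dmabuf()/cache_dmabuf() are just a hypothetical application-side cache; includes and detailed error handling are omitted):

static int get_guest_plane_dmabuf(int mgr_fd)
{
	struct intel_vgpu_dmabuf plane;
	int dmabuf_fd;

	memset(&plane, 0, sizeof(plane));

	/* Step 1: query the current plane; no fd is created here. */
	if (ioctl(mgr_fd, INTEL_VGPU_QUERY_DMABUF, &plane) < 0)
		return -1;

	/* Step 2: reuse a cached dmabuf for this plane if we have one. */
	dmabuf_fd = lookup_cached_dmabuf(&plane);
	if (dmabuf_fd >= 0)
		return dmabuf_fd;

	/* Step 3: otherwise create one.  The plane may have changed
	 * between the two ioctls, which is the race discussed here. */
	if (ioctl(mgr_fd, INTEL_VGPU_GENERATE_DMABUF, &plane) < 0)
		return -1;

	cache_dmabuf(&plane, plane.fd);
	return plane.fd;
}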

>The current plane could have changed between those two calls
>so the user doesn't absolutely know which plane the dmabuf retrieved is for. The
>"Give me a dmabuf" therefore needs to take some sort of plane index so the user
>can request a specific plane. The vendor driver might decide to return -EAGAIN if
>asking for the non-current plane, but we need that determinism built into the ioctl
>interface.
Adding an index alone cannot fix this problem.
If you check the whole v2 patch you will find that we have to decode the current plane to get the plane information.
I tried two methods while coding:

Method 1:
OP1: When the user issues the "query current plane" ioctl, the kernel decodes the current plane and returns the information to the user.
OP2: When the user issues the "give me a dmabuf for the current plane" ioctl, the kernel decodes the current plane again and uses the freshly decoded plane information to create the dma-buf.

Method 2:
OP1: When the user issues the "query current plane" ioctl, the kernel decodes the current plane, saves the result in the vgpu, and returns the plane information to the user.
OP2: When the user issues the "give me a dmabuf" ioctl, the kernel just uses the saved plane information to create the dmabuf.

While creating the dmabuf we need the plane's gma (graphics memory address, obtained while decoding the plane) to walk the GTT (Graphics Translation Table) and get the physical address of the plane.
In theory there is a possibility that the plane changes while we walk the GTT (there is an interval between decoding the plane and walking the GTT).

In testing, the two methods showed no observable difference.
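
In (pseudo) code the difference between the two methods is roughly the following; decode_current_plane(), create_plane_dmabuf() and the cached_plane field are only illustrative names for this sketch, not the real gvt functions:

/* Method 1: decode the plane again in the generate ioctl, so the
 * dmabuf always reflects the plane that is current at creation time. */
static int generate_dmabuf_m1(struct intel_vgpu *vgpu,
			      struct intel_vgpu_dmabuf *info)
{
	int ret = decode_current_plane(vgpu, info);	/* second decode */

	return ret ? ret : create_plane_dmabuf(vgpu, info);
}

/* Method 2: reuse the plane description that the query ioctl decoded
 * and saved in the vgpu (vgpu->vdev.cached_plane in this sketch). */
static int generate_dmabuf_m2(struct intel_vgpu *vgpu,
			      struct intel_vgpu_dmabuf *info)
{
	*info = vgpu->vdev.cached_plane;

	return create_plane_dmabuf(vgpu, info);
}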

>Thanks,
>
>Alex
>
>> >> + else {
>> >> + fdput(f);
>> >> + gvt_vgpu_err("unsupported dmabuf operation\n");
>> >> + return -EINVAL;
>> >> + }
>> >> +
>> >> + if (ret != 0) {
>> >> + fdput(f);
>> >> + gvt_vgpu_err("gvt-g get dmabuf failed:%d\n", ret);
>> >> + return -EINVAL;
>> >> + }
>> >> + fdput(f);
>> >> +
>> >> + return copy_to_user((void __user *)arg, &dmabuf, minsz) ?
>> >> +-EFAULT : 0; }
>> >> +
>> >> +static const struct file_operations intel_vgpu_dmabuf_mgr_fd_ops = {
>> >> + .release = intel_vgpu_dmabuf_mgr_fd_release,
>> >> + .unlocked_ioctl = intel_vgpu_dmabuf_mgr_fd_ioctl,
>> >> + .mmap = intel_vgpu_dmabuf_mgr_fd_mmap,
>> >> + .llseek = noop_llseek,
>> >> +};
>> >> +
>> >> static int intel_vgpu_create(struct kobject *kobj, struct mdev_device *mdev)
>> >> {
>> >> struct intel_vgpu *vgpu = NULL;
>> >> @@ -1259,6 +1317,31 @@ static long intel_vgpu_ioctl(struct mdev_device *mdev, unsigned int cmd,
>> >> } else if (cmd == VFIO_DEVICE_RESET) {
>> >> intel_gvt_ops->vgpu_reset(vgpu);
>> >> return 0;
>> >> + } else if (cmd == VFIO_DEVICE_GET_FD) {
>> >> + struct vfio_fd vfio_fd;
>> >> + int fd;
>> >> + struct vfio_device *device;
>> >> +
>> >> + minsz = offsetofend(struct vfio_fd, fd);
>> >> + if (copy_from_user(&vfio_fd, (void __user *)arg, minsz))
>> >> + return -EINVAL;
>> >> +
>> >> + if (vfio_fd.argsz < minsz)
>> >> + return -EINVAL;
>> >> +
>> >> + fd = anon_inode_getfd("vfio_dmabuf_mgr_fd",
>> >&intel_vgpu_dmabuf_mgr_fd_ops,
>> >> + vgpu, O_RDWR | O_CLOEXEC);
>> >> + if (fd < 0)
>> >> + return -EINVAL;
>> >> +
>> >> + vfio_fd.fd = fd;
>> >> + device = vfio_device_get_from_dev(mdev_dev(mdev));
>> >> + if (device == NULL)
>> >> + gvt_vgpu_err("kvmgt: vfio device is null\n");
>> >> + else
>> >> + vgpu->vdev.vfio_device = device;
>> >> +
>> >> + return copy_to_user((void __user *)arg, &vfio_fd,
>> >> + minsz) ? -EFAULT : 0;
>> >> }
>> >>
>> >> return 0;
>> >> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
>> >> index 519eff3..98be2e0 100644
>> >> --- a/include/uapi/linux/vfio.h
>> >> +++ b/include/uapi/linux/vfio.h
>> >> @@ -484,6 +485,20 @@ struct vfio_pci_hot_reset {
>> >>
>> >> #define VFIO_DEVICE_PCI_HOT_RESET _IO(VFIO_TYPE, VFIO_BASE + 13)
>> >>
>> >> +/**
>> >> + * VFIO_DEVICE_GET_FD - _IOW(VFIO_TYPE, VFIO_BASE + 21, struct
>> >> +vfio_fd)
>> >> + *
>> >> + * Create a fd for a vfio device.
>> >> + * This fd can be used for various purpose.
>> >> + */
>> >> +struct vfio_fd {
>> >> + __u32 argsz;
>> >> + __u32 flags;
>> >> + /* out */
>> >> + __u32 fd;
>> >> +};
>> >> +#define VFIO_DEVICE_GET_FD _IO(VFIO_TYPE, VFIO_BASE + 14)
>> >
>> >
>> >The idea was that we pass some sort of type to VFIO_DEVICE_GET_FD,
>> >for instance we might ask for a DEVICE_FD_GRAPHICS_DMABUF and the
>> >vfio bus driver (mdev vendor driver) would test whether it supports
>> >that type of thing and either return an fd or error. We can return
>> >the fd the same way we do for VFIO_DEVICE_GET_FD. For instance the user
>should do something like:
>> >
>> >dmabuf_fd = ioctl(device_fd,
>> >                  VFIO_DEVICE_GET_FD, DEVICE_FD_GRAPHICS_DMABUF);
>> >if (dmabuf_fd < 0)
>> >        /* not supported... */
>> >else
>> >        /* do stuff */
>> OK. Got it.
>>
>>
>> >
>> >Thanks,
>> >Alex
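
P.S. My understanding of the kernel side of that example is roughly the following; DEVICE_FD_GRAPHICS_DMABUF and the -ENOTTY choice are only placeholders, and the vfio_device reference handling discussed above is omitted for brevity:

} else if (cmd == VFIO_DEVICE_GET_FD) {
	/* arg carries the requested fd type. */
	switch (arg) {
	case DEVICE_FD_GRAPHICS_DMABUF:
		return anon_inode_getfd("vfio_dmabuf_mgr_fd",
					&intel_vgpu_dmabuf_mgr_fd_ops,
					vgpu, O_RDWR | O_CLOEXEC);
	default:
		return -ENOTTY;	/* type not supported by this vendor driver */
	}
}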