Re: [Qemu-devel] [RFC PATCH] vfio/pci: Add ioeventfd support
From: Auger Eric
Date: Thu Feb 08 2018 - 08:48:28 EST
Hi Alex,
On 07/02/18 17:57, Alex Williamson wrote:
> On Wed, 7 Feb 2018 16:46:19 +0100
> Auger Eric <eric.auger@xxxxxxxxxx> wrote:
>
>> Hi Alex,
>>
>> On 07/02/18 01:08, Alex Williamson wrote:
>>> The ioeventfd here is actually irqfd handling of an ioeventfd such as
>>> supported in KVM. A user is able to pre-program a device write to
>>> occur when the eventfd triggers. This is yet another instance of
>>> eventfd-irqfd triggering between KVM and vfio. The impetus for this
>>> is high frequency writes to pages which are virtualized in QEMU.
>>> Enabling this near-direct write path for selected registers within
>>> the virtualized page can improve performance and reduce overhead.
>>> Specifically this is initially targeted at NVIDIA graphics cards where
>>> the driver issues a write to an MMIO register within a virtualized
>>> region in order to allow the MSI interrupt to re-trigger.
>>>
>>> Signed-off-by: Alex Williamson <alex.williamson@xxxxxxxxxx>
>>
>> FYI it no longer applies on master (conflict in
>> include/uapi/linux/vfio.h with the GFX stuff)
>
> Sorry, I should have noted that this was against v4.15, I didn't want
> the churn of the merge window since I was benchmarking against this.
> Will update for non-RFC.
>
> ...
>>> +long vfio_pci_ioeventfd(struct vfio_pci_device *vdev, loff_t offset,
>>> + uint64_t data, int count, int fd)
>>> +{
>>> + struct pci_dev *pdev = vdev->pdev;
>>> + loff_t pos = offset & VFIO_PCI_OFFSET_MASK;
>>> + int ret, bar = VFIO_PCI_OFFSET_TO_INDEX(offset);
>>> + struct vfio_pci_ioeventfd *ioeventfd;
>>> + int (*handler)(void *, void *);
>>> + unsigned long val;
>>> +
>>> + /* Only support ioeventfds into BARs */
>>> + if (bar > VFIO_PCI_BAR5_REGION_INDEX)
>>> + return -EINVAL;
>>> +
>>> + if (pos + count > pci_resource_len(pdev, bar))
>>> + return -EINVAL;
>>> +
>>> + /* Disallow ioeventfds working around MSI-X table writes */
>>> + if (bar == vdev->msix_bar &&
>>> + !(pos + count <= vdev->msix_offset ||
>>> + pos >= vdev->msix_offset + vdev->msix_size))
>>> + return -EINVAL;
>>> +
>>> + switch (count) {
>>> + case 1:
>>> + handler = &vfio_pci_ioeventfd_handler8;
>>> + val = data;
>>> + break;
>>> + case 2:
>>> + handler = &vfio_pci_ioeventfd_handler16;
>>> + val = le16_to_cpu(data);
>>> + break;
>>> + case 4:
>>> + handler = &vfio_pci_ioeventfd_handler32;
>>> + val = le32_to_cpu(data);
>>> + break;
>>> +#ifdef iowrite64
>>> + case 8:
>>> + handler = &vfio_pci_ioeventfd_handler64;
>>> + val = le64_to_cpu(data);
>>> + break;
>>> +#endif
>>> + default:
>>> + return -EINVAL;
>>> + }
>>> +
>>> + ret = vfio_pci_setup_barmap(vdev, bar);
>>> + if (ret)
>>> + return ret;
>>> +
>>> + mutex_lock(&vdev->ioeventfds_lock);
>>> +
>>> + list_for_each_entry(ioeventfd, &vdev->ioeventfds_list, next) {
>>> + if (ioeventfd->pos == pos && ioeventfd->bar == bar &&
>>> + ioeventfd->data == data && ioeventfd->count == count) {
>>> + if (fd == -1) {
>>> + vfio_virqfd_disable(&ioeventfd->virqfd);
>>> + list_del(&ioeventfd->next);
>>> + kfree(ioeventfd);
>>> + ret = 0;
>>> + } else
>>> + ret = -EEXIST;
>>> +
>>> + goto out_unlock;
>>> + }
>>> + }
>>> +
>>> + if (fd < 0) {
>>> + ret = -ENODEV;
>>> + goto out_unlock;
>>> + }
>>> +
>>> + ioeventfd = kzalloc(sizeof(*ioeventfd), GFP_KERNEL);
>>> + if (!ioeventfd) {
>>> + ret = -ENOMEM;
>>> + goto out_unlock;
>>> + }
>>> +
>>> + ioeventfd->pos = pos;
>>> + ioeventfd->bar = bar;
>>> + ioeventfd->data = data;
>>> + ioeventfd->count = count;
>>> +
>>> + ret = vfio_virqfd_enable(vdev->barmap[ioeventfd->bar] + ioeventfd->pos,
>> nit: bar and pos could be used directly
>
> Indeed, probably leftover from development. Fixed and re-wrapped the
> following lines.
>
>>> + handler, NULL, (void *)val,
>>> + &ioeventfd->virqfd, fd);
>>> + if (ret) {
>>> + kfree(ioeventfd);
>>> + goto out_unlock;
>>> + }
>>> +
>>> + list_add(&ioeventfd->next, &vdev->ioeventfds_list);
>>> +
>>> +out_unlock:
>>> + mutex_unlock(&vdev->ioeventfds_lock);
>>> +
>>> + return ret;
>>> +}
>>> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
>>> index e3301dbd27d4..07966a5f0832 100644
>>> --- a/include/uapi/linux/vfio.h
>>> +++ b/include/uapi/linux/vfio.h
>>> @@ -503,6 +503,30 @@ struct vfio_pci_hot_reset {
>>>
>>> #define VFIO_DEVICE_PCI_HOT_RESET _IO(VFIO_TYPE, VFIO_BASE + 13)
>>>
>>> +/**
>>> + * VFIO_DEVICE_IOEVENTFD - _IOW(VFIO_TYPE, VFIO_BASE + 14,
>>> + * struct vfio_device_ioeventfd)
>>> + *
>>> + * Perform a write to the device at the specified device fd offset, with
>>> + * the specified data and width when the provided eventfd is triggered.
>> Don't you want to document the limitation to BARs only, excluding the
>> MSI-X table?
>
> Sure, we could. This is also just an acceleration interface, it's not
> required for functionality, so a user can probe the capabilities by
> trying to enable an ioeventfd to test for support. I don't really want
> to add a flag to each region to identify support or create yet another
> sparse map identifying which sections of regions are supported. We
> could enable this for PCI config space, but I didn't think it really
> made sense (and I was too lazy). Real PCI config space (not MMIO
> mirrors of config space on GPUs) should never be a performance path,
> therefore I question if there's any ROI for the code to support it.
> Device-specific regions would need to be handled on a case-by-case
> basis, and I don't think we have any cases yet where it makes sense
> (IIRC the IGD regions are read-only). Of course ROMs are read-only, so
> it doesn't make sense to support them.
>
> I also note that this patch of course only supports directly assigned
> vfio-pci devices, not vfio-platform and not mdev devices. Since the
> ioctl is intended to be device agnostic, maybe we could have a default
> handler that simply uses the device file write interface. At least one
> issue there is that those expect a user buffer. I'll look into how I
> might add the support more generically, if perhaps less optimized.
OK
>
> Does it seem like a sufficient user API to try an ioeventfd and be
> prepared for it to fail? Given that this is a new ioctl, users need to
> do that for backwards compatibility anyway.
looks OK to me.
>
> Something I forgot to mention in my rush to send this out is that I
> debated whether to create a new ioctl or re-use the SET_IRQS ioctl. In
> the end, I couldn't resolve how to handle the index, start, and count
> fields, so I created this new ioctl. Would we create an ioeventfd
> index? We couldn't simply pass an array of ioeventfds since each needs
> to be associated with an offset, data, and size, so we'd need a new
> data type with a structure encompassing those fields. In the end it
> obviously didn't make sense to me to re-use SET_IRQS, but if others can
> come up with something that makes sense, I'm open to it. Thanks,
As far as I am concerned, I prefer this API over the SET_IRQS one ;-)
Thanks
Eric
>
> Alex
>