On Thu, 22 Aug 2019 23:34:43 +0800
Ben Luo <luoben@xxxxxxxxxxxxxxxxx> wrote:
When userspace (e.g. qemu) triggers a switch between KVMI think this is a user visible change. Previously the vector is
irqfd and userspace eventfd, only dev_id of irq action
(i.e. the "trigger" in this patch's context) will be
changed, but a free-then-request-irq action is taken in
current code. And, irq affinity setting in VM will also
trigger a free-then-request-irq action, which actually
changes nothing, but only fires a producer re-registration
to update irte in case that posted-interrupt is in use.
This patch makes use of irq_update_devid() and optimize
both cases above, which reduces the risk of losing interrupt
and also cuts some overhead.
Signed-off-by: Ben Luo <luoben@xxxxxxxxxxxxxxxxx>
---
drivers/vfio/pci/vfio_pci_intrs.c | 112 +++++++++++++++++++++++++-------------
1 file changed, 74 insertions(+), 38 deletions(-)
diff --git a/drivers/vfio/pci/vfio_pci_intrs.c b/drivers/vfio/pci/vfio_pci_intrs.c
index 3fa3f72..60d3023 100644
--- a/drivers/vfio/pci/vfio_pci_intrs.c
+++ b/drivers/vfio/pci/vfio_pci_intrs.c
@@ -284,70 +284,106 @@ static int vfio_msi_enable(struct vfio_pci_device *vdev, int nvec, bool msix)
static int vfio_msi_set_vector_signal(struct vfio_pci_device *vdev,
int vector, int fd, bool msix)
{
+ struct eventfd_ctx *trigger = NULL;
struct pci_dev *pdev = vdev->pdev;
- struct eventfd_ctx *trigger;
int irq, ret;
if (vector < 0 || vector >= vdev->num_ctx)
return -EINVAL;
+ if (fd >= 0) {
+ trigger = eventfd_ctx_fdget(fd);
+ if (IS_ERR(trigger))
+ return PTR_ERR(trigger);
+ }
disabled first, then if an error occurs re-enabling, we return an errno
with the vector disabled. Here we instead fail the ioctl and leave the
state as if it had never happened. For instance with QEMU, if they
were trying to change from KVM to userspace signaling and entered this
condition, previously the interrupt would signal to neither eventfd, now
it would continue signaling to KVM. If QEMU's intent was to emulate
vector masking, this could induce unhandled interrupts in the guest.
Maybe we need a tear-down on fault here to maintain that behavior, or
do you see some justification for the change?
Thanks for pointing it out, I will fix this in next version.
+I think we leak the fd context we acquired above in this case.
irq = pci_irq_vector(pdev, vector);
+ /*
+ * For KVM-VFIO case, interrupt from passthrough device will be directly
+ * delivered to VM after producer and consumer connected successfully.
+ * If producer and consumer are disconnected, this interrupt process
+ * will fall back to remap mode, where interrupt handler uses 'trigger'
+ * to find the right way to deliver the interrupt to VM. So, it is safe
+ * to do irq_update_devid() before irq_bypass_unregister_producer() which
+ * switches interrupt process to remap mode. To producer and consumer,
+ * 'trigger' is only a token used for pairing them togather.
+ */
if (vdev->ctx[vector].trigger) {
- free_irq(irq, vdev->ctx[vector].trigger);
- irq_bypass_unregister_producer(&vdev->ctx[vector].producer);
- kfree(vdev->ctx[vector].name);
- eventfd_ctx_put(vdev->ctx[vector].trigger);
- vdev->ctx[vector].trigger = NULL;
+ if (vdev->ctx[vector].trigger == trigger) {
+ /* switch back to remap mode */
+ irq_bypass_unregister_producer(&vdev->ctx[vector].producer);
Why do we do anything in this case, couldn't we just 'put' the extra ctx
and return 0 here?
Actually, I have explained this in comments above this whole control block. I think it is safe and better to
+ } else if (trigger) {Can you explain this ordering, I would have expected that we'd
+ ret = irq_update_devid(irq,
+ vdev->ctx[vector].trigger, trigger);
+ if (unlikely(ret)) {
+ dev_info(&pdev->dev,
+ "update devid of %d (token %p) failed: %d\n",
+ irq, vdev->ctx[vector].trigger, ret);
+ eventfd_ctx_put(trigger);
+ return ret;
+ }
+ irq_bypass_unregister_producer(&vdev->ctx[vector].producer);
unregister the bypass before we updated the devid. Thanks,
Alex