Re: [PATCH V3 09/10] vfio/pci: Support dynamic MSI-X

From: Alex Williamson
Date: Wed Apr 19 2023 - 17:39:04 EST


On Wed, 19 Apr 2023 11:13:29 -0700
Reinette Chatre <reinette.chatre@xxxxxxxxx> wrote:

> Hi Alex,
>
> On 4/18/2023 3:38 PM, Alex Williamson wrote:
> > On Tue, 18 Apr 2023 10:29:20 -0700
> > Reinette Chatre <reinette.chatre@xxxxxxxxx> wrote:
> >
> >> Recently introduced pci_msix_alloc_irq_at() and pci_msix_free_irq()
> >> enables an individual MSI-X interrupt to be allocated and freed after
> >> MSI-X enabling.
> >>
> >> Use dynamic MSI-X (if supported by the device) to allocate an interrupt
> >> after MSI-X is enabled. An MSI-X interrupt is dynamically allocated at
> >> the time a valid eventfd is assigned. This is different behavior from
> >> a range provided during MSI-X enabling where interrupts are allocated
> >> for the entire range whether a valid eventfd is provided for each
> >> interrupt or not.
> >>
> >> Do not dynamically free interrupts, leave that to when MSI-X is
> >> disabled.
> >
> > But we do, sometimes, even if essentially only on the error path. Is
> > that worthwhile? It seems like we could entirely remove
> > vfio_msi_free_irq() and rely only on pci_free_irq_vectors() on MSI/X
> > teardown.
>
> Yes, it is only on the error path where dynamic MSI-X interrupts are
> removed. I do not know how to determine if it is worthwhile. On the
> kernel side failure seems unlikely since it would mean memory cannot
> be allocated or request_irq() failed. In these cases it may not be
> worthwhile since user space may try again and having the interrupt
> already allocated would be helpful. The remaining error seems to be
> if user space provided an invalid eventfd. An allocation in response
> to wrong user input is a concern to me. Should we consider
> buggy/malicious users? I am uncertain here so would defer to your
> guidance.

I don't really see that a malicious user can exploit anything here,
their irq allocation is bound by the device support and they're
entitled to make use of the full vector set of the device by virtue of
having ownership of the device. All the MSI-X allocated irqs are freed
when the interrupt mode is changed or the device is released regardless.

The end result is also no different than if the user had not configured
the vector when enabling MSI-X or configured it and later de-configured
with a -1 eventfd. The irq is allocated but not attached to a ctx.
We're intentionally using this as a cache.

Also, as implemented here in v3, we might be freeing from the original
allocation rather than a new, dynamically allocated irq.

My thinking is that if we think there's a benefit to caching any
allocated irqs, we should do so consistently. We don't currently know
if the irq was allocated now or previously. Tracking that would add
complexity for little benefit. The user can get to the same end result
of an allocated, unused irq in numerous way, the state itself is not
erroneous, and is actually in support of caching irq allocations.
Removing the de-allocation of a single vector further simplifies the
code as there exists only one path where irqs are freed, ie.
pci_free_irq_vectors().

So I'd lean towards removing vfio_msi_free_irq().

> > I'd probably also add a comment in the commit log about the theory
> > behind not dynamically freeing irqs, ie. latency, reliability, and
> > whatever else we used to justify it. Thanks,
>
> Sure. How about something like below to replace the final sentence of
> the changelog:
>
> "When a guest disables an interrupt, user space (Qemu) does not
> disable the interrupt but instead assigns it a different trigger. A
> common flow is thus for the VFIO_DEVICE_SET_IRQS ioctl() to be
> used to assign a new eventfd to an already enabled interrupt. Freeing
> and re-allocating an interrupt in this scenario would add unnecessary
> latency as well as uncertainty since the re-allocation may fail. Do
> not dynamically free interrupts when an interrupt is disabled, instead
> support a subsequent re-enable to draw from the initial allocation when
> possible. Leave freeing of interrupts to when MSI-X is disabled."

There are other means besides caching irqs that could achieve the above,
for instance if a trigger is simply swapped from one eventfd to another,
that all happens within vfio_msi_set_vector_signal() where we could
hold the irq for the transition.

I think I might justify it as:

The PCI-MSIX API requires that some number of irqs are
allocated for an initial set of vectors when enabling MSI-X on
the device. When dynamic MSIX allocation is not supported, the
vector table, and thus the allocated irq set can only be resized
by disabling and re-enabling MSIX with a different range. In
that case the irq allocation is essentially a cache for
configuring vectors within the previously allocated vector
range. When dynamic MSIX allocation is supported, the API
still requires some initial set of irqs to be allocated, but
also supports allocating and freeing specific irq vectors both
within and beyond the initially allocated range.

For consistency between modes, as well as to reduce latency and
improve reliability of allocations, and also simplicity, this
implementation only releases irqs via pci_free_irq_vectors()
when either the interrupt mode changes or the device is
released.

Does that cover the key points for someone that might want to revisit
this decision later? Thanks,

Alex