Re: [Xen-devel] [PATCH] xen: do not re-use pirq number cached in pci device msi msg data

From: Dan Streetman
Date: Tue Jan 10 2017 - 13:43:17 EST


On Tue, Jan 10, 2017 at 10:57 AM, Dan Streetman <ddstreet@xxxxxxxx> wrote:
> On Mon, Jan 9, 2017 at 2:30 PM, Stefano Stabellini
> <sstabellini@xxxxxxxxxx> wrote:
>> On Mon, 9 Jan 2017, Konrad Rzeszutek Wilk wrote:
>>> On Mon, Jan 09, 2017 at 10:42:41AM -0500, Dan Streetman wrote:
>>> > On Mon, Jan 9, 2017 at 9:59 AM, Boris Ostrovsky
>>> > <boris.ostrovsky@xxxxxxxxxx> wrote:
>>> > > On 01/06/2017 08:06 PM, Konrad Rzeszutek Wilk wrote:
>>> > >> On Thu, Jan 05, 2017 at 02:28:56PM -0500, Dan Streetman wrote:
>>> > >>> Do not read a pci device's msi message data to see if a pirq was
>>> > >>> previously configured for the device's msi/msix, as the old pirq was
>>> > >>> unmapped and may now be in use by another pci device. The previous
>>> > >>> pirq should never be re-used; instead a new pirq should always be
>>> > >>> allocated from the hypervisor.
>>> > >> Won't this cause a starvation problem? That is we will run out of PIRQs
>>> > >> as we are not reusing them?
>>> > >
>>> > > Don't we free the pirq when we unmap it?
>>> >
>>> > I think this is actually a bit worse than I initially thought. After
>>> > looking a bit closer, and I think there's an asymmetry with pirq
>>> > allocation:
>>>
>>> Lets include Stefano,
>>>
>>> Thank you for digging in this! This has quite the deja-vu
>>> feeling as I believe I hit this at some point in the past and
>>> posted some possible ways of fixing this. But sadly I can't
>>> find the thread.
>>
>> This issue seems to be caused by:
>>
>> commit af42b8d12f8adec6711cb824549a0edac6a4ae8f
>> Author: Stefano Stabellini <stefano.stabellini@xxxxxxxxxxxxx>
>> Date: Wed Dec 1 14:51:44 2010 +0000
>>
>> xen: fix MSI setup and teardown for PV on HVM guests
>>
>> which was a fix to a bug:
>>
>> This fixes a bug in xen_hvm_setup_msi_irqs that manifests itself when
>> trying to enable the same MSI for the second time: the old MSI to pirq
>> mapping is still valid at this point but xen_hvm_setup_msi_irqs would
>> try to assign a new pirq anyway.
>> A simple way to reproduce this bug is to assign an MSI capable network
>> card to a PV on HVM guest, if the user brings down the corresponding
>> ethernet interface and up again, Linux would fail to enable MSIs on the
>> device.
>>
>> I don't remember any of the details. From the description of this bug,
>> it seems that Xen changed behavior in the past few years: before it used
>> to keep the pirq-MSI mapping, while today it doesn't. If I wrote "the
>> old MSI to pirq mapping is still valid at this point", the pirq couldn't
>> have been completely freed, then reassigned to somebody else the way it
>> is described in this email.
>>
>> I think we should indentify the changeset or Xen version that introduced
>> the new behavior. If it is old enough, we might be able to just revert
>> af42b8d12f8adec6711cb824549a0edac6a4ae8f. Otherwise we could make the
>> behavior conditional to the Xen version.
>
> Are PT devices the only MSI-capable devices available in a Xen guest?
> That's where I'm seeing this problem, with PT NVMe devices.
>
> I can say that on the Xen guest with NVMe PT devices I'm testing on,
> with the patch from this thread (which essentially reverts your commit
> above) as well as some added debug to see the pirq numbers, cycles of
> 'modprobe nvme ; rmmod nvme' don't cause pirq starvation, as the
> hypervisor provides essentially the same pirqs for each modprobe,
> since they were freed by the rmmod.
>
>>
>>
>>> > tl;dr:
>>> >
>>> > pci_enable_msix_range() -> each MSIX (or MSI) now has a pirq
>>> > allocated, and reserved in the hypervisor
>>> >
>>> > request_irq() -> an event channel is opened for the specific pirq, and
>>> > maps the pirq with the hypervisor
>>> >
>>> > free_irq() -> the event channel is closed, and the pirq is unmapped,
>>> > but that unmap function also frees the pirq! The hypervisor can/will
>>> > give it away to the next call to get_free_pirq. However, the pci
>>> > msi/msix data area still contains the pirq number, and the next call
>>> > to request_irq() will re-use the pirq.
>>> >
>>> > pci_disable_msix() -> this has no effect on the pirq in the hypervisor
>>> > (it's already been freed), and it also doesn't clear anything from the
>>> > msi/msix data area, so the pirq is still cached there.
>>> >
>>> >
>>> > It seems like the hypervisor needs to be fixed to *not* unmap the pirq
>>> > when the event channel is closed - or at least, not to change it to
>>> > IRQ_UNBOUND state? And, the pci_disable_msix call should eventually
>>> > call into something in the Xen guest kernel that actually does the
>>> > pirq unmapping, and clear it from the msi data area (i.e.
>>> > pci_write_msi_msg)
>>>
>>> The problem is that Xen changes have sailed a long long time ago.
>>> >
>>> > Alternately, if the hypervisor doesn't change, then the call into the
>>> > hypervisor to actually allocate a pirq needs to move from the
>>> > pci_enable_msix_range() call to the request_irq() call? So that when
>>> > the msi/msix range is enabled, it doesn't actually reserve any pirq's
>>> > for any of the vectors; each request_irq/free_irq pair do the pirq
>>> > allocate-and-map/unmap...
>>>
>>>
>>> Or a third one: We keep an pirq->device lookup and inhibit free_irq()
>>> from actually calling evtchn_close() until the pci_disable_msix() is done?
>>
>> I think that's a reasonable alternative: we mask the evtchn, but do not
>> call xen_evtchn_close in shutdown_pirq for PV on HVM guests.
>> Something like (not compiled, untested):
>>
>> diff --git a/drivers/xen/events/events_base.c b/drivers/xen/events/events_base.c
>> index 137bd0e..3174923 100644
>> --- a/drivers/xen/events/events_base.c
>> +++ b/drivers/xen/events/events_base.c
>> @@ -577,7 +577,8 @@ static void shutdown_pirq(struct irq_data *data)
>> return;
>>
>> mask_evtchn(evtchn);
>> - xen_evtchn_close(evtchn);
>> + if (!xen_hvm_domain())
>> + xen_evtchn_close(evtchn);
>
> wouldn't we need to also add code to the pci_disable_msix path that
> would actually close the evtchn? Would this leave the evtchn around
> forever?
>
>> xen_irq_info_cleanup(info);
>> }
>>
>>
>> We want to keep the pirq allocated to the device - not closing the
>> evtchn seems like the right thing to do. I suggest we test for the
>> original bug too: enable/disable the network interface of an MSI capable
>> network card.
>
> I don't have a Xen hypervisor setup myself, I'm just using AWS, but
> I'll try to test this on an instance with a SRIOV/PT nic.

I started an instance with SRIOV nics, with the latest upstream kernel
plus this patch and debug to show the pirqs. Taking an interface down
and back up uses the same pirq, because the driver only does
free_irq/request_irq, which just closes/reopens the evtchn using the
same pirq (which is cached in the struct irq_info) - it doesn't
disable/reenable the MSIX vectors. Doing a complete rmmod/modprobe of
the driver does disable and reenable the MSIX vectors, but the
hypervisor provides the same pirqs from the get_free_pirq() call that
were used before (since nothing else is asking for free pirqs).

>
>>
>>
>>> > longer details:
>>> >
>>> > The chain of function calls starts in the initial call to configure
>>> > the msi vectors, which eventually calls __pci_enable_msix_range (or
>>> > msi_) which then eventually reaches xen_hvm_setup_msi_irqs(), which
>>> > either tries to re-use any cached pirq in the MSI data area, or (for
>>> > the first time setup) allocates a new pirq from the hypervisor via
>>> > PHYSDEVOP_get_free_pirq. That pirq is then reserved from the
>>> > hypervisor's perspective, but it's not mapped to anything in the guest
>>> > kernel.
>>> >
>>> > Then, the driver calls request_irq to actually start using the irq,
>>> > which calls __setup_irq to irq_startup to startup_pirq. The
>>> > startup_pirq call actually creates the evtchn and binds it to the
>>> > previously allocated pirq via EVTCHNOP_bind_pirq.
>>> >
>>> > At this point, the pirq is bound to a guest kernel evtchn (and irq)
>>> > and is in use. But then, when the driver doesn't want it anymore, it
>>> > calls free_irq, and that calls irq_shutdown to shutdown_pirq; and that
>>> > function closes the evtchn via EVTCHNOP_close.
>>> >
>>> > Inside the hypervisor, in xen/common/event_channel.c in
>>> > evtchn_close(), if the channel is type ECS_PIRQ (which our pirq
>>> > channel is) then it unmaps the pirq mapping via
>>> > unmap_domain_pirq_emuirq. This unmaps the pirq, but also puts it back
>>> > to state IRQ_UNBOUND, which makes it available for the hypervisor to
>>> > give away to anyone requesting a new pirq!
>>> >
>>> >
>>> >
>>> >
>>> >
>>> > >
>>> > > -boris
>>> > >
>>> > >>> The xen_hvm_setup_msi_irqs() function currently checks the pci device's
>>> > >>> msi descriptor message data for each msi/msix vector it sets up, and if
>>> > >>> it finds the vector was previously configured with a pirq, and that pirq
>>> > >>> is mapped to an irq, it re-uses the pirq instead of requesting a new pirq
>>> > >>> from the hypervisor. However, that pirq was unmapped when the pci device
>>> > >>> disabled its msi/msix, and it cannot be re-used; it may have been given
>>> > >>> to a different pci device.
>>> > >> Hm, but this implies that we do keep track of it.
>>> > >>
>>> > >>
>>> > >> while (true)
>>> > >> do
>>> > >> rmmod nvme
>>> > >> modprobe nvme
>>> > >> done
>>> > >>
>>> > >> Things go boom without this patch. But with this patch does this
>>> > >> still work? As in we don't run out of PIRQs?
>>> > >>
>>> > >> Thanks.
>>> > >>> This exact situation is happening in a Xen guest where multiple NVMe
>>> > >>> controllers (pci devices) are present. The NVMe driver configures each
>>> > >>> pci device's msi/msix twice; first to configure a single vector (to
>>> > >>> talk to the controller for its configuration info), and then it disables
>>> > >>> that msi/msix and re-configures with all the msi/msix it needs. When
>>> > >>> multiple NVMe controllers are present, this happens concurrently on all
>>> > >>> of them, and in the time between controller A calling pci_disable_msix()
>>> > >>> and then calling pci_enable_msix_range(), controller B enables its msix
>>> > >>> and gets controller A's pirq allocated from the hypervisor. Then when
>>> > >>> controller A re-configures its msix, its first vector tries to re-use
>>> > >>> the same pirq that it had before; but that pirq was allocated to
>>> > >>> controller B, and thus the Xen event channel for controller A's re-used
>>> > >>> pirq fails to map its irq to that pirq; the hypervisor already has the
>>> > >>> pirq mapped elsewhere.
>>> > >>>
>>> > >>> Signed-off-by: Dan Streetman <dan.streetman@xxxxxxxxxxxxx>
>>> > >>> ---
>>> > >>> arch/x86/pci/xen.c | 23 +++++++----------------
>>> > >>> 1 file changed, 7 insertions(+), 16 deletions(-)
>>> > >>>
>>> > >>> diff --git a/arch/x86/pci/xen.c b/arch/x86/pci/xen.c
>>> > >>> index bedfab9..a00a6c0 100644
>>> > >>> --- a/arch/x86/pci/xen.c
>>> > >>> +++ b/arch/x86/pci/xen.c
>>> > >>> @@ -234,23 +234,14 @@ static int xen_hvm_setup_msi_irqs(struct pci_dev *dev, int nvec, int type)
>>> > >>> return 1;
>>> > >>>
>>> > >>> for_each_pci_msi_entry(msidesc, dev) {
>>> > >>> - __pci_read_msi_msg(msidesc, &msg);
>>> > >>> - pirq = MSI_ADDR_EXT_DEST_ID(msg.address_hi) |
>>> > >>> - ((msg.address_lo >> MSI_ADDR_DEST_ID_SHIFT) & 0xff);
>>> > >>> - if (msg.data != XEN_PIRQ_MSI_DATA ||
>>> > >>> - xen_irq_from_pirq(pirq) < 0) {
>>> > >>> - pirq = xen_allocate_pirq_msi(dev, msidesc);
>>> > >>> - if (pirq < 0) {
>>> > >>> - irq = -ENODEV;
>>> > >>> - goto error;
>>> > >>> - }
>>> > >>> - xen_msi_compose_msg(dev, pirq, &msg);
>>> > >>> - __pci_write_msi_msg(msidesc, &msg);
>>> > >>> - dev_dbg(&dev->dev, "xen: msi bound to pirq=%d\n", pirq);
>>> > >>> - } else {
>>> > >>> - dev_dbg(&dev->dev,
>>> > >>> - "xen: msi already bound to pirq=%d\n", pirq);
>>> > >>> + pirq = xen_allocate_pirq_msi(dev, msidesc);
>>> > >>> + if (pirq < 0) {
>>> > >>> + irq = -ENODEV;
>>> > >>> + goto error;
>>> > >>> }
>>> > >>> + xen_msi_compose_msg(dev, pirq, &msg);
>>> > >>> + __pci_write_msi_msg(msidesc, &msg);
>>> > >>> + dev_dbg(&dev->dev, "xen: msi bound to pirq=%d\n", pirq);
>>> > >>> irq = xen_bind_pirq_msi_to_irq(dev, msidesc, pirq,
>>> > >>> (type == PCI_CAP_ID_MSI) ? nvec : 1,
>>> > >>> (type == PCI_CAP_ID_MSIX) ?
>>> > >>> --
>>> > >>> 2.9.3
>>> > >>>
>>> > >
>>> > >
>>> > > _______________________________________________
>>> > > Xen-devel mailing list
>>> > > Xen-devel@xxxxxxxxxxxxx
>>> > > https://lists.xen.org/xen-devel
>>>