Re: [Xen-devel] [PATCH] xen: do not re-use pirq number cached in pci device msi msg data

From: Stefano Stabellini
Date: Wed Jan 11 2017 - 13:46:44 EST


On Wed, 11 Jan 2017, Dan Streetman wrote:
> On Tue, Jan 10, 2017 at 8:25 PM, Stefano Stabellini
> <sstabellini@xxxxxxxxxx> wrote:
> > On Tue, 10 Jan 2017, Dan Streetman wrote:
> >> On Tue, Jan 10, 2017 at 2:03 PM, Stefano Stabellini
> >> <sstabellini@xxxxxxxxxx> wrote:
> >> > On Tue, 10 Jan 2017, Dan Streetman wrote:
> >> >> On Tue, Jan 10, 2017 at 10:57 AM, Dan Streetman <ddstreet@xxxxxxxx> wrote:
> >> >> > On Mon, Jan 9, 2017 at 2:30 PM, Stefano Stabellini
> >> >> > <sstabellini@xxxxxxxxxx> wrote:
> >> >> >> On Mon, 9 Jan 2017, Konrad Rzeszutek Wilk wrote:
> >> >> >>> On Mon, Jan 09, 2017 at 10:42:41AM -0500, Dan Streetman wrote:
> >> >> >>> > On Mon, Jan 9, 2017 at 9:59 AM, Boris Ostrovsky
> >> >> >>> > <boris.ostrovsky@xxxxxxxxxx> wrote:
> >> >> >>> > > On 01/06/2017 08:06 PM, Konrad Rzeszutek Wilk wrote:
> >> >> >>> > >> On Thu, Jan 05, 2017 at 02:28:56PM -0500, Dan Streetman wrote:
> >> >> >>> > >>> Do not read a pci device's msi message data to see if a pirq was
> >> >> >>> > >>> previously configured for the device's msi/msix, as the old pirq was
> >> >> >>> > >>> unmapped and may now be in use by another pci device. The previous
> >> >> >>> > >>> pirq should never be re-used; instead a new pirq should always be
> >> >> >>> > >>> allocated from the hypervisor.
> >> >> >>> > >> Won't this cause a starvation problem? That is we will run out of PIRQs
> >> >> >>> > >> as we are not reusing them?
> >> >> >>> > >
> >> >> >>> > > Don't we free the pirq when we unmap it?
> >> >> >>> >
> >> >> >>> > I think this is actually a bit worse than I initially thought. After
> >> >> >>> > looking a bit closer, and I think there's an asymmetry with pirq
> >> >> >>> > allocation:
> >> >> >>>
> >> >> >>> Lets include Stefano,
> >> >> >>>
> >> >> >>> Thank you for digging in this! This has quite the deja-vu
> >> >> >>> feeling as I believe I hit this at some point in the past and
> >> >> >>> posted some possible ways of fixing this. But sadly I can't
> >> >> >>> find the thread.
> >> >> >>
> >> >> >> This issue seems to be caused by:
> >> >> >>
> >> >> >> commit af42b8d12f8adec6711cb824549a0edac6a4ae8f
> >> >> >> Author: Stefano Stabellini <stefano.stabellini@xxxxxxxxxxxxx>
> >> >> >> Date: Wed Dec 1 14:51:44 2010 +0000
> >> >> >>
> >> >> >> xen: fix MSI setup and teardown for PV on HVM guests
> >> >> >>
> >> >> >> which was a fix to a bug:
> >> >> >>
> >> >> >> This fixes a bug in xen_hvm_setup_msi_irqs that manifests itself when
> >> >> >> trying to enable the same MSI for the second time: the old MSI to pirq
> >> >> >> mapping is still valid at this point but xen_hvm_setup_msi_irqs would
> >> >> >> try to assign a new pirq anyway.
> >> >> >> A simple way to reproduce this bug is to assign an MSI capable network
> >> >> >> card to a PV on HVM guest, if the user brings down the corresponding
> >> >> >> ethernet interface and up again, Linux would fail to enable MSIs on the
> >> >> >> device.
> >> >> >>
> >> >> >> I don't remember any of the details. From the description of this bug,
> >> >> >> it seems that Xen changed behavior in the past few years: before it used
> >> >> >> to keep the pirq-MSI mapping, while today it doesn't. If I wrote "the
> >> >> >> old MSI to pirq mapping is still valid at this point", the pirq couldn't
> >> >> >> have been completely freed, then reassigned to somebody else the way it
> >> >> >> is described in this email.
> >> >> >>
> >> >> >> I think we should indentify the changeset or Xen version that introduced
> >> >> >> the new behavior. If it is old enough, we might be able to just revert
> >> >> >> af42b8d12f8adec6711cb824549a0edac6a4ae8f. Otherwise we could make the
> >> >> >> behavior conditional to the Xen version.
> >> >> >
> >> >> > Are PT devices the only MSI-capable devices available in a Xen guest?
> >> >> > That's where I'm seeing this problem, with PT NVMe devices.
> >> >
> >> > They are the main ones. It is possible to have emulated virtio devices
> >> > with emulated MSIs, for example virtio-net. Althought they are not in
> >> > the Xen Project CI-loop, so I wouldn't be surprised if they are broken
> >> > too.
> >> >
> >> >
> >> >> > I can say that on the Xen guest with NVMe PT devices I'm testing on,
> >> >> > with the patch from this thread (which essentially reverts your commit
> >> >> > above) as well as some added debug to see the pirq numbers, cycles of
> >> >> > 'modprobe nvme ; rmmod nvme' don't cause pirq starvation, as the
> >> >> > hypervisor provides essentially the same pirqs for each modprobe,
> >> >> > since they were freed by the rmmod.
> >> >
> >> > I am fine with reverting the old patch, but we need to understand what
> >> > caused the change in behavior first. Maybe somebody else with a Xen PCI
> >> > passthrough setup at hand can help with that.
> >> >
> >> > In the Xen code I can still see:
> >> >
> >> > case ECS_PIRQ: {
> >> > struct pirq *pirq = pirq_info(d1, chn1->u.pirq.irq);
> >> >
> >> > if ( !pirq )
> >> > break;
> >> > if ( !is_hvm_domain(d1) )
> >> > pirq_guest_unbind(d1, pirq);
> >> >
> >> > which means that pirq_guest_unbind should only be called on evtchn_close
> >> > if the guest is not an HVM guest.
> >>
> >> I tried an experiment to call get_free_pirq on both sides of a
> >> evtchn_close hcall, using two SRIOV nics. When I rmmod/modprobe, I
> >> see something interesting; each nic uses 3 MSIs, and it looks like
> >> when they shut down, each nic's 3 pirqs are not available until after
> >> the nic is done shutting down, so it seems like closing the evtchn
> >> isn't what makes the pirq free.
> >>
> >> [3697700.390188] xen:events: creating evtchn using pirq 101 irq 290
> >> [3697700.390214] xen:events: creating evtchn using pirq 100 irq 291
> >> [3697700.390228] xen:events: creating evtchn using pirq 99 irq 292
> >> [3697700.392789] ixgbevf 0000:00:03.0: NIC Link is Up 10 Gbps
> >> [3697700.406167] xen:events: creating evtchn using pirq 98 irq 293
> >> [3697700.406222] xen:events: creating evtchn using pirq 97 irq 294
> >> [3697700.406259] xen:events: creating evtchn using pirq 96 irq 295
> >> [3697700.408345] ixgbevf 0000:00:04.0: NIC Link is Up 10 Gbps
> >>
> >> nic 3 uses pirq 99-101, while nic 4 uses pirq 96-98.
> >>
> >> [3697705.470151] xen:events: shutdown_pirq: xen_domain() == 1,
> >> xen_pv_domain() == 0, xen_hvm_domain() == 1, xen_initial_domain() ==
> >> 0, xen_pvh_domain() == 0
> >>
> >> just to be sure, a bit of dbg so I know what domain this is :-)
> >>
> >> [3697778.781463] xen:events: shutdown_pirq: get_free_pirq returned pirq 93
> >> [3697778.781465] xen:events: shutdown_pirq: closing evtchn for pirq 96 irq 295
> >> [3697778.781475] xen:events: shutdown_pirq: get_free_pirq returned pirq 92
> >> [3697778.781489] xen:events: shutdown_pirq: get_free_pirq returned pirq 91
> >> [3697778.781490] xen:events: shutdown_pirq: closing evtchn for pirq 97 irq 294
> >> [3697778.781498] xen:events: shutdown_pirq: get_free_pirq returned pirq 90
> >> [3697778.781508] xen:events: shutdown_pirq: get_free_pirq returned pirq 89
> >> [3697778.781509] xen:events: shutdown_pirq: closing evtchn for pirq 98 irq 293
> >> [3697778.781517] xen:events: shutdown_pirq: get_free_pirq returned pirq 88
> >>
> >> nic 4 is shutdown first, and closes its evtchns for pirqs 96-98; but
> >> none of those become available for get_free_pirq.
> >>
> >> [3697779.005501] xen:events: shutdown_pirq: get_free_pirq returned pirq 98
> >>
> >> aha, now nic 4 has fully finished shutting down, and nic 3 has started
> >> shutdown; we see those pirqs from nic 4 are now available. So it must
> >> not be evtchn closing that frees the pirqs.
> >>
> >> [3697779.005503] xen:events: shutdown_pirq: closing evtchn for pirq 99 irq 292
> >> [3697779.005512] xen:events: shutdown_pirq: get_free_pirq returned pirq 97
> >> [3697779.005524] xen:events: shutdown_pirq: get_free_pirq returned pirq 96
> >> [3697779.005526] xen:events: shutdown_pirq: closing evtchn for pirq 100 irq 291
> >> [3697779.005540] xen:events: shutdown_pirq: get_free_pirq returned pirq 87
> >> [3697779.005611] xen:events: shutdown_pirq: get_free_pirq returned pirq 86
> >> [3697779.005624] xen:events: shutdown_pirq: closing evtchn for pirq 101 irq 290
> >> [3697779.005659] xen:events: shutdown_pirq: get_free_pirq returned pirq 85
> >>
> >>
> >> so, since pci_disable_msix eventually calls xen_teardown_msi_irq()
> >> which calls xen_destroy_irq(), i moved the dbg to xen_destroy_irq()
> >> (and recompiled/rebooted) and found the pirqs have already been freed
> >> before that is called:
> >>
> >> [3700084.714686] xen:events: shutdown_pirq: closing evtchn for pirq 98 irq 295
> >> [3700084.714702] xen:events: shutdown_pirq: closing evtchn for pirq 99 irq 294
> >> [3700084.714708] xen:events: shutdown_pirq: closing evtchn for pirq 100 irq 293
> >> [3700084.775598] xen:events: xen_destroy_irq: get_free_pirq returned pirq 100
> >> [3700084.775599] xen:events: xen_destroy_irq: pirq 100 irq 293
> >> [3700084.775624] xen:events: xen_destroy_irq: get_free_pirq returned pirq 99
> >> [3700084.775631] xen:events: xen_destroy_irq: get_free_pirq returned pirq 98
> >> [3700084.775632] xen:events: xen_destroy_irq: pirq 99 irq 294
> >> [3700084.775646] xen:events: xen_destroy_irq: get_free_pirq returned pirq 97
> >> [3700084.775653] xen:events: xen_destroy_irq: get_free_pirq returned pirq 96
> >> [3700084.775654] xen:events: xen_destroy_irq: pirq 98 irq 295
> >> [3700084.775666] xen:events: xen_destroy_irq: get_free_pirq returned pirq 95
> >>
> >>
> >> I'm still following thru the kernel code, but it's not immediately
> >> obvious what exactly is telling the hypervisor to free the pirqs; any
> >> idea?
> >>
> >> >From the hypervisor code, it seems that the pirq is "available" via
> >> is_free_pirq():
> >> return !pirq || (!pirq->arch.irq && (!is_hvm_domain(d) ||
> >> pirq->arch.hvm.emuirq == IRQ_UNBOUND));
> >>
> >> when the evtchn is closed, it does:
> >> if ( is_hvm_domain(d1) && domain_pirq_to_irq(d1, pirq->pirq) > 0 )
> >> unmap_domain_pirq_emuirq(d1, pirq->pirq);
> >>
> >> and that call to unmap_domain_pirq_emuirq does:
> >> info->arch.hvm.emuirq = IRQ_UNBOUND;
> >>
> >> so, the only thing left is to clear pirq->arch.irq,but the only place
> >> I can find that does that is clear_domain_irq_pirq(), which is only
> >> called from pirq_guest_unbind() and unmap_domain_pirq(), but I'm not
> >> seeing where either of those would be called when all the kernel is
> >> doing is disabling a pci device.
> >
> > Thanks for the info. I think I know what causes the pirq to be unmapped:
> > when Linux disables msi or msix on the device, using the regular pci
> > config space based method, QEMU (which emulates the PCI config space)
> > tells Xen to unmap the pirq.
>
> aha, via a XEN_DOMCTL_unbind_pt_irq, maybe? Well that makes more sense then.
>
> >
> > If that's the case, and if it isn't possible for xen_hvm_setup_msi_irqs
> > to be called a second time without msis being disabled first, then I
> > think we can revert the patch.
>
> It doesn't seem possible to call it twice from a correctly-behaved
> driver, but in case of a driver bug that does try to enable msi/msix
> multiple times without disabling, __pci_enable_msix() only does
> WARN_ON(!!dev->msix_enabled), and __pci_enable_msi_range() only does
> WARN_ON(!!dev->msi_enabled); they both will continue. Maybe that
> should be changed to warn and also return error, to prevent
> re-configuring msi/msix if already configured? Or, maybe the warning
> is enough - the worst thing that reverting the patch does is use extra
> pirqs, right?

I think the warning is enough. Can you confirm that with
af42b8d12f8adec6711cb824549a0edac6a4ae8f reverted, also

ifconfig eth0 down; ifconfig eth0 up

still work as expected, no warnings?


It looks like the patch that changed hypervisor (QEMU actually) behavior
is:

commit c976437c7dba9c7444fb41df45468968aaa326ad
Author: Zhenzhong Duan <zhenzhong.duan@xxxxxxxxxx>
Date: Wed May 7 13:41:48 2014 +0000

qemu-xen: free all the pirqs for msi/msix when driver unload

>From this commit onward, QEMU started unmapping pirqs when MSIs are
disabled. c976437c7dba9c7444fb41df45468968aaa326ad is present in 4.8,
4.7, 4.6, 4.5. The newest release without the commit is Xen 4.4.

If we revert af42b8d12f8adec6711cb824549a0edac6a4ae8f, we fix the bug on
all Xen versions from 4.5 onward, but we break the behavior on Xen 4.4
and older. Given that Xen 4.4 is out of support, I think we should go
ahead with it. Opinions?