Re: [PATCH] PCI/MSI: Avoid torn updates to MSI pairs

From: Evan Green
Date: Wed Jan 22 2020 - 19:08:24 EST


On Wed, Jan 22, 2020 at 3:37 PM Thomas Gleixner <tglx@xxxxxxxxxxxxx> wrote:
>
> Evan Green <evgreen@xxxxxxxxxxxx> writes:
> > On Wed, Jan 22, 2020 at 9:28 AM Bjorn Helgaas <helgaas@xxxxxxxxxx> wrote:
> >> I suspect this *is* a problem because I think disabling MSI doesn't
> >> disable interrupts; it just means the device will interrupt using INTx
> >> instead of MSI. And the driver is probably not prepared to handle
> >> INTx.
> >>
> >> PCIe r5.0, sec 7.7.1.2, seems relevant: "If MSI and MSI-X are both
> >> disabled, the Function requests servicing using INTx interrupts (if
> >> supported)."
>
> Disabling MSI is not an option. Masking yes, but MSI does not have
> mandatory masking. We already attempt masking on migration, which covers
> only MSI-X reliably, but not all MSI incarnations.
>
> So I assume that problem happens on a MSI interrupt, right?
>
> >> Maybe the IRQ guys have ideas about how to solve this?
>
> Maybe :)
>
> > But don't we already do this in __pci_restore_msi_state():
> > pci_intx_for_msi(dev, 0);
> > pci_msi_set_enable(dev, 0);
> > arch_restore_msi_irqs(dev);
> >
> > I'd think if there were a chance for a line-based interrupt to get in
> > and wedge itself, it would already be happening there.
>
> That's a completely different beast. It's used when resetting a device
> and for other stuff like virt state migration. That's not a model for
> affinity changes of a live device.

Hm. Ok.

>
> > One other way you could avoid torn MSI writes would be to ensure that
> > if you migrate IRQs across cores, you keep the same x86 vector number.
> > That way the address portion would be updated, and data doesn't
> > change, so there's no window. But that may not actually be feasible.
>
> That's not possible simply because the x86 vector space is limited. If
> we would have to guarantee that then we'd end up with a max of ~220
> interrupts per system. Sufficient for your notebook, but the big iron
> people would be not amused.

Right, that occurred to me as well. The actual requirement isn't quite
as restrictive. What you really need is the old vector to be
registered on both the old CPU and the new CPU. Then once the
interrupt is confirmed to have moved we could release the old vector
on both CPUs, leaving only the new vector on the new CPU.
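
To make that concrete, here is a toy userspace model (not kernel code).
The table, struct and function names are made up for illustration, and
it assumes the MSI address is written before the data, so the torn pair
a device can observe is (new CPU, old vector):

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define NR_CPUS    4
#define NR_VECTORS 256

/* Toy per-CPU vector table (not a kernel structure): the IRQ number
 * registered at [cpu][vector], or 0 if the slot is free. */
static int vector_table[NR_CPUS][NR_VECTORS];

struct msi_target {
    int cpu;     /* encoded in the MSI address */
    int vector;  /* encoded in the MSI data */
};

static bool reaches_handler(struct msi_target t, int irq)
{
    return vector_table[t.cpu][t.vector] == irq;
}

/*
 * Count how many of the three states a device can observe during a move
 * (old pair, torn pair, new pair) land on a slot owned by irq.
 * Assumption: address is written before data, so the torn pair is
 * (new CPU, old vector).
 */
static int covered_states(int irq, struct msi_target from,
                          struct msi_target to, bool old_vector_on_both)
{
    struct msi_target torn = { to.cpu, from.vector };
    int covered = 0;

    assert(irq != 0 && from.cpu != to.cpu);

    memset(vector_table, 0, sizeof(vector_table));
    vector_table[from.cpu][from.vector] = irq;
    vector_table[to.cpu][to.vector] = irq;
    if (old_vector_on_both)
        vector_table[to.cpu][from.vector] = irq;  /* the extra registration */

    covered += reaches_handler(from, irq);
    covered += reaches_handler(torn, irq);
    covered += reaches_handler(to, irq);
    return covered;
}
```

Without the extra registration only two of the three observable states
reach the handler; with it, all three do.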

In that world some SMP affinity transitions might fail, which is a
bummer. To avoid that, you could first migrate to a vector that's
available on both the source and destination CPUs, keeping affinity
the same. Then change affinity in a separate step.
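
A rough sketch of that two-step plan, again as a userspace toy. The
helper names and the "used" bitmaps are hypothetical; the only thing it
shows is that each step changes just one half of the address/data pair:

```c
#include <assert.h>
#include <stdbool.h>

#define NR_VECTORS          256
#define FIRST_DEVICE_VECTOR 32  /* vectors 0-31 are reserved for exceptions on x86 */

struct msi_target {
    int cpu;     /* encoded in the MSI address */
    int vector;  /* encoded in the MSI data */
};

/* Hypothetical helper: first vector unused on both CPUs, or -1 if none. */
static int find_common_free_vector(const bool src_used[NR_VECTORS],
                                   const bool dst_used[NR_VECTORS])
{
    for (int v = FIRST_DEVICE_VECTOR; v < NR_VECTORS; v++)
        if (!src_used[v] && !dst_used[v])
            return v;
    return -1;
}

/*
 * Fill steps[0] (same CPU, common vector: only the data changes) and
 * steps[1] (new CPU, same vector: only the address changes).
 * Returns 0 on success, -1 if no vector is free on both CPUs.
 */
static int plan_two_step_move(struct msi_target cur, int dst_cpu,
                              const bool src_used[NR_VECTORS],
                              const bool dst_used[NR_VECTORS],
                              struct msi_target steps[2])
{
    int v = find_common_free_vector(src_used, dst_used);

    if (v < 0)
        return -1;

    steps[0] = (struct msi_target){ cur.cpu, v };
    steps[1] = (struct msi_target){ dst_cpu, v };

    /* Each step must leave one half of the pair untouched. */
    assert(steps[0].cpu == cur.cpu);
    assert(steps[1].vector == steps[0].vector);
    return 0;
}
```

The -1 return is the case where the transition would still have to fail
or fall back to something else.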

Alternatively, you could permanently designate a "transit" vector.
If an interrupt fires on this vector, we call the ISRs of all
interrupts currently in transit between CPUs. You might end up calling
ISRs that didn't actually need service, but that's better than missing
edges.
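
Something like the following, as a userspace toy. The slot table and
all the names are invented for illustration and there is no locking;
count_isr is just a stand-in ISR:

```c
#include <assert.h>
#include <stddef.h>

#define MAX_IN_TRANSIT 8

typedef void (*isr_fn)(void *ctx);

struct transit_entry {
    isr_fn isr;   /* NULL means the slot is free */
    void *ctx;
};

static struct transit_entry in_transit[MAX_IN_TRANSIT];

/* Mark an interrupt as in transit. Returns its slot, or -1 if full. */
static int transit_begin(isr_fn isr, void *ctx)
{
    assert(isr != NULL);

    for (int i = 0; i < MAX_IN_TRANSIT; i++) {
        if (in_transit[i].isr == NULL) {
            in_transit[i].isr = isr;
            in_transit[i].ctx = ctx;
            return i;
        }
    }
    return -1;
}

/* The move is confirmed complete: stop servicing it from the transit vector. */
static void transit_end(int slot)
{
    in_transit[slot].isr = NULL;
    in_transit[slot].ctx = NULL;
}

/*
 * Handler for the shared transit vector: we cannot tell which device
 * fired, so call every ISR that is mid-move. Returns how many ran.
 */
static int transit_vector_handler(void)
{
    int called = 0;

    for (int i = 0; i < MAX_IN_TRANSIT; i++) {
        if (in_transit[i].isr != NULL) {
            in_transit[i].isr(in_transit[i].ctx);
            called++;
        }
    }
    return called;
}

/* Stand-in ISR that counts how often it was invoked. */
static void count_isr(void *ctx)
{
    ++*(int *)ctx;
}
```

The ISRs have to tolerate being called spuriously, same as they would
on a shared line-based interrupt.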

>
> The real critical path here is the CPU hotplug path.
>
> For regular migration between two online CPUs we use the 'migrate when
> the irq is actually serviced' mechanism. That might have the same issue
> on misdesigned devices which are firing the next interrupt before the
> one in flight is serviced, but I haven't seen any reports with that
> symptom yet.
>
> But before I dig deeper into this, please provide the output of
>
> 'lspci -vvv' and 'cat /proc/interrupts'
>

Here it is:
https://pastebin.com/YyxBUvQ2

This is a Comet Lake system. It has 8 HT cores, but 4 of those cores
have already been offlined.

At the bottom of the paste I also included the script I used that
causes a repro in a minute or two. I simply run this, then put some
stress on USB. For me that stress was "join a Hangouts meeting", since
that stressed both my USB webcam and USB ethernet. The script exits
when xhci dies.
-Evan