Re: [PATCH v2] PCI/MSI: Don't touch MSI bits when the PCI device is disconnected

From: Alex_Gagniuc
Date: Thu Nov 08 2018 - 17:21:13 EST


On 11/08/2018 02:09 PM, Bjorn Helgaas wrote:
>
> [+cc Jonathan, Greg, Lukas, Russell, Sam, Oliver for discussion about
> PCI error recovery in general]

Has anyone seen the ECRs in the PCIe base spec and ACPI that have
been floating around the past few months? -- HPX, SFI, CER. Without
divulging too much before publication, I'm curious about opinions on how
well (or not well) those flows would work in general, and in Linux.

> On Wed, Nov 07, 2018 at 05:42:57PM -0600, Bjorn Helgaas wrote:
> I'm having second thoughts about this. One thing I'm uncomfortable
> with is that sprinkling pci_dev_is_disconnected() around feels ad hoc
> instead of systematic, in the sense that I don't know how we convince
> ourselves that this (and only this) is the correct place to put it.
>
> Another is that the only place we call pci_dev_set_disconnected() is
> in pciehp and acpiphp, so the only "disconnected" case we catch is if
> hotplug happens to be involved. Every MMIO read from the device is an
> opportunity to learn whether it is reachable (a read from an
> unreachable device typically returns ~0 data), but we don't do
> anything at all with those.
>
> The config accessors already check pci_dev_is_disconnected(), so this
> patch is really aimed at MMIO accesses. I think it would be more
> robust if we added wrappers for readl() and writel() so we could
> notice read errors and avoid future reads and writes.

I wouldn't expect anything less than complete scrutiny and quality
control of unquestionable moral integrity :). In theory ~0 can be a
great indicator that something may be wrong. Though I think it's about
as ad-hoc as pci_dev_is_disconnected().

I slightly like the idea of wrapping the MMIO accessors. There's still
memcpy and DMA that cause the same MemRead/Wr PCIe transactions, and the
same sort of errors in PCIe land, and it would be good to have more
testing on this. Since this patch is tested and confirmed to fix a known
failure case, I would keep it, and then look at fixing the problem in a
more generic way.
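
To make the wrapper idea concrete, here is a rough userspace sketch of
what "notice read errors and avoid future reads and writes" could look
like. Everything in it is an assumption for illustration: struct
fake_dev, dev_readl() and dev_writel() are hypothetical names, not
kernel APIs, and the "MMIO window" is just an array. It also treats an
all-ones read as proof of disconnection, which is only a heuristic,
since a register can legitimately hold ~0.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a device; NOT the kernel's struct pci_dev. */
struct fake_dev {
	volatile uint32_t *regs;	/* simulated MMIO window */
	bool disconnected;		/* latched after an all-ones read */
};

/*
 * Hypothetical read wrapper: once the device is marked disconnected,
 * stop touching it and return all-ones, as an unreachable device would.
 */
static uint32_t dev_readl(struct fake_dev *dev, size_t idx)
{
	uint32_t val;

	assert(dev != NULL);
	if (dev->disconnected)
		return UINT32_MAX;

	val = dev->regs[idx];
	if (val == UINT32_MAX)
		dev->disconnected = true;	/* heuristic only, see above */
	return val;
}

/*
 * Hypothetical write wrapper: skip the write if the device is already
 * known to be gone. Returns true if the write was issued.
 */
static bool dev_writel(struct fake_dev *dev, size_t idx, uint32_t val)
{
	assert(dev != NULL);
	if (dev->disconnected)
		return false;

	dev->regs[idx] = val;
	return true;
}
```

A real version would need a second check before latching the flag, and
it would still not cover the memcpy and DMA paths mentioned above.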

BTW, a lot of the problems we're fixing here come courtesy of
firmware-first error handling. Do we reach a point where we draw a line
in handling new problems introduced by FFS? So, if something is a
problem with FFS, but not native handling, do we commit to supporting it?

Alex