Re: [PATCH 1/2] igc: don't rd/wr iomem when PCI is removed
From: Bjorn Helgaas
Date: Mon Jul 19 2021 - 20:45:24 EST
On Mon, Jul 19, 2021 at 12:49:18PM +1000, Oliver O'Halloran wrote:
> On Mon, Jul 19, 2021 at 8:51 AM Pali Rohár <pali@xxxxxxxxxx> wrote:
> > And do we have some solution for this kind of issue? There are more PCIe
> > controllers / platforms which do not like MMIO read/write operation when
> > card / link is not connected.
> Do you have some actual examples? The few times I've seen those
> crashes were due to broken firmware-first error handling. The AER
> notifications would be escalated into some kind of ACPI error which
> the kernel didn't have a good way of dealing with so it panicked
> Assuming it is a real problem then as Bjorn pointed out this sort of
> hack doesn't really fix the issue because hotplug and AER
> notifications are fundamentally asynchronous. If the driver is
> actively using the device when the error / removal happens then the
> pci_dev_is_disconnected() check will pass and the MMIO will go
> through. If the MMIO is poisonous because of dumb hardware then this
> sort of hack will only paper over the issue.
> > If we do not provide a way how to solve these problems then we can
> > expect that people would just hack ethernet / wifi / ... device drivers
> > which are currently crashing by patches like in this thread.
> > Maybe PCI subsystem could provide wrapper function which implements
> > above pattern and which can be used by device drivers?
> We could do that and I think there was a proposal to add some
> pci_readl(pdev, <addr>) style wrappers at one point.
Obviously this wouldn't help user-space mmaps, but in the kernel,
Documentation/driver-api/device-io.rst  does say that drivers are
supposed to use readl() et al even though on most arches it "works"
to just dereference the result of ioremap(), so maybe we could make
a useful wrapper.
Seems like we should do *something*, even if it's just a generic
#define and some examples. I took a stab at this  a couple years
ago, but it was only for the PCI core, and it didn't go anywhere.
> On powerpc
> there's hooks in the arch provided MMIO functions to detect error
> responses and kick off the error handling machinery when a problem is
> detected. Those hooks are mainly there to help the platform detect
> errors though and they don't make life much easier for drivers. Due to
> locking concerns the driver's .error_detected() callback cannot be
> called in the MMIO hook so even when the platform detects errors
> synchronously the driver notifications must happen asynchronously. In
> the meanwhile the driver still needs to handle the 0xFFs response
> safely and there's not much we can do from the platform side to help