Re: [PATCH v2] mmc: sdhci-pci-gli: GL975x: Mask rootport's replay timer timeout during suspend

From: Bjorn Helgaas
Date: Tue Mar 26 2024 - 17:20:07 EST


On Tue, Mar 26, 2024 at 09:52:28AM +0800, Kai-Heng Feng wrote:
> On Tue, Mar 26, 2024 at 3:02 AM Bjorn Helgaas <helgaas@xxxxxxxxxx> wrote:
> > On Mon, Mar 25, 2024 at 10:02:27AM +0800, Kai-Heng Feng wrote:
> > > On Sat, Mar 23, 2024 at 12:43 AM Bjorn Helgaas <helgaas@xxxxxxxxxx> wrote:
> ...

> > > > If that's the case, why do the
> > > > masking in the suspend/resume callbacks?
> > >
> > > Because there's no functional impact when the error happens, other
> > > than suspend/resume.
> >
> > Oh, I think I see. Is this accurate?
> >
> > Due to a hardware defect in GL975x, config accesses when ASPM is
> > enabled frequently cause Replay Timer Timeouts in the Port leading
> > to the device.
> >
> > These are Correctable Errors, so the Downstream Port logs it in its
> > PCI_ERR_COR_STATUS and, when the error is not masked, sends an
> > ERR_COR message upstream. The message terminates at a Root Port,
> > which may generate an AER interrupt so the OS can log it.
> >
> > The Correctable Error logging is an annoyance but normally not a
> > major issue. But when the AER interrupt happens during suspend, it
> > can prevent the system from suspending.
>
> That's totally the case here.
>
> This brings up another different but related topic - should the port
> driver disable AER/DPC IRQ during suspend?
> We've discussed this many times, I still think that's the right
> approach to "quiesce" many unexpected errors during system state
> transition.

Maybe so. We can continue that in the context of that patch. Maybe
it needs to be reposted; I can't remember where it's at right now.

> > I think we should log a hint in dmesg that we're masking
> > PCI_ERR_COR_REP_TIMER because the error will still be logged in the
> > PCI_ERR_COR_STATUS register, and that will be visible via lspci, and a
> > dmesg hint will save debugging time when people report that.
>
> Sure. Where do you think it's a better place to implement the quirk? I
> Assume PCI quirk is a better place than driver's probe routine?

Yes, I think drivers/pci/quirks.c is a better place so we can mask it
even if the driver isn't loaded. Users can still run lspci and see
these errors even if the driver isn't loaded.

Bjorn