Re: 5.7 regression: Lots of PCIe AER errors and suspend failure without pcie=noaer

From: Robert Hancock
Date: Tue Jul 21 2020 - 19:55:49 EST


On Fri, Jul 10, 2020 at 6:28 PM Robert Hancock <hancockrwd@xxxxxxxxx> wrote:
>
> On Fri, Jul 10, 2020 at 6:23 PM Robert Hancock <hancockrwd@xxxxxxxxx> wrote:
> >
> > Noticed a problem on my desktop with an Asus PRIME H270-PRO
> > motherboard after Fedora 32 upgraded to the 5.7 kernel (now on 5.7.8):
> > periodically there are PCIe AER errors getting spewed in dmesg that
> > weren't happening before, and this also seems to causes suspend to
> > fail - the system just wakes back up again right away, I am assuming
> > due to some AER errors interrupting the process. 5.6 kernels didn't
> > have this problem. Setting "pcie=noaer" on the kernel command line
> > works around the issue, but I'm not sure what would have changed to
> > trigger this to occur?
>
> Correction: the workaround option is "pci=noaer".

As a follow-up, from some more experimentation, it appears that
disabling PCIe ASPM with setpci on both the ASMedia PCIe-PCI bridge as
well as the PCIe root port it is connected to seems to silence the AER
errors and allow suspend/resume to work again:

setpci -s 00:1c.0 0x50.B=0x00
setpci -s 02:00.0 0x90.B=0x00

It appears the behavior changed as a result of this patch (which went
into the stable tree for 5.7.6 and so affects 5.7 kernels as well):

commit 66ff14e59e8a30690755b08bc3042359703fb07a
Author: Kai-Heng Feng <kai.heng.feng@xxxxxxxxxxxxx>
Date: Wed May 6 01:34:21 2020 +0800

PCI/ASPM: Allow ASPM on links to PCIe-to-PCI/PCI-X Bridges

7d715a6c1ae5 ("PCI: add PCI Express ASPM support") added the ability for
Linux to enable ASPM, but for some undocumented reason, it didn't enable
ASPM on links where the downstream component is a PCIe-to-PCI/PCI-X Bridge.

Remove this exclusion so we can enable ASPM on these links.

The Dell OptiPlex 7080 mentioned in the bugzilla has a TI XIO2001
PCIe-to-PCI Bridge. Enabling ASPM on the link leading to it allows the
Intel SoC to enter deeper Package C-states, which is a significant power
savings.

[bhelgaas: commit log]
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=207571
Link: https://lore.kernel.org/r/20200505173423.26968-1-kai.heng.feng@xxxxxxxxxxxxx
Signed-off-by: Kai-Heng Feng <kai.heng.feng@xxxxxxxxxxxxx>
Signed-off-by: Bjorn Helgaas <bhelgaas@xxxxxxxxxx>
Reviewed-by: Mika Westerberg <mika.westerberg@xxxxxxxxxxxxxxx>

Unfortunately it appears that this ASMedia PCIe-PCI bridge:

02:00.0 PCI bridge [0604]: ASMedia Technology Inc. ASM1083/1085 PCIe
to PCI Bridge [1b21:1080] (rev 04)

doesn't cope with ASPM properly and causes a bunch of PCIe link
errors. (This is in addition to some broken-ness known as far back as
2012 with these ASM1083/1085 chips with regard to PCI interrupts
getting stuck, but this ASPM problem causes issues even if no devices
are connected to the PCI side of the bridge, as is the case on my
system.)

Might need a quirk to disable ASPM on this device?