Re: pcieport AER error spam on Intel Skylake

From: Bjorn Helgaas
Date: Fri Aug 05 2016 - 14:54:22 EST


On Fri, Aug 05, 2016 at 12:15:53PM -0600, Daniel Drake wrote:
> Hi Alexander,
>
> Reviving an old topic here...
>
> We are seeing this "problem" on an increasing number of units from the
> vendor, and searching around it can also be seen on Dell and HP
> products. Always with the same Realtek b723 wifi device. e.g.
> https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1521173
>
> The amount of error spam is problematic in that it slows down boot
> really significantly, while printing lots of scary messages for the
> user.
> We tried doing a PCI MSI blacklist for affected laptops but we are
> struggling to keep that blacklist updated with the increasing number
> of affected models.
>
> Enough hacks, I am wondering what we can do to solve this problem in
> the mainline kernel...

I think this is a bug in AER:
https://bugzilla.kernel.org/show_bug.cgi?id=109691

I think I understand the problem, but I haven't had time to fix it.
The bugzilla has a pointer to more details, and it would be awesome if
somebody would jump in.

> On Thu, Sep 3, 2015 at 12:05 PM, Alexander Duyck
> <alexander.duyck@xxxxxxxxx> wrote:
> > On 09/03/2015 06:32 AM, Daniel Drake wrote:
> >>
> >> On Wed, Sep 2, 2015 at 7:57 PM, Alexander Duyck
> >> <alexander.duyck@xxxxxxxxx> wrote:
> >>>
> >>> Since these are correctable errors, it is likely some sort of
> >>> signalling issue.
> >>> Could we get the output of something like "lspci -vt"? Then we would
> >>> be able to tell what device is on the other side of the link from
> >>> 00:1c.5, and we could check whether there have been any changes to
> >>> the driver for the device on the other end of the link.
> >>
> >> "lspci -vt" reliably causes one occurrence of the message, which is
> >> logged by the kernel before lspci has written anything to stdout.
> >> pcieport 0000:00:1c.5: AER: Corrected error received: id=00e5
> >> pcieport 0000:00:1c.5: PCIe Bus Error: severity=Corrected, type=Physical Layer, id=00e5(Receiver ID)
> >> pcieport 0000:00:1c.5: device [8086:9d15] error status/mask=00000001/00002000
> >> pcieport 0000:00:1c.5:    [ 0] Receiver Error
> >>
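The fields in the log above can be decoded mechanically. The following is a minimal sketch, not kernel code: the function names are made up for illustration, and the bit table reflects the Correctable Error Status/Mask layout as I understand it from the PCIe specification, so check it against the spec before relying on it.

```python
# Sketch only: helper names are hypothetical, and the bit names are assumed
# from the PCIe AER Correctable Error Status/Mask register layout.
CORRECTABLE_BITS = {
    0: "Receiver Error",
    6: "Bad TLP",
    7: "Bad DLLP",
    8: "REPLAY_NUM Rollover",
    12: "Replay Timer Timeout",
    13: "Advisory Non-Fatal Error",
    14: "Corrected Internal Error",
    15: "Header Log Overflow",
}


def decode_id(rid):
    """Split a 16-bit PCIe ID into bus:device.function notation."""
    bus = (rid >> 8) & 0xFF   # upper 8 bits
    dev = (rid >> 3) & 0x1F   # next 5 bits
    fn = rid & 0x7            # lowest 3 bits
    return "%02x:%02x.%x" % (bus, dev, fn)


def decode_bits(value):
    """Return the names of the known correctable-error bits set in value."""
    return [name for bit, name in sorted(CORRECTABLE_BITS.items())
            if value & (1 << bit)]


if __name__ == "__main__":
    # Values taken from the log: id=00e5, status/mask=00000001/00002000
    print("reporting device:", decode_id(0x00E5))
    print("status bits:", decode_bits(0x00000001))
    print("masked bits:", decode_bits(0x00002000))
```

With the values from the log, the id resolves to the root port 00:1c.5 itself, the status register has only the Receiver Error bit set, and the only bit currently masked is Advisory Non-Fatal Error.
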
> >> -[0000:00]-+-00.0  Intel Corporation Device 1904
> >>            +-02.0  Intel Corporation Device 1916
> >>            +-04.0  Intel Corporation Device 1903
> >>            +-08.0  Intel Corporation Device 1911
> >>            +-14.0  Intel Corporation Device 9d2f
> >>            +-14.2  Intel Corporation Device 9d31
> >>            +-15.0  Intel Corporation Device 9d60
> >>            +-15.1  Intel Corporation Device 9d61
> >>            +-16.0  Intel Corporation Device 9d3a
> >>            +-17.0  Intel Corporation Device 9d03
> >>            +-1c.0-[01]--
> >>            +-1c.4-[02]----00.0  Realtek Semiconductor Co., Ltd. RTL8111/8168 PCI Express Gigabit Ethernet controller
> >>            +-1c.5-[03]----00.0  Realtek Semiconductor Co., Ltd. Device b723
> >>            +-1f.0  Intel Corporation Device 9d48
> >>            +-1f.2  Intel Corporation Device 9d21
> >>            +-1f.3  Intel Corporation Device 9d70
> >>            \-1f.4  Intel Corporation Device 9d23
> >>
> >> Does this mean these messages are somehow related to the Realtek b723
> >> device? That is the wifi card.
> >> Using x86_64_defconfig there is not even any driver loaded for this
> >> device, yet the messages appear quite a bit.
> >> If I use a full config with all the relevant drivers including
> >> rtlwifi, the frequency of these messages goes up a lot though.
> >
> >
> > > The correctable errors are likely the result of some sort of link error
> > > between the root port 00:1c.5 and the wireless adapter at 03:00.0. What is
> > > probably happening is that when the device is unused the link transitions
> > > down to a lower power state like L0s or L1, and when it comes out of that
> > > state it triggers the PCIe error, most likely as a result of something
> > > during the PCIe link training sequence.
> >
> > You might want to notify the manufacturer of the laptop as they may need to
> > address an issue in their hardware, firmware, or possibly add a workaround
> > to mask off Receiver Error reporting for their part via either a PCIe quirk
> > or driver fix.
> >
> > >>> My suspicion, since this is a laptop, is that a power management
> > >>> change might be responsible if this is a regression; I have seen
> > >>> messages like this pop up as a result of ASPM being enabled before.
> >>
> >> It's likely not a regression, this is brand new hardware and this
> >> message is seen on all kernels that we have tried (4.1, 4.2, master).
> >> pcie_aspm=off also makes these messages go away.
> >
> >
> > Correctable errors are considered a sign of the PCIe link health. In theory
> > they can be ignored since by definition they can be corrected by the
> > hardware. One thing you could do if you aren't using the wireless card
> > would be to simply switch off the correctable error reporting by setting the
> > mask bit for it in configuration space using setpci.
> >
> > > To do that, find the offset of the PCIe AER capability for your port by
> > > running "lspci -vvv -s 0:1c.5". You should get a dump listing the
> > > capabilities and their current settings. In there you should find a line
> > > like:
> > > Capabilities: [148 v1] Advanced Error Reporting
> > >
> > > The 148 is the hex offset of the AER capability in configuration space.
> > > The Correctable Error Mask register is located at a hex offset of 0x14
> > > from there, so adding the hex values 0x148 and 0x14 gives us 0x15C. To
> > > disable reporting of correctable receiver errors you would set bit 0 in
> > > whatever value you get from "setpci -s 0:1c.5 0x15C.l" (that is, add 1 if
> > > the bit is currently clear) and then write that value back. So for
> > > example on my system I ended up with something like "setpci -s 0:1c.5
> > > 0x15C.l=2001" where the output from the first command was 2000.
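The arithmetic in the steps above can be sketched as follows. This is only an illustration under stated assumptions: the helper names are hypothetical, the AER capability offset (0x148) and the current mask value (0x2000) are the example values from the message and will differ per system, and the script only prints the setpci command lines rather than running them.

```python
# Sketch only: computes the register address and new mask value described
# above. Helper names are hypothetical; nothing here touches hardware.
AER_CORR_MASK_OFFSET = 0x14    # Correctable Error Mask, relative to the AER capability
RECEIVER_ERROR_BIT = 1 << 0    # bit 0 = Receiver Error


def corr_mask_register(aer_cap_offset):
    """Config-space address of the Correctable Error Mask register."""
    return aer_cap_offset + AER_CORR_MASK_OFFSET


def mask_receiver_error(current_mask):
    """Set bit 0 with OR, so an already-set bit is left unchanged."""
    return current_mask | RECEIVER_ERROR_BIT


def setpci_commands(device, aer_cap_offset, current_mask):
    """Build the read and write command lines quoted in the message."""
    reg = corr_mask_register(aer_cap_offset)
    new_mask = mask_receiver_error(current_mask)
    read_cmd = "setpci -s %s 0x%X.l" % (device, reg)
    write_cmd = "setpci -s %s 0x%X.l=%x" % (device, reg, new_mask)
    return read_cmd, write_cmd


if __name__ == "__main__":
    # Example values from the message: capability at 0x148, current mask 0x2000
    for cmd in setpci_commands("0:1c.5", 0x148, 0x2000):
        print(cmd)
```

Using OR rather than literally adding 1 matters if the bit is already set: adding 1 to 0x2001 would clear bit 0 and set bit 1 instead.
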
>
> I guess this is the most concrete suggestion for how to avoid the
> issue - perhaps we can do that in rtl8723be driver probe. However, you
> mentioned above that we should only do it if we aren't using the
> wireless card. In this case we are using it... should we look for
> another approach instead?
>
> Thanks
> Daniel