Re: [PATCH 5.4 182/389] PCI/portdrv: Don't disable AER reporting in get_port_device_capability()
From: Ben Greear
Date: Wed Mar 29 2023 - 19:17:50 EST
On 8/30/22 3:16 PM, Ben Greear wrote:
On 8/30/22 2:55 PM, Pali Rohár wrote:
On Tuesday 30 August 2022 14:28:14 Ben Greear wrote:
On 8/30/22 1:58 PM, Pali Rohár wrote:
On Tuesday 30 August 2022 13:47:48 Ben Greear wrote:
On 8/23/22 11:41 PM, Greg Kroah-Hartman wrote:
On Tue, Aug 23, 2022 at 07:20:14AM -0500, Bjorn Helgaas wrote:
On Tue, Aug 23, 2022, 6:35 AM Greg Kroah-Hartman <gregkh@xxxxxxxxxxxxxxxxxxx> wrote:
From: Stefan Roese <sr@xxxxxxx>
[ Upstream commit 8795e182b02dc87e343c79e73af6b8b7f9c5e635 ]
There's an open regression related to this commit:
https://bugzilla.kernel.org/show_bug.cgi?id=216373
This is already in the following released stable kernels:
5.10.137 5.15.61 5.18.18 5.19.2
I'll go drop it from the 4.19 and 5.4 queues, but when this gets
resolved in Linus's tree, make sure there's a cc: stable on the fix so
that we know to backport it to the above branches as well. Or at the
least, a "Fixes:" tag.
This is still in 5.19.5. We saw some funny iwlwifi crashes in 5.19.3+
that we did not see in 5.19.0+. I just bisected the scary-looking AER errors to this
patch, though I do not know for certain yet whether it causes the iwlwifi-related crashes.
From reading the commit message, this patch doesn't seem to be a great candidate
for stable. Does it fix some important problem?
In case it helps, here is an example of what I see in dmesg. The kernel crashes in iwlwifi
had to do with rx messages from the firmware, and some warnings led me to believe that
PCI messages were slow coming back and/or maybe duplicated. So maybe this AER patch changes
timing or otherwise screws up the PCI adapter boards we use...
From that log I have the feeling that the issue is in the Intel wifi card and
that it was there before that commit as well. The card is crashing (or something
else is happening on the PCIe bus), and because the kernel had disabled Error
Reporting for this card, nobody spotted any issue. That commit just
opened the kernel's eyes to those errors.
I think this issue should be reported to the Intel wifi card developers;
maybe they can comment on why the card is reporting errors.
My main concern is not that AER messages started showing up, but that
kernel NPEs and WARNINGs started showing up sometime after 5.19.0.
Possibly this AER thing is misdirection and the real bug is elsewhere,
but since the bugzilla also indicated (different) driver crashes,
I am suspicious that this changes things more significantly, at least on a subset
of hardware out there.
Yes, of course, this is something that needs to be investigated.
Anyway, do you see driver crashes? Or just these AER errors? And are
your PCIe cards working, or did they stop working after these messages
appeared in dmesg? We need to know whether you are just being spammed by tons of
lines in dmesg while everything otherwise works, or whether your PCIe devices
stop working after the AER errors and a reboot is required.
We did see a higher frequency of weird crashes (accessing a null-ish pointer) after upgrading to 5.19.3.
I am now building a 5.19.5 kernel with that AER patch reverted. We will
test to see if that solves the crashes.
Also, any idea what this error in my logs is actually indicating?
Your PCIe controller received a non-fatal but uncorrected error. There is
also an indication of Unsupported Request Completion Status. An Unsupported
Request is generated by a PCIe device when the controller / host / kernel tries
to do something that the device does not support; it is a pretty generic error.
The PCIe base spec describes a lot of scenarios in which a card should return this
error. Maybe some more detailed information is in the TLP Header hexdump,
but I cannot decode it now.
Basically, it is the PCIe card's driver that could know how fatal the
issue is and how to recover from it. But as you can see, the Intel wifi driver
does not implement that callback (error_detected).
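As a rough aid for reading reports of this kind, here is a minimal Python
sketch that decodes an "error status/mask" pair and a "TLP Header" line as
they appear in the AER messages in this thread. It assumes the register and
header layouts from the PCIe base spec (Uncorrectable Error Status bits;
Fmt/Type in header byte 0; Requester ID, Tag and Message Code in bytes 4-7
of a message TLP). The function names and both lookup tables are made up for
illustration, and the tables are deliberately incomplete.

```python
# Subset of the AER Uncorrectable Error Status bits, keyed by bit position.
# The short names follow the style the kernel prints in dmesg.
UNCOR_BITS = {
    4: "DLP",         # Data Link Protocol Error
    12: "TLP",        # Poisoned TLP
    14: "CmpltTO",    # Completion Timeout
    15: "CmpltAbrt",  # Completer Abort
    16: "UnxCmplt",   # Unexpected Completion
    18: "MalfTLP",    # Malformed TLP
    20: "UnsupReq",   # Unsupported Request
}

# A few PCIe message codes; this table is intentionally incomplete.
MSG_CODES = {
    0x10: "LTR",
    0x30: "ERR_COR",
    0x31: "ERR_NONFATAL",
    0x33: "ERR_FATAL",
}


def decode_uncor_status(status, mask=0):
    """Return (bit, name) for every unmasked bit set in the status word."""
    active = status & ~mask
    return [(bit, UNCOR_BITS.get(bit, "unknown"))
            for bit in range(32) if active & (1 << bit)]


def decode_tlp_header(dwords):
    """Decode the first two header DWORDs of a request or message TLP.

    Assumes header byte 0 is the most significant byte of dwords[0].
    The Requester ID position used here does not apply to completions.
    """
    dw0, dw1 = dwords[0], dwords[1]
    byte0 = dw0 >> 24
    tlp_type = byte0 & 0x1F
    requester = dw1 >> 16
    info = {
        "fmt": byte0 >> 5,
        "type": tlp_type,
        "is_message": (tlp_type >> 3) == 0b10,  # Type 10rrr means Msg
        "requester": "{:02x}:{:02x}.{}".format(
            requester >> 8, (requester >> 3) & 0x1F, requester & 0x7),
        "tag": (dw1 >> 8) & 0xFF,
    }
    if info["is_message"]:
        info["msg_routing"] = tlp_type & 0x7
        info["msg_code"] = dw1 & 0xFF
    return info


# Values copied from one report in this thread.
print(decode_uncor_status(0x00100000, 0x00000000))
hdr = decode_tlp_header([0x34000000, 0x05001F10, 0x00000000, 0x88C888C8])
print(hdr, MSG_CODES.get(hdr.get("msg_code"), "not in table"))
```

If the layout assumptions above hold, the status word has only bit 20
(UnsupReq) set, and the logged header reads as a message TLP from requester
05:00.0 with message code 0x10, which the table maps to LTR. That is only a
reading of the hexdump, not a diagnosis.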
Hello,
I notice this patch appears to be in the 6.2.6 kernel, and my kernel logs are
full of spam and the system is unstable. Possibly the instability is related
to something else, but the log spam is definitely extreme.
These systems are fairly stable on 5.19-ish kernels without the patch in
question.
Any suggested cures for this other than reverting the patch?
Here is a sample of the spam:
[ 1675.547023] pcieport 0000:03:02.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Requester ID)
[ 1675.556851] pcieport 0000:03:02.0: device [10b5:8619] error status/mask=00100000/00000000
[ 1675.563904] pcieport 0000:03:02.0: [20] UnsupReq (First)
[ 1675.569398] pcieport 0000:03:02.0: AER: TLP Header: 34000000 05001f10 00000000 88c888c8
[ 1675.576296] iwlwifi 0000:05:00.0: AER: can't recover (no error_detected callback)
[ 1675.576302] pcieport 0000:03:02.0: AER: device recovery failed
[ 1675.576303] pcieport 0000:00:1c.0: AER: Uncorrected (Non-Fatal) error received: 0000:03:02.0
[ 1675.576317] pcieport 0000:03:02.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Requester ID)
[ 1675.586144] pcieport 0000:03:02.0: device [10b5:8619] error status/mask=00100000/00000000
[ 1675.593196] pcieport 0000:03:02.0: [20] UnsupReq (First)
[ 1675.598691] pcieport 0000:03:02.0: AER: TLP Header: 34000000 05001f10 00000000 88c888c8
[ 1675.605584] iwlwifi 0000:05:00.0: AER: can't recover (no error_detected callback)
[ 1675.605588] pcieport 0000:03:02.0: AER: device recovery failed
[ 1676.497155] pcieport 0000:00:1c.0: AER: Uncorrected (Non-Fatal) error received: 0000:03:02.0
[ 1676.497174] pcieport 0000:03:02.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Requester ID)
[ 1676.507015] pcieport 0000:03:02.0: device [10b5:8619] error status/mask=00100000/00000000
[ 1676.514091] pcieport 0000:03:02.0: [20] UnsupReq (First)
[ 1676.519599] pcieport 0000:03:02.0: AER: TLP Header: 34000000 05001f10 00000000 88c888c8
[ 1676.526491] iwlwifi 0000:05:00.0: AER: can't recover (no error_detected callback)
[ 1676.526516] pcieport 0000:03:02.0: AER: device recovery failed
[ 1676.526517] pcieport 0000:00:1c.0: AER: Uncorrected (Non-Fatal) error received: 0000:03:02.0
[ 1676.526531] pcieport 0000:03:02.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Requester ID)
[ 1676.536367] pcieport 0000:03:02.0: device [10b5:8619] error status/mask=00100000/00000000
[ 1676.543440] pcieport 0000:03:02.0: [20] UnsupReq (First)
[ 1676.548936] pcieport 0000:03:02.0: AER: TLP Header: 34000000 05001f10 00000000 88c888c8
[ 1676.555830] iwlwifi 0000:05:00.0: AER: can't recover (no error_detected callback)
[ 1676.555850] pcieport 0000:03:02.0: AER: device recovery failed
[ 1676.555851] pcieport 0000:00:1c.0: AER: Multiple Uncorrected (Non-Fatal) error received: 0000:03:02.0
[ 1676.555955] pcieport 0000:03:01.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Requester ID)
[ 1676.565792] pcieport 0000:03:01.0: device [10b5:8619] error status/mask=00100000/00000000
[ 1676.572846] pcieport 0000:03:01.0: [20] UnsupReq (First)
[ 1676.578344] pcieport 0000:03:01.0: AER: TLP Header: 34000000 04001f10 00000000 88c888c8
[ 1676.585268] pcieport 0000:03:02.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Requester ID)
[ 1676.595105] pcieport 0000:03:02.0: device [10b5:8619] error status/mask=00100000/00000000
[ 1676.602162] pcieport 0000:03:02.0: [20] UnsupReq (First)
[ 1676.607655] pcieport 0000:03:02.0: AER: TLP Header: 34000000 05001f10 00000000 88c888c8
[ 1676.614538] pcieport 0000:03:02.0: AER: Error of this Agent is reported first
[ 1676.620566] pcieport 0000:03:03.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Requester ID)
[ 1676.630398] pcieport 0000:03:03.0: device [10b5:8619] error status/mask=00100000/00000000
[ 1676.637454] pcieport 0000:03:03.0: [20] UnsupReq (First)
[ 1676.642945] pcieport 0000:03:03.0: AER: TLP Header: 34000000 06001f10 00000000 88c888c8
[ 1676.649840] pcieport 0000:03:05.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Requester ID)
[ 1676.659681] pcieport 0000:03:05.0: device [10b5:8619] error status/mask=00100000/00000000
[ 1676.666738] pcieport 0000:03:05.0: [20] UnsupReq (First)
[ 1676.672253] pcieport 0000:03:05.0: AER: TLP Header: 34000000 07001f10 00000000 88c888c8
[ 1676.679172] pcieport 0000:03:07.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Requester ID)
[ 1676.689002] pcieport 0000:03:07.0: device [10b5:8619] error status/mask=00100000/00000000
[ 1676.696055] pcieport 0000:03:07.0: [20] UnsupReq (First)
[ 1676.701550] pcieport 0000:03:07.0: AER: TLP Header: 34000000 08001f10 00000000 88c888c8
[ 1676.708461] iwlwifi 0000:04:00.0: AER: can't recover (no error_detected callback)
[ 1676.708467] pcieport 0000:03:01.0: AER: device recovery failed
[ 1676.708480] iwlwifi 0000:05:00.0: AER: can't recover (no error_detected callback)
[ 1676.708483] pcieport 0000:03:02.0: AER: device recovery failed
[ 1676.708496] iwlwifi 0000:06:00.0: AER: can't recover (no error_detected callback)
[ 1676.708499] pcieport 0000:03:03.0: AER: device recovery failed
[ 1676.708511] iwlwifi 0000:07:00.0: AER: can't recover (no error_detected callback)
[ 1676.708515] pcieport 0000:03:05.0: AER: device recovery failed
[ 1676.708541] iwlwifi 0000:08:00.0: AER: can't recover (no error_detected callback)
[ 1676.708544] pcieport 0000:03:07.0: AER: device recovery failed
[ 1676.893674] pcieport 0000:00:1c.0: AER: Uncorrected (Non-Fatal) error received: 0000:03:02.0
[ 1676.893692] pcieport 0000:03:02.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Requester ID)
[ 1676.903527] pcieport 0000:03:02.0: device [10b5:8619] error status/mask=00100000/00000000
[ 1676.910584] pcieport 0000:03:02.0: [20] UnsupReq (First)
[ 1676.916098] pcieport 0000:03:02.0: AER: TLP Header: 34000000 05001f10 00000000 88c888c8
[ 1676.923010] iwlwifi 0000:05:00.0: AER: can't recover (no error_detected callback)
[ 1676.923018] pcieport 0000:03:02.0: AER: device recovery failed
[ 1676.923018] pcieport 0000:00:1c.0: AER: Uncorrected (Non-Fatal) error received: 0000:03:02.0
[ 1676.923046] pcieport 0000:03:02.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Requester ID)
[ 1676.932876] pcieport 0000:03:02.0: device [10b5:8619] error status/mask=00100000/00000000
[ 1676.939929] pcieport 0000:03:02.0: [20] UnsupReq (First)
[ 1676.945425] pcieport 0000:03:02.0: AER: TLP Header: 34000000 05001f10 00000000 88c888c8
[ 1676.952319] iwlwifi 0000:05:00.0: AER: can't recover (no error_detected callback)
[ 1676.952325] pcieport 0000:03:02.0: AER: device recovery failed
[ 1676.952325] pcieport 0000:00:1c.0: AER: Multiple Uncorrected (Non-Fatal) error received: 0000:03:02.0
[ 1676.952462] pcieport 0000:03:01.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Requester ID)
[ 1676.962292] pcieport 0000:03:01.0: device [10b5:8619] error status/mask=00100000/00000000
[ 1676.969347] pcieport 0000:03:01.0: [20] UnsupReq (First)
[ 1676.974839] pcieport 0000:03:01.0: AER: TLP Header: 34000000 04001f10 00000000 88c888c8
[ 1676.981734] pcieport 0000:03:02.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Requester ID)
[ 1676.991560] pcieport 0000:03:02.0: device [10b5:8619] error status/mask=00100000/00000000
[ 1676.998614] pcieport 0000:03:02.0: [20] UnsupReq (First)
[ 1677.004107] pcieport 0000:03:02.0: AER: TLP Header: 34000000 05001f10 00000000 88c888c8
[ 1677.010991] pcieport 0000:03:02.0: AER: Error of this Agent is reported first
[ 1677.017014] pcieport 0000:03:03.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Requester ID)
[ 1677.026841] pcieport 0000:03:03.0: device [10b5:8619] error status/mask=00100000/00000000
[ 1677.033894] pcieport 0000:03:03.0: [20] UnsupReq (First)
[ 1677.039390] pcieport 0000:03:03.0: AER: TLP Header: 34000000 06001f10 00000000 88c888c8
[ 1677.046292] pcieport 0000:03:05.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Requester ID)
[ 1677.056118] pcieport 0000:03:05.0: device [10b5:8619] error status/mask=00100000/00000000
[ 1677.063174] pcieport 0000:03:05.0: [20] UnsupReq (First)
[ 1677.068667] pcieport 0000:03:05.0: AER: TLP Header: 34000000 07001f10 00000000 88c888c8
[ 1677.075575] pcieport 0000:03:07.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Requester ID)
[ 1677.085402] pcieport 0000:03:07.0: device [10b5:8619] error status/mask=00100000/00000000
[ 1677.092457] pcieport 0000:03:07.0: [20] UnsupReq (First)
[ 1677.097951] pcieport 0000:03:07.0: AER: TLP Header: 34000000 08001f10 00000000 88c888c8
[ 1677.104844] iwlwifi 0000:04:00.0: AER: can't recover (no error_detected callback)
[ 1677.104849] pcieport 0000:03:01.0: AER: device recovery failed
[ 1677.104881] iwlwifi 0000:05:00.0: AER: can't recover (no error_detected callback)
[ 1677.104884] pcieport 0000:03:02.0: AER: device recovery failed
[ 1677.104908] iwlwifi 0000:06:00.0: AER: can't recover (no error_detected callback)
[ 1677.104911] pcieport 0000:03:03.0: AER: device recovery failed
[ 1677.104938] iwlwifi 0000:07:00.0: AER: can't recover (no error_detected callback)
[ 1677.104943] pcieport 0000:03:05.0: AER: device recovery failed
[ 1677.104968] iwlwifi 0000:08:00.0: AER: can't recover (no error_detected callback)
[ 1677.104971] pcieport 0000:03:07.0: AER: device recovery failed
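To get a feel for how the spam is distributed, a small Python sketch like the
one below can tally the reports per port and the "can't recover" lines per
driver and device. The regular expressions are written against the message
format shown in the sample above and are an assumption about what other
kernels print; the function name is hypothetical.

```python
import re
from collections import Counter

# Patterns matched against the dmesg format shown in the sample above.
PORT_ERR = re.compile(r"pcieport (\S+): PCIe Bus Error: severity=([^,]+),")
NO_RECOVERY = re.compile(r"(\S+) (\S+): AER: can't recover")


def summarize_aer(lines):
    """Count AER reports per (port, severity) and failed recoveries per
    (device, driver)."""
    errors = Counter()
    unrecoverable = Counter()
    for line in lines:
        m = PORT_ERR.search(line)
        if m:
            errors[(m.group(1), m.group(2))] += 1
            continue
        m = NO_RECOVERY.search(line)
        if m:
            unrecoverable[(m.group(2), m.group(1))] += 1
    return errors, unrecoverable


# Three lines copied from the sample above.
sample = [
    "[ 1675.547023] pcieport 0000:03:02.0: PCIe Bus Error: "
    "severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Requester ID)",
    "[ 1675.576296] iwlwifi 0000:05:00.0: AER: can't recover "
    "(no error_detected callback)",
    "[ 1676.555955] pcieport 0000:03:01.0: PCIe Bus Error: "
    "severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Requester ID)",
]
errors, unrecoverable = summarize_aer(sample)
for key, count in errors.most_common():
    print(key, count)
for key, count in unrecoverable.most_common():
    print(key, count)
```

Feeding it a saved dmesg capture (for example, the lines of a text file)
instead of the three sample lines would show which downstream ports report
most often.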
Thanks,
Ben