Re: Fwd: [WARNING AND ERROR] may be system slow and audio and video breaking

From: Borislav Petkov
Date: Sun Oct 18 2020 - 17:05:59 EST


On Mon, Oct 19, 2020 at 01:51:34AM +0530, Jeffrin Jose T wrote:
> On Sun, 2020-10-18 at 19:49 +0200, Borislav Petkov wrote:
> > On Sun, Oct 18, 2020 at 10:42:39PM +0530, Jeffrin Jose T wrote:
> > > smpboot: Scheduler frequency invariance went wobbly, disabling!
> > > [ 1112.592866] unchecked MSR access error: RDMSR from 0x123 at rIP:
> > > 0xffffffffb5c9a184 (native_read_msr+0x4/0x30)

Ok, you forgot to say in your initial mail that this happens when you
suspend your laptop.

Now, this unchecked MSR error happens only once: that early during
resume, the microcode on CPU1 has not been updated yet, so the 0x123 MSR
is not "emulated" by the microcode, so to speak, hence the warning. Why
the microcode is not updated at that point needs to be debugged
separately, and I'll try to reproduce that on my machine.

That warning doesn't happen anymore, though, once the microcode is
updated.

But what happens after that is you get a flood of correctable PCIe
errors about a transaction to a device timing out:

pcieport 0000:00:1c.5: AER: Corrected error received: 0000:00:1c.5
pcieport 0000:00:1c.5: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Transmitter ID)
pcieport 0000:00:1c.5: device [8086:9d15] error status/mask=00001000/00002000
pcieport 0000:00:1c.5: [12] Timeout

and it looks like that flood is slowing down the machine because it is
busy logging them.
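
For reference, the status/mask values in those log lines can be decoded
by hand. The following is only a rough sketch, not what the kernel does
internally: it assumes the standard bit layout of the PCIe AER
Correctable Error Status register, and the helper names (decode_bits,
decode_line) are made up for illustration.

```python
import re

# Bit positions in the PCIe AER Correctable Error Status register
# (assumed from the PCIe specification; the kernel prints shorter
# names, e.g. "Timeout" for bit 12).
CORRECTABLE_BITS = {
    0: "Receiver Error",
    6: "Bad TLP",
    7: "Bad DLLP",
    8: "REPLAY_NUM Rollover",
    12: "Replay Timer Timeout",
    13: "Advisory Non-Fatal Error",
    14: "Corrected Internal Error",
    15: "Header Log Overflow",
}

# Matches the "error status/mask=XXXXXXXX/XXXXXXXX" part of an AER log line.
STATUS_RE = re.compile(r"error status/mask=([0-9a-fA-F]{8})/([0-9a-fA-F]{8})")


def decode_bits(value):
    """Return the names of all known correctable-error bits set in value."""
    return [name for bit, name in sorted(CORRECTABLE_BITS.items())
            if value & (1 << bit)]


def decode_line(line):
    """Decode status and mask from one log line, or return None if absent."""
    m = STATUS_RE.search(line)
    if m is None:
        return None
    status, mask = (int(g, 16) for g in m.groups())
    return {"status": decode_bits(status), "mask": decode_bits(mask)}


line = ("pcieport 0000:00:1c.5: device [8086:9d15] "
        "error status/mask=00001000/00002000")
print(decode_line(line))
```

With the line from the log above, the status decodes to bit 12 (the
"[12] Timeout" entry) and the mask to bit 13.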

Do

# lspci -nn -xxx

as root. It'll show us which device that 8086:9d15 is.
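
If the full dump is long, the relevant entry can be picked out by its
[vendor:device] ID. A minimal sketch, assuming the usual "lspci -nn"
header format where the ID appears in square brackets; the find_device
helper and the sample text are placeholders, not real output from this
machine:

```python
def find_device(lspci_nn_output, vendor_device):
    """Return the lspci -nn header lines that carry the given vendor:device ID."""
    needle = "[" + vendor_device.lower() + "]"
    return [l for l in lspci_nn_output.splitlines() if needle in l.lower()]


# Placeholder text standing in for real "lspci -nn -xxx" output; the
# device name is deliberately left out because it is not known yet.
sample = """\
00:1c.5 PCI bridge [0604]: (device name goes here) [8086:9d15]
00: 86 80 15 9d
"""

for hit in find_device(sample, "8086:9d15"):
    print(hit)
```

lspci can also do this filtering itself with its -d option, e.g.
"lspci -nn -d 8086:9d15".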

Thx.

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette