Re: unknown NMI on AMD Rome

From: Ingo Molnar
Date: Wed Mar 17 2021 - 04:49:08 EST



* Kim Phillips <kim.phillips@xxxxxxx> wrote:

> On 3/16/21 2:53 PM, Peter Zijlstra wrote:
> > On Tue, Mar 16, 2021 at 04:45:02PM +0100, Jiri Olsa wrote:
> >> hi,
> >> when running 'perf top' on AMD Rome (/proc/cpuinfo below)
> >> with fedora 33 kernel 5.10.22-200.fc33.x86_64
> >>
> >> we got unknown NMI messages:
> >>
> >> [ 226.700160] Uhhuh. NMI received for unknown reason 3d on CPU 90.
> >> [ 226.700162] Do you have a strange power saving mode enabled?
> >> [ 226.700163] Dazed and confused, but trying to continue
> >> [ 226.769565] Uhhuh. NMI received for unknown reason 3d on CPU 84.
> >> [ 226.769566] Do you have a strange power saving mode enabled?
> >> [ 226.769567] Dazed and confused, but trying to continue
> >> [ 226.769771] Uhhuh. NMI received for unknown reason 2d on CPU 24.
> >> [ 226.769773] Do you have a strange power saving mode enabled?
> >> [ 226.769774] Dazed and confused, but trying to continue
> >> [ 226.812844] Uhhuh. NMI received for unknown reason 2d on CPU 23.
> >> [ 226.812846] Do you have a strange power saving mode enabled?
> >> [ 226.812847] Dazed and confused, but trying to continue
> >> [ 226.893783] Uhhuh. NMI received for unknown reason 2d on CPU 27.
> >> [ 226.893785] Do you have a strange power saving mode enabled?
> >> [ 226.893786] Dazed and confused, but trying to continue
> >> [ 226.900139] Uhhuh. NMI received for unknown reason 2d on CPU 40.
> >> [ 226.900141] Do you have a strange power saving mode enabled?
> >> [ 226.900143] Dazed and confused, but trying to continue
> >> [ 226.908763] Uhhuh. NMI received for unknown reason 3d on CPU 120.
> >> [ 226.908765] Do you have a strange power saving mode enabled?
> >> [ 226.908766] Dazed and confused, but trying to continue
> >> [ 227.751296] Uhhuh. NMI received for unknown reason 2d on CPU 83.
> >> [ 227.751298] Do you have a strange power saving mode enabled?
> >> [ 227.751299] Dazed and confused, but trying to continue
> >> [ 227.752937] Uhhuh. NMI received for unknown reason 3d on CPU 23.
> >>
> >> also when discussing this with Borislav, he managed to reproduce easily
> >> on his AMD Rome machine
> >>
> >> any idea?
> >
> > Kim is the AMD point person for this I think..
>
> Since perf top requests precise sampling, and therefore IBS,
> this looks like it's hitting erratum #1215:
>
> https://developer.amd.com/wp-content/resources/56323-PUB_0.78.pdf

So:


1215 IBS (Instruction Based Sampling) Counter Valid Value
May be Incorrect After Exit From Core C6 (CC6) State

Description

If a core's IBS feature is enabled and configured to generate an interrupt, including NMI (Non-Maskable
Interrupt), and the IBS counter overflows during the entry into the Core C6 (CC6) state, the interrupt may be
issued, but an invalid value of the valid bit may be restored when the core exits CC6.

Potential Effect on System

The operating system may receive interrupts due to an IBS counter event, including NMI, and not observe a
valid IBS register. Console messages indicating "NMI received for unknown reason" have been observed on
Linux systems.

Suggested Workaround: None
Fix Planned: No fix planned

lovely.
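
For anyone who wants to check their own dmesg against the pattern in Jiri's
report, here is a minimal sketch, not a definitive tool. It tallies the
"unknown reason" lines by reason byte and by CPU, and checks the reason byte
against the two bits the x86 NMI code treats as known sources. The function
names are made up for illustration; the 0x80 and 0x40 masks are assumed to
match the kernel's NMI_REASON_SERR and NMI_REASON_IOCHK definitions.

```python
import re
from collections import Counter

# Assumed to mirror the kernel's NMI_REASON_SERR / NMI_REASON_IOCHK masks.
NMI_REASON_SERR = 0x80
NMI_REASON_IOCHK = 0x40

# Matches the console message format shown in the report.
UNKNOWN_NMI = re.compile(
    r"NMI received for unknown reason ([0-9a-fA-F]{2}) on CPU (\d+)"
)


def summarize_unknown_nmis(log_text):
    """Return (per-reason counts, per-CPU counts) for unknown-NMI lines."""
    by_reason = Counter()
    by_cpu = Counter()
    for match in UNKNOWN_NMI.finditer(log_text):
        by_reason[int(match.group(1), 16)] += 1
        by_cpu[int(match.group(2))] += 1
    return by_reason, by_cpu


def has_known_source(reason):
    """True if the reason byte has the SERR or IOCHK bit set."""
    return bool(reason & (NMI_REASON_SERR | NMI_REASON_IOCHK))


# A few lines taken from the report, used as sample input.
sample = """\
[ 226.700160] Uhhuh. NMI received for unknown reason 3d on CPU 90.
[ 226.769771] Uhhuh. NMI received for unknown reason 2d on CPU 24.
[ 227.752937] Uhhuh. NMI received for unknown reason 3d on CPU 23.
"""

by_reason, by_cpu = summarize_unknown_nmis(sample)
for reason, count in sorted(by_reason.items()):
    print(f"reason {reason:#04x}: {count} hit(s), "
          f"known source: {has_known_source(reason)}")
```

Neither 2d nor 3d has either of those bits set, which is consistent with the
NMI ending up in the "unknown reason" path once no handler claims it.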

Thanks,

Ingo