Re: unknown NMI on AMD Rome

From: Alexander Monakov
Date: Wed Mar 17 2021 - 09:33:09 EST


On Wed, 17 Mar 2021, Peter Zijlstra wrote:

> On Wed, Mar 17, 2021 at 09:48:29AM +0100, Ingo Molnar wrote:
> > > https://developer.amd.com/wp-content/resources/56323-PUB_0.78.pdf
> >
> > So:
> >
> >
> > 1215 IBS (Instruction Based Sampling) Counter Valid Value
> > May be Incorrect After Exit From Core C6 (CC6) State
> >
> > Description
> >
> > If a core's IBS feature is enabled and configured to generate an interrupt, including NMI (Non-Maskable
> > Interrupt), and the IBS counter overflows during the entry into the Core C6 (CC6) state, the interrupt may be
> > issued, but an invalid value of the valid bit may be restored when the core exits CC6.
> >
> > Potential Effect on System
> >
> > The operating system may receive interrupts due to an IBS counter event, including NMI, and not observe a
> > valid IBS register. Console messages indicating "NMI received for unknown reason" have been observed on
> > Linux systems.
> >
> > Suggested Workaround: None
> > Fix Planned: No fix planned
>
> Should be simple enough to disable CC6 while IBS is in use. Kim, can you
> please make that happen?

Wouldn't that "magically" speed up workloads running under 'perf top'
significantly, as long as they don't saturate the CPUs? Scheduling gets
much snappier if the target CPU doesn't need to wake up from deep sleep :)

Alternatively, would you consider adding the erratum reference to the
printk message when IBS is in use, and rate-limiting it so it doesn't
flood dmesg? Then the user will know what's going on, and may
choose to temporarily disable C-states using the 'cpupower' tool.
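
To make the suggestion concrete, here is a rough user-space model of what I
mean. It is only a sketch, not kernel code: in the kernel one would
presumably use the existing printk_ratelimited()/pr_warn_ratelimited()
helpers instead. The struct, the function names and the exact wording of
the message are all made up for illustration; only the erratum number
(1215) comes from the revision guide quoted above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical user-space model of a rate-limited warning; not kernel code. */
struct ratelimit {
	long interval;	/* window length, in seconds */
	int burst;	/* messages allowed per window */
	long begin;	/* start time of the current window */
	int printed;	/* messages emitted in the current window */
	int missed;	/* messages suppressed in the current window */
};

/* Return true if a message may be emitted at time 'now' (in seconds). */
static bool ratelimit_ok(struct ratelimit *rl, long now)
{
	assert(rl->interval > 0 && rl->burst > 0);

	/* Open a fresh window once the previous one has expired. */
	if (now - rl->begin >= rl->interval) {
		rl->begin = now;
		rl->printed = 0;
		rl->missed = 0;
	}
	if (rl->printed < rl->burst) {
		rl->printed++;
		return true;
	}
	rl->missed++;
	return false;
}

/*
 * Print the "unknown NMI" warning, appending a pointer to the erratum
 * when IBS is in use. Returns true if the message was actually printed.
 */
static bool warn_unknown_nmi(struct ratelimit *rl, long now, bool ibs_in_use)
{
	if (!ratelimit_ok(rl, now))
		return false;

	fprintf(stderr, "NMI received for unknown reason%s\n",
		ibs_in_use ? " (IBS in use: possibly AMD erratum 1215)" : "");
	return true;
}
```

With interval = 5 and burst = 2, two warnings get through per five-second
window and the rest are only counted in 'missed', so dmesg stays readable
while the user still gets pointed at the erratum.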

Alexander