Re: [PATCH] perf/x86/intel: ignore CondChgd bit to avoid false NMI handling

From: HATAYAMA Daisuke
Date: Thu Jun 12 2014 - 02:46:58 EST


From: Peter Zijlstra <peterz@xxxxxxxxxxxxx>
Subject: Re: [PATCH] perf/x86/intel: ignore CondChgd bit to avoid false NMI handling
Date: Wed, 11 Jun 2014 10:54:48 +0200

> On Wed, Jun 11, 2014 at 04:30:28PM +0900, HATAYAMA Daisuke wrote:
>> Currently, a NMI handler for NMI watchdog may falsely handle any NMI
>> signaled for different purpose if CondChgd bit in
>> MSR_CORE_PERF_GLOBAL_STATUS MSR is set.
>>
>> This commit deals with the issue simply by ignoring CondChgd bit.
>>
>> Here is explanation in detail.
>>
>> On x86 NMI watchdog uses performance monitoring feature to
>> periodically signal NMI each time performance counter gets overflowed.
>>
>> intel_pmu_handle_irq() is called as a NMI_LOCAL handler from a NMI
>> handler of NMI watchdog, perf_event_nmi_handler(). It identifies owner
>> of a given NMI by looking at overflow status bits in
>> MSR_CORE_PERF_GLOBAL_STATUS MSR. If some of the bits are set, then it
>> handles the given NMI as its own NMI.
>>
>> The problem is that intel_pmu_handle_irq() doesn't distinguish
>> CondChgd bit from other bits. Unlike the other status bits, CondChgd
>> bit doesn't represent overflow status for performance counters. Thus,
>> CondChgd bit cannot be thought of as a mark indicating a given NMI is
>> NMI watchdog's.
>
> So what was the problem? It ate another NMI?
>

Yes.

The handler for NMI watchdog is called in NMI_LOCAL. This is done in
earlier timing than NMI_UNKNOWN, and if some of the handlers in
NMI_LOCAL handles at least one NMI, then handlers in NMI_UNKNOWN are
not called. Like this:

handled = nmi_handle(NMI_LOCAL, regs, b2b);
__this_cpu_add(nmi_stats.normal, handled);
if (handled) {
/*
* There are cases when a NMI handler handles multiple
* events in the current NMI. One of these events may
* be queued for in the next NMI. Because the event is
* already handled, the next NMI will result in an unknown
* NMI. Instead lets flag this for a potential NMI to
* swallow.
*/
if (handled > 1)
__this_cpu_write(swallow_nmi, true);
return;
}

For example, we use unknown NMI to make system panic by enabling
kernel.unkown_nmi_panic. Without this fix, unknown NMI is robbed by
NMI handler for NMI watchdog.

>> I noticed this behavior on systems with Ivy Bridge processors: Intel
>> Xeon CPU E5-2630 v2 and Intel Xeon CPU E7-8890 v2. On both systems,
>> CondChgd bit in MSR_CORE_PERF_GLOBAL_STATUS MSR has already been set
>> in the beginning at boot. (then the CondChgd bit is cleared by next
>> wrmsr to MSR_CORE_PERF_GLOBAL_CTRL MSR and appears to remain 0.)
>>
>> On the other hand, on older processors such as Nehalem, CondChgd bit
>> is not set in the beginning at boot.
>>
>> I'm not sure about exact behavior of CondChgd bit, in particular when
>> this bit is set. Although I read Intel System Programmer's Manual to
>> figure out but I have yet completed that. At least, I think ignoring
>> CondChgd bit should be enough for NMI watchdog perspective.
>
> So yes, the SDM lists the bit as existing but never once mentions it
> outside of that, and its been doing that at least back to 2008.
>
> Ooh, I found it:
>
> "The IA32_PERF_GLOBAL_STATUS MSR also provides a ʽsticky bitʼ to
> indicate changes to the state of performance monitoring hardware (see
> Figure 18-29)."
>
> Which is of course completely useless, not to mention inconsistent with
> the later CondChgd name.
>
> HPA, can you explain wtf that bit does and why hatayama-san's ivb feels
> like having that set on boot?

--
Thanks.
HATAYAMA, Daisuke