Re: x86 PMU broken in current Linus' tree

From: Jiri Kosina
Date: Tue Aug 02 2016 - 09:33:05 EST


On Tue, 2 Aug 2016, Peter Zijlstra wrote:

> > With current Linus' tree (HEAD == 731c7d3a20), I am getting bogus MSR
> > write warning during bootup, and kernel panic when shutting PMUs down
> > during poweroff.
> >
> > The MSR warning is below, the camera capture of the poweroff panic can be
> > found at
> >
> > http://www.jikos.cz/jikos/junk/pmu-panic.jpg
> >
> > The last previous kernel version that I've booted on this particular
> > machine was 4.7.0-rc4, and it had neither of those symptoms, so I can
> > eventually bisect if needed.
> >
> > === [ snip ] ==
> > [ 0.136000] smpboot: CPU0: Intel(R) Core(TM)2 Duo CPU L9400 @ 1.86GHz (family: 0x6, model: 0x17, stepping: 0x6)
> > [ 0.136000] Performance Events: PEBS fmt0+, Core2 events, Intel PMU driver.
> > [ 0.136000] ... version: 2
> > [ 0.136000] ... bit width: 40
> > [ 0.136000] ... generic registers: 2
> > [ 0.136000] ... value mask: 000000ffffffffff
> > [ 0.136000] ... max period: 000000007fffffff
> > [ 0.136000] ... fixed-purpose events: 3
> > [ 0.136000] ... event mask: 0000000700000003
> > [ 0.136000] NMI watchdog: enabled on all CPUs, permanently consumes one hw-PMU counter.
> > [ 0.136000] unchecked MSR access error: WRMSR to 0xdf (tried to write 0x000000ff80000001) at rIP: 0xffffffff90004acc (x86_perf_event_set_period+0xdc/0x190)
>
> 'Curious'.. :/
>
> x86_perf_event_set_period() only does:
>
> wrmsrl(hwc->event_base, (u64)(-left) & x86_pmu.cntval_mask);
>
> and hwc->event_base ends up being:
>
> MSR_ARCH_PERFMON_PERFCTR0 + index
>
> From which we can deduce that index = 0xdf - 0xc1 = 30, which is
> somewhat larger than the max reported number of counters (2).
>
> Lemme go see how that can happen.

FTR, I tried the very same kernel on a Xeon E5, and the issue didn't pop up.
So it might be specific to the older Core2, or otherwise not a completely
generic problem.
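
The index arithmetic quoted above can be double-checked with a small
standalone C sketch. This is not kernel code: the helper names are
hypothetical and exist only for illustration, and the constants (0xc1 for
MSR_ARCH_PERFMON_PERFCTR0, the 40-bit value mask, the 0x7fffffff max period,
2 generic counters) are copied from the log and the quoted analysis.

```c
#include <stdint.h>

/* Base MSR of the generic counters, as used in the quoted arithmetic. */
#define MSR_ARCH_PERFMON_PERFCTR0 0xc1u

/* Hypothetical helper: recover the counter index from a faulting MSR address. */
static unsigned int perfctr_index_from_msr(uint32_t msr)
{
	return msr - MSR_ARCH_PERFMON_PERFCTR0;
}

/* Hypothetical helper: an index is plausible only below the counter count. */
static int perfctr_index_is_valid(unsigned int index, unsigned int num_counters)
{
	return index < num_counters;
}

/*
 * Hypothetical helper mirroring the quoted expression
 * (u64)(-left) & x86_pmu.cntval_mask, with the mask passed in explicitly.
 */
static uint64_t counter_start_value(int64_t left, uint64_t cntval_mask)
{
	return (uint64_t)(-left) & cntval_mask;
}
```

With the numbers from the log, perfctr_index_from_msr(0xdf) gives 30, which
perfctr_index_is_valid(30, 2) rejects, while
counter_start_value(0x7fffffff, 0x000000ffffffffff) gives 0x000000ff80000001.
So the value being written looks consistent with the reported max period and
value mask, and only the target MSR address is off.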

--
Jiri Kosina
SUSE Labs