Re: 2.6.31-rc5 regression: x86 MCE malfunction on Thinkpad T42p

From: Johannes Stezenbach
Date: Mon Aug 10 2009 - 18:13:20 EST


On Mon, Aug 10, 2009 at 11:31:33PM +0200, Ingo Molnar wrote:
> * Johannes Stezenbach <js@xxxxxxxxx> wrote:
> >
> > # cat /proc/cpuinfo
> > processor : 0
> > vendor_id : GenuineIntel
> > cpu family : 6
> > model : 13
> > model name : Intel(R) Pentium(R) M processor 1.80GHz
>
> ah, yes. There's no cache-references/misses, because in
> arch/x86/kernel/cpu/perf_counter.c we have two zero entries:
>
> static const u64 p6_perfmon_event_map[] =
> {
> [PERF_COUNT_HW_CPU_CYCLES] = 0x0079,
> [PERF_COUNT_HW_INSTRUCTIONS] = 0x00c0,
> [PERF_COUNT_HW_CACHE_REFERENCES] = 0x0000, <----------
> [PERF_COUNT_HW_CACHE_MISSES] = 0x0000, <----------
> [PERF_COUNT_HW_BRANCH_INSTRUCTIONS] = 0x00c4,
> [PERF_COUNT_HW_BRANCH_MISSES] = 0x00c5,
> [PERF_COUNT_HW_BUS_CYCLES] = 0x0062,
> };
>
> i.e. PERF_COUNT_HW_CACHE_REFERENCES and PERF_COUNT_HW_CACHE_MISSES
> is not filled in yet.
>
> Could you try something like:
>
> perf stat -e r0f2e true
>
> (0x2e: L2 requests, 0x0f: all units)
>
> if i checked the docs right that counter would give us L2 cache
> stats - does it display non-zero values?

# ./perf stat -e r0f2e true

Performance counter stats for 'true':

10584 raw 0xf2e

0.001159924 seconds time elapsed

The number also increases for programs larger than "true".

According to /usr/share/oprofile/i386/p6_mobile/events and
http://oprofile.sourceforge.net/docs/intel-p6-mobile-events.php
event 0x2e with unit mask 0x0f is "L2 requests, all units", but I
can't tell from that how to count cache references vs. misses.
Or would unit mask 0x0e vs. 0x01 do it?

# ./perf stat -e r0e2e true

Performance counter stats for 'true':

10147 raw 0xe2e

0.001121651 seconds time elapsed

# ./perf stat -e r012e true

Performance counter stats for 'true':

468 raw 0x12e

0.001130870 seconds time elapsed
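
For reference, a minimal sketch of how the raw codes above break down,
assuming the usual x86 layout where the low byte of the raw value is the
event select and the next byte is the unit mask. The helper name
decode_raw_event is only illustrative, and the counts are copied from the
three runs above. Those were separate runs, so the 0x0e and 0x01 counts
can only be expected to add up roughly to the 0x0f count:

```python
def decode_raw_event(raw):
    """Split a perf raw event such as 'r0f2e' into (event_select, unit_mask).

    Assumes bits 0-7 hold the event select and bits 8-15 the unit mask.
    """
    if raw.startswith("r"):
        raw = raw[1:]
    config = int(raw, 16)
    event_select = config & 0xFF
    unit_mask = (config >> 8) & 0xFF
    return event_select, unit_mask


# Counts taken from the three "perf stat ... true" runs in this mail.
counts = {"r0f2e": 10584, "r0e2e": 10147, "r012e": 468}

for raw, count in counts.items():
    event, umask = decode_raw_event(raw)
    print(f"{raw}: event=0x{event:02x} umask=0x{umask:02x} count={count}")

# Compare the sum of the two partial unit masks with the "all units" run.
split_sum = counts["r0e2e"] + counts["r012e"]
difference = split_sum - counts["r0f2e"]
print(f"0x0e + 0x01 = {split_sum}, 0x0f = {counts['r0f2e']}, "
      f"difference = {difference}")
```

The two partial counts sum to within about 0.3% of the 0x0f run, which is
at least consistent with 0x0e and 0x01 splitting the same event, though
it says nothing about which of them means "miss".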


> > Could the warning be caused by the cpufreq ondemand governor? ISTR
> > that one should switch to the performance governor before doing
> > any profiling, but I forgot for this test.
>
> there might be a connection - it could in theory cause sched_clock()
> transients and confuse the ring-buffer time-stamping.

I'll check tomorrow, after a fresh boot, whether the warning also
appears with the performance governor.
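
A small sketch of how the active governor could be checked before a
profiling run. It assumes the usual cpufreq sysfs file for CPU 0; the
path and the helper names read_governor and ok_for_profiling are
assumptions for illustration, not something taken from the perf tools:

```python
from pathlib import Path

# Assumed sysfs location of the cpufreq governor for CPU 0.
GOVERNOR_PATH = "/sys/devices/system/cpu/cpu0/cpufreq/scaling_governor"


def read_governor(path=GOVERNOR_PATH):
    """Return the governor name, or None if the file cannot be read."""
    try:
        return Path(path).read_text().strip()
    except OSError:
        return None


def ok_for_profiling(governor):
    """True only when the fixed-frequency 'performance' governor is active."""
    return governor == "performance"


governor = read_governor()
if governor is None:
    print("cpufreq governor not readable (no cpufreq support?)")
elif ok_for_profiling(governor):
    print("governor is 'performance'")
else:
    print(f"governor is '{governor}', frequency may change during the run")
```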


Thanks
Johannes