[BUG] Core2 cpu triggers hard lockup with perf test

From: Jiri Olsa
Date: Sat Feb 27 2016 - 07:37:22 EST


hi,
we are getting hard lockups on Core2 cpus (model 23)
just by running 'perf test'

PID: 10425 TASK: ffff880068562e00 CPU: 3 COMMAND: "perf"
#0 [ffff88007d985a08] machine_kexec at ffffffff8105521b
#1 [ffff88007d985a68] crash_kexec at ffffffff810f7412
#2 [ffff88007d985b38] panic at ffffffff8163c031
#3 [ffff88007d985bb8] watchdog_overflow_callback at ffffffff81120472
#4 [ffff88007d985bc8] __perf_event_overflow at ffffffff81164e0e
#5 [ffff88007d985c00] perf_event_overflow at ffffffff81165a44
#6 [ffff88007d985c10] intel_pmu_handle_irq at ffffffff81033198
#7 [ffff88007d985e60] perf_event_nmi_handler at ffffffff8164be8b
#8 [ffff88007d985e80] nmi_handle at ffffffff8164b5d9
#9 [ffff88007d985ec8] do_nmi at ffffffff8164b789
#10 [ffff88007d985ef0] end_repeat_nmi at ffffffff8164aa13
[exception RIP: intel_pmu_enable_all+17]
RIP: ffffffff81032301 RSP: ffff88005e917c98 RFLAGS: 00000046
RAX: ffff88007d98cd20 RBX: ffff88005e991000 RCX: 000000000000038f
RDX: 0000000000000007 RSI: 0000000000000003 RDI: 0000000000000000
RBP: ffff88005e917cd8 R8: ffffffffffffff85 R9: 000000ffffffffff
R10: ffff88007d98c100 R11: ffff88005e9179e0 R12: ffff88007d98bd10
R13: ffff88007d98b9e0 R14: ffff88007d98bc08 R15: 0000000000000002
ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
--- <NMI exception stack> ---
#11 [ffff88005e917c98] intel_pmu_enable_all at ffffffff81032301
#12 [ffff88005e917c98] x86_pmu_enable at ffffffff8102ba24
#13 [ffff88005e917ce0] perf_pmu_enable at ffffffff81160457
#14 [ffff88005e917cf0] perf_event_context_sched_in at ffffffff81161930
#15 [ffff88005e917d20] perf_event_exec at ffffffff811621db
#16 [ffff88005e917d68] setup_new_exec at ffffffff811edffd
#17 [ffff88005e917d88] load_elf_binary at ffffffff81240ed9
#18 [ffff88005e917e58] search_binary_handler at ffffffff811ec89d
#19 [ffff88005e917ea0] do_execve_common at ffffffff811ede04
#20 [ffff88005e917f30] sys_execve at ffffffff811ee199
#21 [ffff88005e917f50] stub_execve at ffffffff816531a9

the reproducer seems to be a hw event with a very small
period, like (thanks Arnaldo ;-):
perf record -e cycles -c 123 kill
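
for anyone who wants to poke at this without the perf tool, here is a
minimal sketch of roughly what that command line asks the kernel for:
a hardware cycles event with a fixed sample period of 123. The helper
names (init_cycles_attr, sys_perf_event_open) are made up for
illustration, and the disabled/enable_on_exec setup is an assumption
about how perf record starts a forked command (it would match the
perf_event_exec frame in the trace above). It has not been verified
that this alone triggers the lockup.

```c
#define _GNU_SOURCE
#include <linux/perf_event.h>
#include <string.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: describe a hardware cycles event with a fixed
 * sample period, roughly what '-e cycles -c <period>' requests.
 */
void init_cycles_attr(struct perf_event_attr *attr, unsigned long long period)
{
	memset(attr, 0, sizeof(*attr));
	attr->size = sizeof(*attr);
	attr->type = PERF_TYPE_HARDWARE;
	attr->config = PERF_COUNT_HW_CPU_CYCLES;
	attr->sample_period = period;	/* fixed period, so attr->freq stays 0 */
	attr->sample_type = PERF_SAMPLE_IP;

	/*
	 * Assumption: start disabled and let the kernel enable the event
	 * when the monitored task calls exec.
	 */
	attr->disabled = 1;
	attr->enable_on_exec = 1;
}

/* glibc has no wrapper for perf_event_open, so go through syscall(2). */
long sys_perf_event_open(struct perf_event_attr *attr, pid_t pid, int cpu,
			 int group_fd, unsigned long flags)
{
	return syscall(SYS_perf_event_open, attr, pid, cpu, group_fd, flags);
}
```

to use it, fill the attr with init_cycles_attr(&attr, 123) and pass it
to sys_perf_event_open() with the pid of the task that is about to
exec, cpu = -1, group_fd = -1 and flags = 0. A negative return value
means the event could not be opened (check errno).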

I bisected it down to:
156174999dd1 perf/intel/x86: Enlarge the PEBS buffer

Looks like the bigger PEBS buffer, together with the event being
marked as PERF_X86_EVENT_FREERUNNING, blocks the CPU right after
the event is enabled, before it can reach local_irq_enable, and
that triggers the NMI watchdog.

I can't find what's special about the Core2 CPU PEBS setup,
it seems that other CPUs are ok (tried on ivb/snb/hsw).

reverting 156174999dd1 fixed the issue for me

ideas? thanks,
jirka