RE: MCE Bug?

From: Luck, Tony
Date: Wed Jun 17 2015 - 19:53:59 EST


> if you want to give those changes a run, I've uploaded them here:
>
> git://git.kernel.org/pub/scm/linux/kernel/git/ras/ras.git#tip-ras

Latest experiments show that sometimes checking keventd_up() before calling schedule_work()
helps ... but mostly only when I fake some early logs from low-numbered cpus. I added some
traces to the real case of a left-over fatal error and got this splat:

[ 0.331551] smpboot: CPU0: Intel(R) Xeon(R) CPU E7-8890 v3 @ 2.50GHz (fam: 06, model: 3f, stepping: 04)
[ 0.342117] Performance Events: PEBS fmt2+, 16-deep LBR, Haswell events, full-width counters, Intel PMU driver.
[ 0.353471] ... version: 3
[ 0.357948] ... bit width: 48
[ 0.362523] ... generic registers: 4
[ 0.367000] ... value mask: 0000ffffffffffff
[ 0.372935] ... max period: 0000ffffffffffff
[ 0.378870] ... fixed-purpose events: 3
[ 0.383347] ... event mask: 000000070000000f
[ 0.392357] x86: Booting SMP configuration:
[ 0.397031] .... node #0, CPUs: #1
[ 0.423373] NMI watchdog: enabled on all CPUs, permanently consumes one hw-PMU counter.
[ 0.432705] #2 #3 #4 #5 #6 #7 #8 #9 #10 #11 #12 #13 #14 #15 #16 #17
[ 0.706878] .... node #1, CPUs: #18 #19 #20 #21 #22 #23 #24 #25 #26 #27 #28 #29 #30 #31 #32 #33 #34 #35
[ 1.094625] .... node #2, CPUs: #36
[ 1.112958] mcelog: cpu 36 bank 8 status be00000000010090
[ 1.119201] mcelog() stashed at entry=0
[ 1.203602] mce: [Hardware Error]: Machine check events logged
[ 1.220313] #37
[ 1.220412] BUG: unable to handle kernel
[ 1.226954] #38
[ 1.229107] NULL pointer dereference at 0000000000000008
[ 1.235052] IP: [<ffffffff810980a1>] process_one_work+0x31/0x420
[ 1.236829] #39PGD 0
[ 1.244558] Oops: 0000 [#1] SMP
[ 1.248189] Modules linked in:
[ 1.251617] CPU: 36 PID: 263 Comm: kworker/36:0 Not tainted 4.1.0-rc8 #9
[ 1.259100] #40
[ 1.259100] Hardware name: Intel Corporation BRICKLAND/BRICKLAND, BIOS BRHSXSD1.86B.0065.R01.1505011640 05/01/2015
[ 1.272832] #41
[ 1.272833] task: ffff88181c1f4470 ti: ffff88181bd24000 task.ti: ffff88181bd24000
[ 1.283350] RIP: 0010:[<ffffffff810980a1>] [ 1.286433] #42
[<ffffffff810980a1>] process_one_work+0x31/0x420
[ 1.294976] RSP: 0000:ffff88181bd27e08 EFLAGS: 00010046

I.e. we die on the first attempt to log ... but that attempt is a long way into bringing up all the cpus.
CPU#36 is the first cpu on socket 2 (counting sockets 0, 1, 2, 3).
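
For reference, the guard mentioned above (check keventd_up() before calling
schedule_work()) can be modelled in userspace roughly as follows. This is only
a sketch under stated assumptions: every model_* name and all three state
variables are hypothetical stand-ins, not the kernel's workqueue API and not the
code in the ras tree. It shows the deferral idea only; it does not explain or
fix the oops in the splat above, which happens well after early boot.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for kernel state; not the real workqueue API. */
static bool wq_ready;        /* models what keventd_up() would report */
static bool log_pending;     /* a record was stashed before workqueues existed */
static int  work_scheduled;  /* counts calls to the modelled schedule_work() */

static bool model_keventd_up(void)
{
	return wq_ready;
}

static void model_schedule_work(void)
{
	/* Scheduling before the workqueue exists is the failure being avoided. */
	assert(wq_ready);
	work_scheduled++;
}

/* Called whenever a machine check record is logged. */
static void model_mce_log(void)
{
	if (model_keventd_up())
		model_schedule_work();
	else
		log_pending = true;	/* too early: remember it for later */
}

/* Called once, after the workqueues have been set up. */
static void model_workqueue_ready(void)
{
	wq_ready = true;
	if (log_pending) {
		log_pending = false;
		model_schedule_work();	/* flush the deferred notification */
	}
}
```

In this model an early model_mce_log() only sets log_pending, and the
notification goes out once model_workqueue_ready() runs; later calls schedule
work directly.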

-Tony