Re: [Patch V2] x86, mce: Ensure offline CPU's don't participate in mce rendezvous process.

From: Borislav Petkov
Date: Tue Dec 08 2015 - 13:57:03 EST


On Tue, Dec 08, 2015 at 03:59:58PM +0000, Luck, Tony wrote:
> > No, the system did panic both times. The "strange" observation is
> > that the MCE gets reported only on the cores on node 0. Or at least only
> > the printks from mce_panic() on the cores on node0 reach the serial
> > console.
>
> You only see messages and logs from node0, because the cpus there are
> the only ones that see any errors logged in their banks.
>
> The cpus on nodes 1, 2 and 3 scan all banks and find nothing, so they say nothing.

Right, sure, of course. Doh!

Confirmation:

[ 183.840517] mce: do_machine_check: CPU: 30
[ 183.840531] mce: do_machine_check: CPU: 27
[ 183.840536] mce: do_machine_check: CPU: 29
[ 183.840541] mce: do_machine_check: CPU: 56
[ 183.840546] mce: do_machine_check: CPU: 28
[ 183.840548] mce: do_machine_check: CPU: 60
[ 183.840550] mce: do_machine_check: CPU: 24
[ 183.840557] mce: do_machine_check: CPU: 12
[ 183.840561] mce: do_machine_check: CPU: 45
[ 183.840565] mce: do_machine_check: CPU: 59
[ 183.840569] mce: do_machine_check: CPU: 57
[ 183.840572] mce: do_machine_check: CPU: 61
[ 183.840584] mce: do_machine_check: CPU: 0
[ 183.840587] mce: do_machine_check: CPU: 32
[ 183.840593] mce: do_machine_check: CPU: 63
[ 183.840596] mce: do_machine_check: CPU: 31
[ 183.840602] mce: do_machine_check: CPU: 42
[ 183.840606] mce: do_machine_check: CPU: 11
[ 183.840611] mce: do_machine_check: CPU: 41
[ 183.840613] mce: do_machine_check: CPU: 9
[ 183.840617] mce: do_machine_check: CPU: 62
[ 183.840619] mce: do_machine_check: CPU: 25
[ 183.840624] mce: do_machine_check: CPU: 58
[ 183.840627] mce: do_machine_check: CPU: 26
[ 183.840633] mce: do_machine_check: CPU: 5
[ 183.840638] mce: do_machine_check: CPU: 1
[ 183.840642] mce: do_machine_check: CPU: 37
[ 183.840648] mce: do_machine_check: CPU: 15
[ 183.840650] mce: do_machine_check: CPU: 47
[ 183.840653] mce: do_machine_check: CPU: 44
[ 183.840657] mce: do_machine_check: CPU: 14
[ 183.840659] mce: do_machine_check: CPU: 46
[ 183.840666] mce: do_machine_check: CPU: 52
[ 183.840670] mce: do_machine_check: CPU: 50
[ 183.840675] mce: do_machine_check: CPU: 48
[ 183.840677] mce: do_machine_check: CPU: 16
[ 183.840682] mce: do_machine_check: CPU: 54
[ 183.840686] mce: do_machine_check: CPU: 18
[ 183.840692] mce: do_machine_check: CPU: 40
[ 183.840695] mce: do_machine_check: CPU: 8
[ 183.840701] mce: do_machine_check: CPU: 2
[ 183.840705] mce: do_machine_check: CPU: 20
[ 183.840710] mce: do_machine_check: CPU: 13
[ 183.840712] mce: do_machine_check: CPU: 43
[ 183.840716] mce: do_machine_check: CPU: 10
[ 183.840722] mce: do_machine_check: CPU: 3
[ 183.840724] mce: do_machine_check: CPU: 35
[ 183.840727] mce: do_machine_check: CPU: 33
[ 183.840730] mce: do_machine_check: CPU: 34
[ 183.840734] mce: do_machine_check: CPU: 6
[ 183.840738] mce: do_machine_check: CPU: 38
[ 183.840743] mce: do_machine_check: CPU: 53
[ 183.840745] mce: do_machine_check: CPU: 21
[ 183.840750] mce: do_machine_check: CPU: 23
[ 183.840752] mce: do_machine_check: CPU: 55
[ 183.840755] mce: do_machine_check: CPU: 22
[ 183.840759] mce: do_machine_check: CPU: 49
[ 183.840761] mce: do_machine_check: CPU: 17
[ 183.840767] mce: do_machine_check: CPU: 19
[ 183.840770] mce: do_machine_check: CPU: 51
[ 183.840776] mce: do_machine_check: CPU: 39
[ 183.840778] mce: do_machine_check: CPU: 7
[ 183.840784] mce: do_machine_check: CPU: 36
[ 183.840786] mce: do_machine_check: CPU: 4
[ 184.485104] Disabling lock debugging due to kernel taint
[ 184.498006] mce: [Hardware Error]: CPU 32: Machine Check Exception: 5 Bank 5: be00000000010090
[ 184.498023] mce: [Hardware Error]: Machine check events logged
[ 184.531428] mce: [Hardware Error]: RIP !INEXACT! 10:<ffffffff8135de7f> {intel_idle+0xbf/0x130}
[ 184.551126] mce: [Hardware Error]: TSC c760ad064ccce ADDR bb68ec00 MISC 421c8c86
[ 184.568358] mce: [Hardware Error]: PROCESSOR 0:206d7 TIME 1449600598 SOCKET 0 APIC 1 microcode 710
[ 184.588862] EDAC sbridge MC0: HANDLING MCE MEMORY ERROR
...

mce: [Hardware Error]: CPU 0: Machine Check Exception: 5 Bank 5: be00000000010090
mce: [Hardware Error]: CPU 1: Machine Check Exception: 5 Bank 5: be00000000010090
mce: [Hardware Error]: CPU 2: Machine Check Exception: 5 Bank 5: be00000000010090
mce: [Hardware Error]: CPU 32: Machine Check Exception: 5 Bank 5: be00000000010090
mce: [Hardware Error]: CPU 33: Machine Check Exception: 5 Bank 5: be00000000010090
mce: [Hardware Error]: CPU 34: Machine Check Exception: 5 Bank 5: be00000000010090
mce: [Hardware Error]: CPU 35: Machine Check Exception: 5 Bank 5: be00000000010090
mce: [Hardware Error]: CPU 36: Machine Check Exception: 5 Bank 5: be00000000010090
mce: [Hardware Error]: CPU 37: Machine Check Exception: 5 Bank 5: be00000000010090
mce: [Hardware Error]: CPU 38: Machine Check Exception: 5 Bank 5: be00000000010090
mce: [Hardware Error]: CPU 39: Machine Check Exception: 5 Bank 5: be00000000010090
mce: [Hardware Error]: CPU 3: Machine Check Exception: 5 Bank 5: be00000000010090
mce: [Hardware Error]: CPU 4: Machine Check Exception: 5 Bank 5: be00000000010090
mce: [Hardware Error]: CPU 5: Machine Check Exception: 5 Bank 5: be00000000010090
mce: [Hardware Error]: CPU 6: Machine Check Exception: 5 Bank 5: be00000000010090
mce: [Hardware Error]: CPU 7: Machine Check Exception: 5 Bank 5: be00000000010090
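
For reference, the bank 5 status value every reporting CPU prints above can be decoded by hand. The following is only a rough sketch, assuming the architectural IA32_MCi_STATUS bit layout from the Intel SDM and the compound "memory controller error" code format (0000 0000 1MMM CCCC). The helper names are made up for illustration and are not kernel functions.

```python
# Architectural flag bits of IA32_MCi_STATUS (assumed layout per the Intel SDM).
STATUS_FLAGS = {
    "VAL": 63,    # register contents are valid
    "OVER": 62,   # an earlier error was overwritten
    "UC": 61,     # uncorrected error
    "EN": 60,     # error reporting was enabled
    "MISCV": 59,  # MISC register holds valid data
    "ADDRV": 58,  # ADDR register holds a valid address
    "PCC": 57,    # processor context corrupt
}


def decode_status(status):
    """Split a 64-bit status value into flags, MCA code and model-specific code."""
    flags = {name: bool((status >> bit) & 1) for name, bit in STATUS_FLAGS.items()}
    mca_code = status & 0xFFFF             # bits 15:0
    model_code = (status >> 16) & 0xFFFF   # bits 31:16
    return flags, mca_code, model_code


def decode_memory_controller_code(mca_code):
    """Decode the 0000 0000 1MMM CCCC form; return None for any other form."""
    if mca_code & 0xFF80 != 0x0080:
        return None
    return {
        "transaction": (mca_code >> 4) & 0x7,  # MMM field, 1 means memory read
        "channel": mca_code & 0xF,             # CCCC field
    }


# Hypothetical usage with the value from the log above.
flags, mca_code, model_code = decode_status(0xBE00000000010090)
memory_error = decode_memory_controller_code(mca_code)
```

With that layout, UC and PCC are both set, which fits the panic, and the low 16 bits (0x0090) fall into the memory controller format, which fits the EDAC sbridge line.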

CPUs:

[ 1.103200] x86: Booting SMP configuration:
[ 1.112441] .... node #0, CPUs: #1 #2 #3 #4 #5 #6 #7
[ 1.227835] .... node #1, CPUs: #8 #9 #10 #11 #12 #13 #14 #15
[ 1.451861] .... node #2, CPUs: #16 #17 #18 #19 #20 #21 #22 #23
[ 1.674819] .... node #3, CPUs: #24 #25 #26 #27 #28 #29 #30 #31
[ 1.899011] .... node #0, CPUs: #32 #33 #34 #35 #36 #37 #38 #39
[ 2.026616] .... node #1, CPUs: #40 #41 #42 #43 #44 #45 #46 #47
[ 2.152645] .... node #2, CPUs: #48 #49 #50 #51 #52 #53 #54 #55
[ 2.276782] .... node #3, CPUs: #56 #57 #58 #59 #60 #61 #62 #63
[ 2.402263] x86: Booted up 4 nodes, 64 CPUs
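
The cross-check between the two excerpts can also be done mechanically. This is a minimal sketch, not anything from the kernel tree: it parses the boot lines quoted above and compares node 0's CPUs with the CPUs that printed a Machine Check Exception line. It assumes the boot CPU (CPU 0) sits on node 0, since the boot log lists only the secondary CPUs. `cpus_by_node` is a made-up helper name.

```python
import re

# Boot lines copied from the log excerpt above.
BOOT_LOG = """\
[ 1.112441] .... node #0, CPUs: #1 #2 #3 #4 #5 #6 #7
[ 1.227835] .... node #1, CPUs: #8 #9 #10 #11 #12 #13 #14 #15
[ 1.451861] .... node #2, CPUs: #16 #17 #18 #19 #20 #21 #22 #23
[ 1.674819] .... node #3, CPUs: #24 #25 #26 #27 #28 #29 #30 #31
[ 1.899011] .... node #0, CPUs: #32 #33 #34 #35 #36 #37 #38 #39
[ 2.026616] .... node #1, CPUs: #40 #41 #42 #43 #44 #45 #46 #47
[ 2.152645] .... node #2, CPUs: #48 #49 #50 #51 #52 #53 #54 #55
[ 2.276782] .... node #3, CPUs: #56 #57 #58 #59 #60 #61 #62 #63
"""

# CPUs that printed "Machine Check Exception" in the excerpt above.
MCE_REPORTING_CPUS = [0, 1, 2, 32, 33, 34, 35, 36, 37, 38, 39, 3, 4, 5, 6, 7]


def cpus_by_node(boot_log):
    """Map each node number to the set of CPU numbers listed for it."""
    nodes = {}
    for line in boot_log.splitlines():
        match = re.search(r"node\s+#(\d+), CPUs:(.*)", line)
        if not match:
            continue
        node = int(match.group(1))
        cpus = {int(cpu) for cpu in re.findall(r"#(\d+)", match.group(2))}
        nodes.setdefault(node, set()).update(cpus)
    return nodes


nodes = cpus_by_node(BOOT_LOG)
# Assumption: the boot CPU is CPU 0 on node 0; it is not listed in the boot lines.
nodes[0].add(0)

reporting = set(MCE_REPORTING_CPUS)
print("reporting CPUs match node 0:", reporting == nodes[0])
```

Under that assumption the reporting set and node 0's CPU set come out identical, while do_machine_check() itself ran on all 64 CPUs.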

Ok, all clear.

Thanks!

--
Regards/Gruss,
Boris.

ECO tip #101: Trim your mails when you reply.