x86/microcode/intel: Division by zero panic in 4.9.79 and 4.4.114

From: Rolf Neugebauer
Date: Tue Feb 06 2018 - 09:09:46 EST


The backport of 7e702d17ed1 ("x86/microcode/intel: Extend BDW
late-loading further with LLC size check") to 4.9.79 and 4.4.14 causes
a division by zero panic on some single vCPU machine types on Google
Cloud (e.g. g1-small and n1-standard-1):

[ 1.591435] divide error: 0000 [#1] SMP
[ 1.591961] Modules linked in:
[ 1.592546] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 4.9.79-linuxkit #1
[ 1.593461] Hardware name: Google Google Compute Engine/Google
Compute Engine, BIOS Google 01/01/2011
[ 1.595135] task: ffffa01b2b63c040 task.stack: ffffb24f40010000
[ 1.596683] RIP: 0010:[<ffffffffb7d88ce3>] [<ffffffffb7d88ce3>]
init_intel_microcode+0x49/0x5a
[ 1.598018] RSP: 0000:ffffb24f40013e48 EFLAGS: 00010206
[ 1.599016] RAX: 0000000001400000 RBX: 00000000ffffffea RCX: 0000000000000000
[ 1.600093] RDX: 0000000000000000 RSI: ffffa01b2fffa9a8 RDI: ffffffffb7d8888b
[ 1.601222] RBP: 00000000ffffffff R08: ffffa01b2fffa9a1 R09: 000000000000005f
[ 1.602330] R10: 0000000000000067 R11: ffffa01b22164177 R12: 0000000000000000
[ 1.603291] R13: ffffffffb7d787b6 R14: 0000000000000000 R15: 0000000000000000
[ 1.604406] FS: 0000000000000000(0000) GS:ffffa01b2fc00000(0000)
knlGS:0000000000000000
[ 1.606154] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1.607493] CR2: 0000000000000000 CR3: 000000002ac0a000 CR4: 0000000000040630
[ 1.609310] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 1.611122] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 1.613008] Stack:
[ 1.613445] ffffffffb7d888c1 0000000000000000 ffffffffb7d1ba40
0000000000000000
[ 1.615423] 0000000000000000 ffffffffb7d787b6 0000000000000000
0000000000000000
[ 1.617049] ffffffffb74fed89 00000000fffffff4 00000000ffffffff
4eae7bc133547c47
[ 1.617904] Call Trace:
[ 1.618198] [<ffffffffb7d888c1>] ? microcode_init+0x36/0x1de
[ 1.619264] [<ffffffffb7d787b6>] ? set_debug_rodata+0xc/0xc
[ 1.620143] [<ffffffffb74fed89>] ? driver_register+0xaf/0xb4
[ 1.621841] [<ffffffffb7d8888b>] ? save_microcode_in_initrd+0x3c/0x3c
[ 1.623424] [<ffffffffb70021b7>] ? do_one_initcall+0x98/0x130
[ 1.624659] [<ffffffffb7d787b6>] ? set_debug_rodata+0xc/0xc
[ 1.626303] [<ffffffffb7d79091>] ? kernel_init_freeable+0x166/0x1e7
[ 1.627674] [<ffffffffb77ce3d7>] ? rest_init+0x6e/0x6e
[ 1.628942] [<ffffffffb77ce3e1>] ? kernel_init+0xa/0xe6
[ 1.630243] [<ffffffffb77d8587>] ? ret_from_fork+0x57/0x70
[ 1.631592] Code: 16 0f b6 35 a0 ac fb ff 48 c7 c7 c4 ea 99 b7 e8
82 3e 41 ff 31 c0 c3 8b 05 3b ad fb ff 0f b7 0d 54 ad fb ff 31 d2 c1
e0 0a 48 98 <48> f7 f1 89 05 f4 6a 14 00 48 c7 c0 00 d8 c2 b7 c3 4c 8d
54 24
[ 1.638875] RIP [<ffffffffb7d88ce3>] init_intel_microcode+0x49/0x5a
[ 1.640382] RSP <ffffb24f40013e48>
[ 1.641320] ---[ end trace e88ef332e19594b9 ]---
[ 1.642418] Kernel panic - not syncing: Fatal exception
[ 1.644581] Kernel Offset: 0x36000000 from 0xffffffff81000000
(relocation range: 0xffffffff80000000-0xffffffffbfffffff)
[ 1.647312] ---[ end Kernel panic - not syncing: Fatal exception

On these machines x86_max_cores is zero:

[ 0.227977] ftrace: allocating 35563 entries in 139 pages
[ 0.273941] smpboot: x86_max_cores == zero !?!?
[ 0.274645] smpboot: Max logical packages: 1


4.9.78 (and earlier) as well as 4.4.13 (and earlier) worked fine so
this seems like a regression in stable.

4.14.16 and 4.15 (to which the above patch was backported as well)
work fine, but I noticed significant code changes for these kernels
around where x86_max_cores is set.

The obvious quick fix/hack is to check for x86_max_cores in
calc_llc_size_per_core() and return 0. I'm happy to submit a patch for
this, but a more correct fix might be back porting the relevant
changes around where x86_max_cores is set.

Thanks
Rolf