Re: 3.16.49 Oops, does not boot on two socket server

From: Holger Kiehl
Date: Tue Dec 12 2017 - 10:47:10 EST


Hello,

I just want to give a follow-up. I have tested this with 3.16.51 and the
problem still exists. It seems the 3.16.x tree is no longer usable
for two-socket servers :-(

Regards,
Holger

PS: Here is the panic with 3.16.51:

smpboot: Total of 24 processors activated (95963.71 BogoMIPS)
------------[ cut here ]------------
WARNING: CPU: 0 PID: 1 at kernel/sched/core.c:5811 init_overlap_sched_group+0x114/0x120()
Modules linked in:
CPU: 0 PID: 1 Comm: swapper/0 Not tainted 3.16.51-1.el6.x86_64 #1
Hardware name: HP ProLiant DL380p Gen8, BIOS P70 08/02/2014
0000000000000000 ffff880fe96c7da8 ffffffff815432dc 0000000000000000
00000000000016b3 ffff880fe96c7de8 ffffffff8104cc72 ffff880fff803c00
ffff880fe8d05650 ffff881fe96ba3a8 ffff880fe96af540 0000000000000000
Call Trace:
[<ffffffff815432dc>] dump_stack+0x4e/0x6a
[<ffffffff8104cc72>] warn_slowpath_common+0x82/0xb0
[<ffffffff8104ccb5>] warn_slowpath_null+0x15/0x20
[<ffffffff810799c4>] init_overlap_sched_group+0x114/0x120
[<ffffffff81079b04>] build_overlap_sched_groups+0x134/0x1e0
[<ffffffff8107a049>] build_sched_domains+0x159/0x330
[<ffffffff817c2b45>] sched_init_smp+0x65/0xf8
[<ffffffff817abb12>] kernel_init_freeable+0xb2/0x12d
[<ffffffff81541770>] ? rest_init+0x80/0x80
[<ffffffff81541779>] kernel_init+0x9/0xf0
[<ffffffff81547688>] ret_from_fork+0x58/0x90
[<ffffffff81541770>] ? rest_init+0x80/0x80
---[ end trace 207206398bdf8ddb ]---
BUG: unable to handle kernel paging request at 0000010000024a7f
IP: [<ffffffff8107995e>] init_overlap_sched_group+0xae/0x120
PGD 0
Oops: 0000 [#1] SMP
Modules linked in:
CPU: 0 PID: 1 Comm: swapper/0 Tainted: G W 3.16.51-1.el6.x86_64 #1
Hardware name: HP ProLiant DL380p Gen8, BIOS P70 08/02/2014
task: ffff880fe96d0000 ti: ffff880fe96c4000 task.ti: ffff880fe96c4000
RIP: 0010:[<ffffffff8107995e>] [<ffffffff8107995e>] init_overlap_sched_group+0xae/0x120
RSP: 0000:ffff880fe96c7e08 EFLAGS: 00010246
RAX: 000001000000ffff RBX: ffff880fe8d05650 RCX: 0000000000000020
RDX: 0000000000014a80 RSI: 0000000000000020 RDI: 0000000000000020
RBP: ffff880fe96c7e28 R08: ffff880fe96af558 R09: 0000000000000000
R10: 0000000000000002 R11: 0000000000000001 R12: ffff881fe96ba3a8
R13: ffff880fe96af540 R14: 0000000000000000 R15: ffff881fe96ba3a8
FS: 0000000000000000(0000) GS:ffff880fffc00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000010000024a7f CR3: 0000000001714000 CR4: 00000000000407f0
Stack:
0000000000000000 0000000000000000 0000000000000000 ffff880fe8d05650
ffff880fe96c7ea8 ffffffff81079b04 0000000000000011 ffff880fe96af540
0000000000000000 0000000000000000 000000000000cd68 0000000000000000
Call Trace:
[<ffffffff81079b04>] build_overlap_sched_groups+0x134/0x1e0
[<ffffffff8107a049>] build_sched_domains+0x159/0x330
[<ffffffff817c2b45>] sched_init_smp+0x65/0xf8
[<ffffffff817abb12>] kernel_init_freeable+0xb2/0x12d
[<ffffffff81541770>] ? rest_init+0x80/0x80
[<ffffffff81541779>] kernel_init+0x9/0xf0
[<ffffffff81547688>] ret_from_fork+0x58/0x90
[<ffffffff81541770>] ? rest_init+0x80/0x80
Code: 60 83 00 85 c0 74 70 49 8d 75 18 48 c7 c2 38 f9 8a 81 bf ff ff ff ff e8 31 f9 1f 00 49 8b 54 24 10 48 98 48 8b 04 c5 a0 fc 78 81 <48> 8b 14 10 b8 01 00 00 00 49 89 55 10 f0 0f c1 02 85 c0 75 0f
RIP [<ffffffff8107995e>] init_overlap_sched_group+0xae/0x120
RSP <ffff880fe96c7e08>
CR2: 0000010000024a7f
---[ end trace 207206398bdf8ddc ]---
Kernel panic - not syncing: Attempted to kill init! exitcode=0x00000009
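
A note on reading the oops: the bytes marked <48> 8b 14 10 in the Code:
line decode to mov (%rax,%rdx,1),%rdx, so the faulting address should be
RAX + RDX. The small sketch below only checks that arithmetic against the
CR2 values of both reports (3.16.51 above, 3.16.49 in the quoted mail).
The helper name effective_address is made up for illustration; the
register values are copied from the logs.

```python
# Check that CR2 equals RAX + RDX for the faulting instruction
#   48 8b 14 10   mov (%rax,%rdx,1),%rdx
# Register values are copied verbatim from the two oops reports.

MASK64 = 0xFFFFFFFFFFFFFFFF


def effective_address(base: int, index: int, scale: int = 1, disp: int = 0) -> int:
    """x86-64 effective address: base + index*scale + disp, wrapped to 64 bits."""
    return (base + index * scale + disp) & MASK64


# (kernel version, RAX, RDX, CR2)
REPORTS = [
    ("3.16.51", 0x000001000000FFFF, 0x0000000000014A80, 0x0000010000024A7F),
    ("3.16.49", 0x000001000000FFFF, 0x00000000000147C0, 0x00000100000247BF),
]

for version, rax, rdx, cr2 in REPORTS:
    addr = effective_address(rax, rdx)
    status = "matches CR2" if addr == cr2 else "does NOT match CR2"
    print(f"{version}: RAX + RDX = {addr:#018x} ({status})")
```

In both cases RAX holds the same value, 0x000001000000ffff, which is not a
canonical kernel address. That suggests the base pointer itself is bad
rather than the offset in RDX, but this is only a reading of the
registers, not a confirmed diagnosis.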


On Wed, 18 Oct 2017, Holger Kiehl wrote:

> Hello,
>
> just tried to boot 3.16.49 on a 2 socket server and it fails with the
> following error:
>
> smpboot: Total of 24 processors activated (95818.36 BogoMIPS)
> ------------[ cut here ]------------
> WARNING: CPU: 0 PID: 1 at kernel/sched/core.c:5811 init_overlap_sched_group+0x114/0x120()
> Modules linked in:
> CPU: 0 PID: 1 Comm: swapper/0 Not tainted 3.16.49-1.el6.x86_64 #1
> Hardware name: HP ProLiant DL380p Gen8, BIOS P70 08/02/2014
> 0000000000000000 ffff880bfd6d3da8 ffffffff81542f1c 0000000000000000
> 00000000000016b3 ffff880bfd6d3de8 ffffffff8104cd72 ffff880c0f803c00
> ffff880bfcc69650 ffff8817fd695ca8 ffff880bfd6e2300 0000000000000000
> Call Trace:
> [<ffffffff81542f1c>] dump_stack+0x4e/0x6a
> [<ffffffff8104cd72>] warn_slowpath_common+0x82/0xb0
> [<ffffffff8104cdb5>] warn_slowpath_null+0x15/0x20
> [<ffffffff81079834>] init_overlap_sched_group+0x114/0x120
> [<ffffffff81079974>] build_overlap_sched_groups+0x134/0x1e0
> [<ffffffff8107a169>] build_sched_domains+0x159/0x330
> [<ffffffff817c2b3c>] sched_init_smp+0x65/0xf8
> [<ffffffff817abb12>] kernel_init_freeable+0xb2/0x12d
> [<ffffffff81541400>] ? rest_init+0x80/0x80
> [<ffffffff81541409>] kernel_init+0x9/0xf0
> [<ffffffff81547248>] ret_from_fork+0x58/0x90
> [<ffffffff81541400>] ? rest_init+0x80/0x80
> ---[ end trace a491a27c866dd06e ]---
> BUG: unable to handle kernel paging request at 00000100000247bf
> IP: [<ffffffff810797ce>] init_overlap_sched_group+0xae/0x120
> PGD 0
> Oops: 0000 [#1] SMP
> Modules linked in:
> CPU: 0 PID: 1 Comm: swapper/0 Tainted: G W 3.16.49-1.el6.x86_64 #1
> Hardware name: HP ProLiant DL380p Gen8, BIOS P70 08/02/2014
> task: ffff8817fd6a8000 ti: ffff880bfd6d0000 task.ti: ffff880bfd6d0000
> RIP: 0010:[<ffffffff810797ce>] [<ffffffff810797ce>] init_overlap_sched_group+0xae/0x120
> RSP: 0000:ffff880bfd6d3e08 EFLAGS: 00010246
> RAX: 000001000000ffff RBX: ffff880bfcc69650 RCX: 0000000000000020
> RDX: 00000000000147c0 RSI: 0000000000000020 RDI: 0000000000000020
> RBP: ffff880bfd6d3e28 R08: ffff880bfd6e2318 R09: 0000000000000000
> R10: 0000000000000002 R11: 0000000000000001 R12: ffff8817fd695ca8
> R13: ffff880bfd6e2300 R14: 0000000000000000 R15: ffff8817fd695ca8
> FS: 0000000000000000(0000) GS:ffff880c0fc00000(0000) knlGS:0000000000000000
> CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 00000100000247bf CR3: 0000001714000 CR4: 00000000000407f0
> Stack:
> 0000000000000000 0000000000000000 0000000000000000 ffff880bfcc69650
> ffff880bfd6d3ea8 ffffffff81079974 0000000000000011 ffff880bfd6e2300
> 0000000000000000 0000000000000000 000000000000cac8 0000000000000000
> Call Trace:
> [<ffffffff81079974>] build_overlap_sched_groups+0x134/0x1e0
> [<ffffffff8107a169>] build_sched_domains+0x159/0x330
> [<ffffffff817c2b3c>] sched_init_smp+0x65/0xf8
> [<ffffffff817abb12>] kernel_init_freeable+0xb2/0x12d
> [<ffffffff81541400>] ? rest_init+0x80/0x80
> [<ffffffff81541409>] kernel_init+0x9/0xf0
> [<ffffffff81547248>] ret_from_fork+0x58/0x90
> [<ffffffff81541400>] ? rest_init+0x80/0x80
> Code: 61 83 00 85 c0 74 70 49 8d 75 18 48 c7 c2 38 f9 8a 81 bf ff ff ff ff e8 51 fa 1f 00 49 8b 54 24 10 48 98 48 8b 04 c5 a0 fc 78 81 <48> 8b 14 10 b8 01 00 00 00 49 89 55 10 f0 0f c1 02 85 c0 75 0f
> RIP [<ffffffff810797ce>] init_overlap_sched_group+0xae/0x120
> RSP <ffff880bfd6d3e08>
> CR2: 00000100000247bf
> ---[ end trace a491a27c866dd06f ]---
> Kernel panic - not syncing: Attempted to kill init! exitcode=0x00000009
>
> Rebooting in 5 seconds..
>
> This happened on three different systems. On a similar system with just
> one CPU socket populated it boots fine. The last kernel of this series I
> tried was 3.16.48 and that worked fine.
>
> Any idea what is wrong? In case it is useful I have attached my kernel
> config.
>
> Regards,
> Holger