Re: ipmi_msghandler crashes in 4.19

From: Greg KH
Date: Tue Jan 29 2019 - 05:29:05 EST


On Tue, Jan 15, 2019 at 10:36:42AM -0800, Ivan Babrou wrote:
> Hey,
>
> We've upgraded some machines from 4.14 to 4.19 and started seeing rare
> crashes like these:
>
> [75855.909507] BUG: unable to handle kernel NULL pointer dereference
> at 0000000000000d00
> [75855.925667] PGD 0 P4D 0
> [75855.936359] Oops: 0000 [#1] SMP PTI
> [75855.947951] CPU: 0 PID: 10 Comm: ksoftirqd/0 Tainted: G O
> 4.19.13-cloudflare-2019.1.4 #2019.1.4
> [75855.966028] Hardware name: Quanta Cloud Technology Inc. QuantaPlex
> T42S-2U(LBG-4) -/T42S-2U MB (Lewisburg-4), BIOS 3A11.Q10 06/29/2018
> [75855.994246] RIP: 0010:__srcu_read_unlock+0xe/0x20
> [75856.006851] Code: 01 48 63 c8 65 48 ff 04 ca f0 83 44 24 fc 00 c3
> 66 90 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 f0 83 44 24 fc 00
> 48 63 f6 <48> 8b 87 e8 0c 00 00 65 48 ff 44 f0 10 c3 0f 1f 40 00 0f 1f
> 44 00
> [75856.041551] RSP: 0018:ffffba00cc66fd48 EFLAGS: 00010286
> [75856.054564] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
> [75856.069449] RDX: 0000000000000001 RSI: 0000000000000000 RDI: 0000000000000018
> [75856.084168] RBP: ffffa28276abb200 R08: ffffa29119772540 R09: 0000000000000000
> [75856.098756] R10: 00000000000c1425 R11: ffffa29120a201c8 R12: ffffa29118d57e08
> [75856.113422] R13: dead000000000200 R14: dead000000000100 R15: ffffa27dcbafa400
> [75856.127798] FS: 0000000000000000(0000) GS:ffffa29120a00000(0000)
> knlGS:0000000000000000
> [75856.138973] perf: interrupt took too long (7735 > 7677), lowering
> kernel.perf_event_max_sample_rate to 25000
> [75856.143083] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [75856.172956] CR2: 0000000000000d00 CR3: 000000187ca0a005 CR4: 00000000007606f0
> [75856.187116] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [75856.201312] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> [75856.215274] PKRU: 55555554
> [75856.224621] Call Trace:
> [75856.230942] perf: interrupt took too long (9748 > 9668), lowering
> kernel.perf_event_max_sample_rate to 20000
> [75856.233560] deliver_response+0x88/0xd0 [ipmi_msghandler]
> [75856.261744] deliver_local_response+0xe/0x30 [ipmi_msghandler]
> [75856.273937] handle_one_recv_msg+0x164/0xbf0 [ipmi_msghandler]
> [75856.285962] ? __switch_to_asm+0x34/0x70
> [75856.295957] ? __switch_to_asm+0x40/0x70
> [75856.306011] ? __switch_to_asm+0x34/0x70
> [75856.315872] ? __switch_to_asm+0x40/0x70
> [75856.325562] ? __switch_to_asm+0x34/0x70
> [75856.325565] ? __switch_to_asm+0x40/0x70
> [75856.325567] ? __switch_to_asm+0x34/0x70
> [75856.325569] ? __switch_to_asm+0x40/0x70
> [75856.325578] handle_new_recv_msgs+0x16d/0x1e0 [ipmi_msghandler]
> [75856.325583] ? __switch_to_asm+0x34/0x70
> [75856.381815] tasklet_action_common.isra.21+0x4e/0xf0
> [75856.381823] __do_softirq+0xd8/0x2d2
> [75856.399498] ? sort_range+0x20/0x20
> [75856.399506] run_ksoftirqd+0x1a/0x20
> [75856.415184] smpboot_thread_fn+0xc5/0x160
> [75856.415190] kthread+0x113/0x130
> [75856.430502] ? kthread_create_worker_on_cpu+0x70/0x70
> [75856.430512] ret_from_fork+0x35/0x40
> [75856.446793] Modules linked in: xt_connlimit nf_conncount xt_bpf
> xt_hashlimit cls_flow cls_u32 sch_htb sch_fq md_mod dm_crypt
> algif_skcipher af_alg dm_mod dax ip6table_nat nf_nat_ipv6
> ip6table_mangle ip6table_security ip6table_raw ip6table_filter
> ip6_tables xt_nat iptable_nat nf_nat_ipv4 nf_nat xt_TPROXY
> nf_tproxy_ipv6 nf_tproxy_ipv4 xt_connmark iptable_mangle xt_owner
> xt_CT xt_socket nf_socket_ipv4 nf_socket_ipv6 iptable_raw
> nfnetlink_log xt_NFLOG xt_tcpudp xt_comment xt_conntrack nf_conntrack
> nf_defrag_ipv6 nf_defrag_ipv4 xt_mark xt_multiport xt_set
> iptable_filter bpfilter ip_set_hash_netport ip_set_hash_net
> ip_set_hash_ip ip_set nfnetlink 8021q garp mrp stp llc skx_edac
> x86_pkg_temp_thermal kvm_intel kvm irqbypass crc32_pclmul crc32c_intel
> ipmi_ssif pcbc aesni_intel aes_x86_64 crypto_simd sfc(O)
> [75856.446862] cryptd glue_helper mdio ipmi_si xhci_pci i40e tpm_crb
> ioatdma ipmi_devintf xhci_hcd dca ipmi_msghandler tpm_tis tpm_tis_core
> tpm efivarfs ip_tables x_tables
> [75856.569103] CR2: 0000000000000d00
> [75856.569124] ---[ end trace 604e13a0789ee766 ]---
>
> [117620.868720] general protection fault: 0000 [#1] SMP PTI
> [117620.911871] CPU: 0 PID: 10 Comm: ksoftirqd/0 Tainted: G
> O 4.19.0-cloudflare-2018.10.3 #1
> [117620.937885] Hardware name: Quanta Computer Inc QuantaPlex
> T41S-2U/S2S-MB, BIOS S2S_3B10.03 06/21/2018
> [117620.963750] RIP: 0010:__srcu_read_unlock+0xe/0x20
> [117620.984950] Code: 01 48 63 c8 65 48 ff 04 ca f0 83 44 24 fc 00 c3
> 66 90 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 f0 83 44 24 fc 00
> 48 63 f6 <48> 8b 87 e8 0c 00 00 65 48 ff 44 f0 10 c3 0f 1f 40
> 00 0f 1f 44 00
> [117621.020240] perf: interrupt took too long (10250 > 10230),
> lowering kernel.perf_event_max_sample_rate to 19000
> [117621.036578] RSP: 0018:ffff89007f603e38 EFLAGS: 00010286
> [117621.073528] perf: interrupt took too long (12979 > 12812),
> lowering kernel.perf_event_max_sample_rate to 15000
> [117621.084232] RAX: 0000000000000000 RBX: 0000000000000000 RCX:
> 0000000000000000
> [117621.133897] RDX: 0000000000000001 RSI: 0000000000000000 RDI:
> 403a080083ad0878
> [117621.156877] RBP: ffff890d90a78e00 R08: 0000000000000002 R09:
> 0000000000020900
> [117621.179507] R10: 0000eb0270fbf3f0 R11: ffff89007f603ca4 R12:
> ffff89107b411e08
> [117621.179509] R13: dead000000000200 R14: dead000000000100 R15:
> ffff890a9b3e6800
> [117621.179511] FS: 0000000000000000(0000) GS:ffff89007f600000(0000)
> knlGS:0000000000000000
> [117621.179513] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [117621.179514] CR2: 00007f193f3095e0 CR3: 0000001f79e0a001 CR4:
> 00000000003606f0
> [117621.179526] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
> 0000000000000000
> [117621.179527] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7:
> 0000000000000400
> [117621.179529] Call Trace:
> [117621.179532] <IRQ>
> [117621.179552] deliver_response+0x88/0xd0 [ipmi_msghandler]
> [117621.179557] deliver_local_response+0xe/0x30 [ipmi_msghandler]
> [117621.179561] handle_one_recv_msg+0x164/0xbf0 [ipmi_msghandler]
> [117621.179568] ? try_to_wake_up+0x54/0x470
> [117621.179575] ? ipmi_si_platform_shutdown+0x20/0x20 [ipmi_si]
> [117621.236448] perf: interrupt took too long (16285 > 16223),
> lowering kernel.perf_event_max_sample_rate to 12000
> [117621.247534] ? kcs_event+0x17d/0x730 [ipmi_si]
> [117621.426069] perf: interrupt took too long (20619 > 20356),
> lowering kernel.perf_event_max_sample_rate to 9000
> [117621.437773] handle_new_recv_msgs+0x16d/0x1e0 [ipmi_msghandler]
> [117621.535276] tasklet_action_common.isra.21+0x4e/0xf0
> [117621.535284] __do_softirq+0xd8/0x2d2
> [117621.567383] irq_exit+0xb4/0xc0
> [117621.567387] smp_apic_timer_interrupt+0x74/0x140
> [117621.567390] apic_timer_interrupt+0xf/0x20
> [117621.567392] </IRQ>
> [117621.567397] RIP: 0010:finish_task_switch+0x78/0x260
> [117621.567399] Code: 65 48 8b 1c 25 00 4d 01 00 0f 1f 44 00 00 0f 1f
> 44 00 00 41 c7 46 38 00 00 00 00 41 c6 04 24 00 fb 65 48 8b 04 25 00
> 4d 01 00 <0f> 1f 44 00 00 4d 85 ed 74 1a 41 8b 85 80 03 00 00

This should all be fixed in the latest 4.19.y release, right?

thanks,

greg k-h