Mellanox Technologies MT23108 causes #MC exceptions under heavy load
From: Maxim Levitsky
Date: Thu Mar 05 2015 - 22:36:10 EST
We are running a CPU- and network-heavy test on the marmot.pdl.cmu.edu cluster.
It has a Mellanox Technologies MT23108 InfiniHost controller.
When we start using it for network communication, after just a few
minutes some of the nodes of the cluster die
with the following machine check exception.
I repeated this test over Ethernet a few times and have not had a single
failure so far (I thought I had one, but it turned out to be another,
unrelated issue).
It has already happened on most nodes of this 128-node cluster, so I
expect this to be a kernel bug.
Do you have any pointers on what we could try?
I compiled and tested the current HEAD of the vanilla kernel
(99aedde0869ce194539166ac5a4d2e1a20995348, 4.0.0-rc2),
but this happens even on 2.6.38 (which was in one of
their stock kernel images).
Best regards,
Maxim Levitsky
The kernel log of the failure, captured via serial console:
[ 297.575167] ib0: can't use GFP_NOIO for QPs on device mthca0, using
GFP_KERNEL
[ 564.704428] ib0: can't use GFP_NOIO for QPs on device mthca0, using
GFP_KERNEL
[ 951.619320] ib0: can't use GFP_NOIO for QPs on device mthca0, using
GFP_KERNEL
[ 956.790789] ib0: can't use GFP_NOIO for QPs on device mthca0, using
GFP_KERNEL
[ 957.301036] ib0: can't use GFP_NOIO for QPs on device mthca0, using
GFP_KERNEL
[ 957.333938] ib0: can't use GFP_NOIO for QPs on device mthca0, using
GFP_KERNEL
[ 957.924656] ib0: can't use GFP_NOIO for QPs on device mthca0, using
GFP_KERNEL
[ 958.125879] ib0: can't use GFP_NOIO for QPs on device mthca0, using
GFP_KERNEL
[ 958.147588] ib0: can't use GFP_NOIO for QPs on device mthca0, using
GFP_KERNEL
[ 958.485607] ib0: can't use GFP_NOIO for QPs on device mthca0, using
GFP_KERNEL
[ 959.050155] ib0: can't use GFP_NOIO for QPs on device mthca0, using
GFP_KERNEL
[ 959.120109] ib0: can't use GFP_NOIO for QPs on device mthca0, using
GFP_KERNEL
[ 960.048666] ib0: can't use GFP_NOIO for QPs on device mthca0, using
GFP_KERNEL
[ 960.110928] ib0: can't use GFP_NOIO for QPs on device mthca0, using
GFP_KERNEL
[ 960.754363] ib0: can't use GFP_NOIO for QPs on device mthca0, using
GFP_KERNEL
[ 961.390093] ib0: can't use GFP_NOIO for QPs on device mthca0, using
GFP_KERNEL
[ 972.199782] ib0: can't use GFP_NOIO for QPs on device mthca0, using
GFP_KERNEL
[ 972.496511] ib0: can't use GFP_NOIO for QPs on device mthca0, using
GFP_KERNEL
[ 983.078444] ib0: can't use GFP_NOIO for QPs on device mthca0, using
GFP_KERNEL
[ 983.618178] ib0: can't use GFP_NOIO for QPs on device mthca0, using
GFP_KERNEL
[ 991.365565] ib0: can't use GFP_NOIO for QPs on device mthca0, using
GFP_KERNEL
[ 1003.344498] ib0: can't use GFP_NOIO for QPs on device mthca0, using
GFP_KERNEL
[ 1013.748036] Disabling lock debugging due to kernel taint
[ 1013.747903] [Hardware Error]: System Fatal error.
[ 1013.747903] [Hardware Error]: CPU:0 (f:5:1)
MC4_STATUS[-|UE|-|PCC|-]: 0xb200000000070f0f
[ 1013.747903] [Hardware Error]: MC4 Error (node 0): Watchdog timeout
due to lack of progress.
[ 1013.747903] [Hardware Error]: cache level: L3/GEN, mem/io: GEN,
mem-tx: GEN, part-proc: GEN (timed out)
[ 1013.747903] mce: [Hardware Error]: CPU 0: Machine Check Exception:
4 Bank 4: b200000000070f0f
[ 1013.747903] mce: [Hardware Error]: TSC 1a2dcecb6b8
[ 1013.747903] mce: [Hardware Error]: PROCESSOR 2:f51 TIME 1425610753
SOCKET 0 APIC 0 microcode 0
[ 1013.747903] [Hardware Error]: System Fatal error.
[ 1013.747903] [Hardware Error]: CPU:0 (f:5:1)
MC4_STATUS[-|UE|-|PCC|-]: 0xb200000000070f0f
[ 1013.747903] [Hardware Error]: MC4 Error (node 0): Watchdog timeout
due to lack of progress.
[ 1013.747903] [Hardware Error]: cache level: L3/GEN, mem/io: GEN,
mem-tx: GEN, part-proc: GEN (timed out)
[ 1013.747903] mce: [Hardware Error]: Machine check: Processor context corrupt
[ 1013.747903] Kernel panic - not syncing: Fatal machine check on current CPU
[ 1013.748036] [Hardware Error]: System Fatal error.
[ 1013.748036] [Hardware Error]: CPU:1 (f:5:1)
MC4_STATUS[-|UE|-|PCC|-]: 0xb200000000070f0f
[ 1013.748036] [Hardware Error]: MC4 Error (node 1): Watchdog timeout
due to lack of progress.
[ 1013.748036] [Hardware Error]: cache level: L3/GEN, mem/io: GEN,
mem-tx: GEN, part-proc: GEN (timed out)
[ 1013.747903] Kernel Offset: disabled
[ 1013.747903] ---[ end Kernel panic - not syncing: Fatal machine
check on current CPU
[ 1019.239423] ------------[ cut here ]------------
[ 1019.244144] WARNING: CPU: 0 PID: 13875 at arch/x86/kernel/smp.c:124
native_smp_send_reschedule+0x5f/0x70()
[ 1019.249416] Modules linked in: ib_ipoib ib_cm ib_sa nfsv2 nfs lockd
sunrpc grace i2c_piix4 ib_mthca ib_mad ib_core ib_addr shpchp
amd64_edac_mod i2c_amd756 k8temp amd_rng edac_core edac_mce_amd tg3
ptp pps_core sata_promise pata_amd
[ 1019.249416] CPU: 0 PID: 13875 Comm: java Tainted: G M
4.0.0-rc2+ #1
[ 1019.249416] Hardware name: RIOWORKS HDAMA/HDAMA, BIOS V2.17 03/20/2006
[ 1019.249416] 000000000000007c ffff8801f8409a80 ffffffff815f33ff
000000000000007c
[ 1019.249416] 0000000000000000 ffff8801f8409ac0 ffffffff81055c97
ffff8801f8413d28
[ 1019.249416] ffff8803ffc13cc0 0000000000000001 ffff8801f8413cc0
0000000000000000
[ 1019.249416] Call Trace:
[ 1019.249416] <#MC> [<ffffffff815f33ff>] dump_stack+0x48/0x61
[ 1019.249416] [<ffffffff81055c97>] warn_slowpath_common+0x97/0xe0
[ 1019.249416] [<ffffffff81055cfa>] warn_slowpath_null+0x1a/0x20
[ 1019.249416] [<ffffffff81032aef>] native_smp_send_reschedule+0x5f/0x70
[ 1019.249416] [<ffffffff8108a24a>] trigger_load_balance+0x15a/0x200
[ 1019.249416] [<ffffffff8107e038>] scheduler_tick+0x88/0xa0
[ 1019.249416] [<ffffffff810ac3d1>] update_process_times+0x51/0x70
[ 1019.249416] [<ffffffff810bb7f0>] tick_sched_handle.clone.11+0x30/0x70
[ 1019.249416] [<ffffffff810bb92f>] tick_sched_timer+0x4f/0x90
[ 1019.249416] [<ffffffff810acbdc>] __run_hrtimer+0x6c/0x1b0
[ 1019.249416] [<ffffffff810bb8e0>] ? tick_nohz_handler+0xb0/0xb0
[ 1019.249416] [<ffffffff810ad393>] hrtimer_interrupt+0xe3/0x200
[ 1019.249416] [<ffffffff81035179>] local_apic_timer_interrupt+0x39/0x60
[ 1019.249416] [<ffffffff815fa355>] smp_apic_timer_interrupt+0x45/0x60
[ 1019.249416] [<ffffffff815f892a>] apic_timer_interrupt+0x6a/0x70
[ 1019.249416] [<ffffffff815f3170>] ? panic+0x1b9/0x1fb
[ 1019.249416] [<ffffffff815f316c>] ? panic+0x1b5/0x1fb
[ 1019.249416] [<ffffffff815f31f8>] ? printk+0x46/0x48
[ 1019.249416] [<ffffffff810295cf>] mce_panic+0x24f/0x270
[ 1019.249416] [<ffffffff8102a687>] do_machine_check+0x767/0xa60
[ 1019.249416] [<ffffffff815f95d6>] machine_check+0x26/0x50
[ 1019.249416] [<ffffffffa000b2c5>] ? pdc_interrupt+0x2d5/0x430 [sata_promise]
[ 1019.249416] <<EOE>> <IRQ> [<ffffffff8109d1a4>]
handle_irq_event_percpu+0x54/0x1a0
[ 1019.249416] [<ffffffff8109d332>] handle_irq_event+0x42/0x70
[ 1019.249416] [<ffffffff8109fcd9>] handle_fasteoi_irq+0x79/0x130
[ 1019.249416] [<ffffffff81006222>] handle_irq+0x22/0x40
[ 1019.249416] [<ffffffff815fa25c>] do_IRQ+0x5c/0x110
[ 1019.249416] [<ffffffff815f85ea>] common_interrupt+0x6a/0x6a
[ 1019.249416] <EOI> [<ffffffff811d3f57>] ? fsnotify+0xc7/0x340
[ 1019.249416] [<ffffffff811d40e4>] ? fsnotify+0x254/0x340
[ 1019.249416] [<ffffffff811968cf>] vfs_write+0x12f/0x1d0
[ 1019.249416] [<ffffffff81196c16>] SyS_write+0x56/0xd0
[ 1019.249416] [<ffffffff811da81e>] ? SyS_epoll_wait+0xbe/0xe0
[ 1019.249416] [<ffffffff815f7b32>] system_call_fastpath+0x12/0x17
[ 1019.249416] ---[ end trace 3ba0c941409cb2fb ]---
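
For anyone who wants to cross-check the MC4_STATUS value in the log by hand, here is a rough sketch of a decoder. It assumes the usual x86 MCi_STATUS layout (VAL=63, OVER=62, UC=61, EN=60, MISCV=59, ADDRV=58, PCC=57), a 4-bit extended error code in bits 19:16 as on K8, and the generic bus error code format 0000 1PPT RRRR IILL in the low 16 bits. The function name decode_mc_status and the string tables are my own, chosen to resemble what the kernel prints; this is not the kernel's decoder.

```python
# Sketch of an MCi_STATUS decoder. The bit positions and tables below are
# assumptions based on the common x86 machine-check layout, not kernel code.
STATUS_FLAGS = {63: "VAL", 62: "OVER", 61: "UC", 60: "EN",
                59: "MISCV", 58: "ADDRV", 57: "PCC"}

PP = ["SRC", "RES", "OBS", "GEN"]        # participating processor
II = ["MEM", "RESV", "IO", "GEN"]        # memory or I/O
LL = ["RESV", "L1", "L2", "L3/GEN"]      # cache level
RRRR = ["GEN", "RD", "WR", "DRD", "DWR", "IRD", "PRF", "EV", "SNP"]


def decode_mc_status(status):
    """Split a 64-bit MCi_STATUS value into flags and error code fields."""
    flags = [name for bit, name in sorted(STATUS_FLAGS.items(), reverse=True)
             if (status >> bit) & 1]
    ec = status & 0xFFFF             # architectural error code
    xec = (status >> 16) & 0xF       # extended error code (K8: bits 19:16)
    result = {"flags": flags, "xec": xec, "ec": ec}

    # Bus error pattern: 0000 1PPT RRRR IILL
    if (ec & 0xF800) == 0x0800:
        r = (ec >> 4) & 0xF
        result["bus"] = {
            "pp": PP[(ec >> 9) & 0x3],
            "timeout": bool((ec >> 8) & 0x1),
            "rrrr": RRRR[r] if r < len(RRRR) else "RESV",
            "ii": II[(ec >> 2) & 0x3],
            "ll": LL[ec & 0x3],
        }
    return result


if __name__ == "__main__":
    # Value taken from the log above.
    print(decode_mc_status(0xB200000000070F0F))
```

Under these assumptions the value from the log decodes to an enabled, uncorrected error with PCC set, extended error code 7 (which the log reports as the watchdog timeout), and a bus error with all fields generic and the timeout bit set, which agrees with the decoded lines the kernel printed.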