Re: Watchdog detected hard lockup.
From: Sergey Senozhatsky
Date: Thu May 18 2017 - 03:33:03 EST
Hello Sumit,
please Cc kernel mailing list.
Cc Petr, Peter, Steven, kernel-list
On (05/18/17 12:22), Sumit Gemini wrote:
> Hi Sergey,
>
> I read your comment on [1]https://lkml.org/lkml/2016/9/27/279, where you were talking about using alt_printk instead of printk.
> I got this crash; could it be related to printk being called recursively?
> Could you please tell me why I got this crash? Do you see any suspicious entry here?
>
> [2324956.184374] Kernel panic - not syncing: Watchdog detected hard LOCKUP on cpu 1
> [2324956.184374] Pid: 0, comm: kworker/0:0 Tainted: P 3.1.10-gb20-default #1
> [2324956.184374] Call Trace:
> [2324956.184374] [<ffffffff81003add>] dump_trace+0x98/0x223
> [2324956.184374] [<ffffffff813dc659>] dump_stack+0x69/0x6f
> [2324956.184374] [<ffffffff813def08>] panic+0xa4/0x1b4
> [2324956.184374] [<ffffffff81094b48>] watchdog_overflow_callback+0x79/0x9e
> [2324956.184374] [<ffffffff810b66b5>] __perf_event_overflow+0xf9/0x185
> [2324956.184374] [<ffffffff81010a2c>] intel_pmu_handle_irq+0x15d/0x1be
> [2324956.184374] [<ffffffff813f375b>] perf_event_nmi_handler+0x3e/0x87
> [2324956.184374] [<ffffffff813f4c0f>] notifier_call_chain+0x2e/0x5b
> [2324956.184374] [<ffffffff813f4c76>] __atomic_notifier_call_chain+0x3a/0x4d
> [2324956.184374] [<ffffffff813f4cc1>] notify_die+0x2d/0x32
> [2324956.184374] [<ffffffff813f2bbb>] default_do_nmi+0x29/0x1b5
> [2324956.184374] [<ffffffff813f2eb7>] do_nmi+0x47/0x6f
> [2324956.184374] [<ffffffff813f2a30>] nmi+0x20/0x30
> [2324956.184374] [<ffffffff8106b725>] arch_local_irq_save+0x1/0x17
> [2324956.184374] [<ffffffff813f1e9b>] _raw_spin_lock_irqsave+0xf/0x39
> [2324956.184374] [<ffffffff8105d9b7>] down_trylock+0xb/0x29
> [2324956.184374] [<ffffffff810409c2>] console_trylock+0xf/0x47
> [2324956.184374] [<ffffffff81040dc9>] console_unlock+0x100/0x129
> [2324956.184374] [<ffffffff8104120b>] vprintk+0x348/0x394
> [2324956.184374] [<ffffffff813df06b>] printk+0x53/0x58
> [2324956.184374] [<ffffffff81359902>] ip_handle_martian_source+0x6d/0xf6
> [2324956.184374] [<ffffffff8135b980>] ip_route_input_slow+0x45d/0x462
> [2324956.184374] [<ffffffff8135c02f>] ip_route_input_common+0x217/0x231
> [2324956.184374] [<ffffffff8137e995>] arp_process+0x181/0x4ac
> [2324956.184374] [<ffffffff81337bf5>] __netif_receive_skb+0x317/0x36b
> [2324956.184374] [<ffffffff81337b44>] __netif_receive_skb+0x266/0x36b
> [2324956.184374] [<ffffffff81339722>] netif_receive_skb+0x7e/0x84
> [2324956.184374] [<ffffffff8133979e>] napi_skb_finish+0x1c/0x31
> [2324956.184374] [<ffffffffa02e1dee>] igb_clean_rx_irq+0x30d/0x39e [igb]
> [2324956.184374] [<ffffffffa02e1ecd>] igb_poll+0x4e/0x74 [igb]
> [2324956.184374] [<ffffffff81339c88>] net_rx_action+0x65/0x178
> [2324956.184374] [<ffffffff81045c73>] __do_softirq+0xb2/0x19d
> [2324956.184374] [<ffffffff813f9aac>] call_softirq+0x1c/0x30
> [2324956.184374] [<ffffffff81003931>] do_softirq+0x3c/0x7b
> [2324956.184374] [<ffffffff81045f98>] irq_exit+0x3c/0xac
> [2324956.184374] [<ffffffff81003655>] do_IRQ+0x82/0x98
> [2324956.184374] [<ffffffff813f24ee>] common_interrupt+0x6e/0x6e
> [2324956.184374] [<ffffffff8123f91d>] intel_idle+0xdd/0x104
> [2324956.184374] [<ffffffff8130b76b>] cpuidle_idle_call+0xdf/0x181
> [2324956.184374] [<ffffffff81001201>] cpu_idle+0x56/0xa9
>
> I shall be very grateful to you.
>
Your CPU1 couldn't acquire the console_sem spin_lock:

down_trylock()
    raw_spin_lock_irqsave(&sem->lock, flags);

and _probably_ spun on it with local IRQs disabled long enough to
disappoint the NMI watchdog, which eventually declared a hard lockup
condition. But I'm not entirely sure it was stuck spinning on sem->lock;
it could also be that your net IRQ simply took too long, and by the time
the NMI watchdog declared the hard lockup we were just unlucky enough to
be executing down_trylock().
On the other hand, the fact that you see

IRQ
  ip_handle_martian_source()
    printk()
      console_unlock()

is already not a good sign. First of all, you hit the ip_route_input()
slow path; second, you hit pr_warn()/pr_err() or print_hex_dump(), so the
kernel was reporting something abnormal. printk(), invoked from IRQ
context, managed to acquire console_sem and then spent some time printing
the messages to the serial console (that's the console_unlock() in the
call trace), which can sometimes be asking for trouble. Cases like this
are exactly why printk() can't always be safely called from atomic
contexts.
I don't know how big ->hard_header_len can be, but that print_hex_dump()
can take some extra time. If other CPUs also happened to emit printk
messages in the meantime, then CPU1 had to spend even more time in
console_unlock() printing those messages as well, all from an IRQ handler.
Hm, this doesn't look like printk recursion to me, at least after a
quick look. Either:

a) something was holding sem->lock; I can't tell what exactly (unlikely)

or

b) your IRQ handler just happened to execute for way too long. Given
   that there was printk()->console_unlock() and probably a hex dump of
   the skb, this is the more likely root cause.
-ss