Re: [PATCH trivial-v2] RCU: Call touch_nmi_watchdog() while printing stall warnings
From: Paul E. McKenney
Date: Tue Jan 09 2018 - 15:05:22 EST
On Tue, Jan 09, 2018 at 10:52:22AM -0800, Tejun Heo wrote:
> >From c1d1ae75ba4bde7c833560248b3f561085fa850e Mon Sep 17 00:00:00 2001
> From: Tejun Heo <tj@xxxxxxxxxx>
> Date: Tue, 9 Jan 2018 10:38:17 -0800
> Subject: [PATCH] RCU: Call touch_nmi_watchdog() while printing stall warnings
>
> When RCU stall warning triggers, it can print out a lot of messages
> while holding spinlocks. If the console device is slow (e.g. an
> actual or IPMI serial console), it may end up triggering NMI hard
> lockup watchdog like the following.
Queued for testing and review, thank you!
Thanx, Paul
> *** CPU printking while holding RCU spinlock
>
> PID: 4149739 TASK: ffff881a46baa880 CPU: 13 COMMAND: "CPUThreadPool8"
> #0 [ffff881fff945e48] crash_nmi_callback at ffffffff8103f7d0
> #1 [ffff881fff945e58] nmi_handle at ffffffff81020653
> #2 [ffff881fff945eb0] default_do_nmi at ffffffff81020c36
> #3 [ffff881fff945ed0] do_nmi at ffffffff81020d32
> #4 [ffff881fff945ef0] end_repeat_nmi at ffffffff81956a7e
> [exception RIP: io_serial_in+21]
> RIP: ffffffff81630e55 RSP: ffff881fff943b88 RFLAGS: 00000002
> RAX: 000000000000ca00 RBX: ffffffff8230e188 RCX: 0000000000000000
> RDX: 00000000000002fd RSI: 0000000000000005 RDI: ffffffff8230e188
> RBP: ffff881fff943bb0 R8: 0000000000000000 R9: ffffffff820cb3c4
> R10: 0000000000000019 R11: 0000000000002000 R12: 00000000000026e1
> R13: 0000000000000020 R14: ffffffff820cd398 R15: 0000000000000035
> ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0000
> --- <NMI exception stack> ---
> #5 [ffff881fff943b88] io_serial_in at ffffffff81630e55
> #6 [ffff881fff943b90] wait_for_xmitr at ffffffff8163175c
> #7 [ffff881fff943bb8] serial8250_console_putchar at ffffffff816317dc
> #8 [ffff881fff943bd8] uart_console_write at ffffffff8162ac00
> #9 [ffff881fff943c08] serial8250_console_write at ffffffff81634691
> #10 [ffff881fff943c80] univ8250_console_write at ffffffff8162f7c2
> #11 [ffff881fff943c90] console_unlock at ffffffff810dfc55
> #12 [ffff881fff943cf0] vprintk_emit at ffffffff810dffb5
> #13 [ffff881fff943d50] vprintk_default at ffffffff810e01bf
> #14 [ffff881fff943d60] vprintk_func at ffffffff810e1127
> #15 [ffff881fff943d70] printk at ffffffff8119a8a4
> #16 [ffff881fff943dd0] print_cpu_stall_info at ffffffff810eb78c
> #17 [ffff881fff943e88] rcu_check_callbacks at ffffffff810ef133
> #18 [ffff881fff943ee8] update_process_times at ffffffff810f3497
> #19 [ffff881fff943f10] tick_sched_timer at ffffffff81103037
> #20 [ffff881fff943f38] __hrtimer_run_queues at ffffffff810f3f38
> #21 [ffff881fff943f88] hrtimer_interrupt at ffffffff810f442b
>
> *** CPU triggering the hardlockup watchdog
>
> PID: 4149709 TASK: ffff88010f88c380 CPU: 26 COMMAND: "CPUThreadPool35"
> #0 [ffff883fff1059d0] machine_kexec at ffffffff8104a874
> #1 [ffff883fff105a30] __crash_kexec at ffffffff811116cc
> #2 [ffff883fff105af0] __crash_kexec at ffffffff81111795
> #3 [ffff883fff105b08] panic at ffffffff8119a6ae
> #4 [ffff883fff105b98] watchdog_overflow_callback at ffffffff81135dbd
> #5 [ffff883fff105bb0] __perf_event_overflow at ffffffff81186866
> #6 [ffff883fff105be8] perf_event_overflow at ffffffff81192bc4
> #7 [ffff883fff105bf8] intel_pmu_handle_irq at ffffffff8100b265
> #8 [ffff883fff105df8] perf_event_nmi_handler at ffffffff8100489f
> #9 [ffff883fff105e58] nmi_handle at ffffffff81020653
> #10 [ffff883fff105eb0] default_do_nmi at ffffffff81020b94
> #11 [ffff883fff105ed0] do_nmi at ffffffff81020d32
> #12 [ffff883fff105ef0] end_repeat_nmi at ffffffff81956a7e
> [exception RIP: queued_spin_lock_slowpath+248]
> RIP: ffffffff810da958 RSP: ffff883fff103e68 RFLAGS: 00000046
> RAX: 0000000000000000 RBX: 0000000000000046 RCX: 00000000006d0000
> RDX: ffff883fff49a950 RSI: 0000000000d10101 RDI: ffffffff81e54300
> RBP: ffff883fff103e80 R8: ffff883fff11a950 R9: 0000000000000000
> R10: 000000000e5873ba R11: 000000000000010f R12: ffffffff81e54300
> R13: 0000000000000000 R14: ffff88010f88c380 R15: ffffffff81e54300
> ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
> --- <NMI exception stack> ---
> #13 [ffff883fff103e68] queued_spin_lock_slowpath at ffffffff810da958
> #14 [ffff883fff103e70] _raw_spin_lock_irqsave at ffffffff8195550b
> #15 [ffff883fff103e88] rcu_check_callbacks at ffffffff810eed18
> #16 [ffff883fff103ee8] update_process_times at ffffffff810f3497
> #17 [ffff883fff103f10] tick_sched_timer at ffffffff81103037
> #18 [ffff883fff103f38] __hrtimer_run_queues at ffffffff810f3f38
> #19 [ffff883fff103f88] hrtimer_interrupt at ffffffff810f442b
> --- <IRQ stack> ---
>
> Avoid spuriously triggering NMI hardlockup watchdog by touching it
> from the print functions. show_state_filter() shares the same problem
> and solution.
>
> v2: Relocate the comment to where it belongs.
>
> Signed-off-by: Tejun Heo <tj@xxxxxxxxxx>
> ---
> kernel/rcu/tree_plugin.h | 14 +++++++++++++-
> 1 file changed, 13 insertions(+), 1 deletion(-)
>
> diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
> index db85ca3..bf05bf9 100644
> --- a/kernel/rcu/tree_plugin.h
> +++ b/kernel/rcu/tree_plugin.h
> @@ -561,8 +561,14 @@ static void rcu_print_detail_task_stall_rnp(struct rcu_node *rnp)
> }
> t = list_entry(rnp->gp_tasks->prev,
> struct task_struct, rcu_node_entry);
> - list_for_each_entry_continue(t, &rnp->blkd_tasks, rcu_node_entry)
> + list_for_each_entry_continue(t, &rnp->blkd_tasks, rcu_node_entry) {
> + /*
> + * We could be printing a lot while holding a spinlock.
> + * Avoid triggering hard lockup.
> + */
> + touch_nmi_watchdog();
> sched_show_task(t);
> + }
> raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
> }
>
> @@ -1678,6 +1684,12 @@ static void print_cpu_stall_info(struct rcu_state *rsp, int cpu)
> char *ticks_title;
> unsigned long ticks_value;
>
> + /*
> + * We could be printing a lot while holding a spinlock. Avoid
> + * triggering hard lockup.
> + */
> + touch_nmi_watchdog();
> +
> if (rsp->gpnum == rdp->gpnum) {
> ticks_title = "ticks this GP";
> ticks_value = rdp->ticks_this_gp;
> --
> 2.9.5
>