nohz fail (was: perf related boot hang.)

From: Peter Zijlstra
Date: Thu Aug 07 2014 - 05:03:51 EST


On Wed, Aug 06, 2014 at 03:46:56PM -0400, Dave Jones wrote:
> This one happened during runtime, but I got a whole stack..
>
>
> Kernel panic - not syncing: Watchdog detected hard LOCKUP on cpu 2
> CPU: 2 PID: 7538 Comm: kworker/u8:8 Not tainted 3.16.0+ #34
> Workqueue: btrfs-endio-write normal_work_helper [btrfs]
> ffff880244c06c88 000000001b486fe1 ffff880244c06bf0 ffffffff8a7f1e37
> ffffffff8ac52a18 ffff880244c06c78 ffffffff8a7ef928 0000000000000010
> ffff880244c06c88 ffff880244c06c20 000000001b486fe1 0000000000000000
> Call Trace:
> <NMI> [<ffffffff8a7f1e37>] dump_stack+0x4e/0x7a
> [<ffffffff8a7ef928>] panic+0xd4/0x207
> [<ffffffff8a1450e8>] watchdog_overflow_callback+0x118/0x120
> [<ffffffff8a186b0e>] __perf_event_overflow+0xae/0x350
> [<ffffffff8a184f80>] ? perf_event_task_disable+0xa0/0xa0
> [<ffffffff8a01a4cf>] ? x86_perf_event_set_period+0xbf/0x150
> [<ffffffff8a187934>] perf_event_overflow+0x14/0x20
> [<ffffffff8a020386>] intel_pmu_handle_irq+0x206/0x410
> [<ffffffff8a01937b>] perf_event_nmi_handler+0x2b/0x50
> [<ffffffff8a007b72>] nmi_handle+0xd2/0x390
> [<ffffffff8a007aa5>] ? nmi_handle+0x5/0x390
> [<ffffffff8a0cb7f8>] ? match_held_lock+0x8/0x1b0
> [<ffffffff8a008062>] default_do_nmi+0x72/0x1c0
> [<ffffffff8a008268>] do_nmi+0xb8/0x100
> [<ffffffff8a7ff66a>] end_repeat_nmi+0x1e/0x2e
> [<ffffffff8a0cb7f8>] ? match_held_lock+0x8/0x1b0
> [<ffffffff8a0cb7f8>] ? match_held_lock+0x8/0x1b0
> [<ffffffff8a0cb7f8>] ? match_held_lock+0x8/0x1b0

Ok so that part is just the watchdog triggering, so the below part is
the screwy bit:

> <<EOE>> <IRQ> [<ffffffff8a0ccd2f>] lock_acquired+0xaf/0x450
> [<ffffffff8a0f74c5>] ? lock_hrtimer_base.isra.20+0x25/0x50
> [<ffffffff8a7fc678>] _raw_spin_lock_irqsave+0x78/0x90
> [<ffffffff8a0f74c5>] ? lock_hrtimer_base.isra.20+0x25/0x50
> [<ffffffff8a0f74c5>] lock_hrtimer_base.isra.20+0x25/0x50
> [<ffffffff8a0f7723>] hrtimer_try_to_cancel+0x33/0x1e0
> [<ffffffff8a0f78ea>] hrtimer_cancel+0x1a/0x30
> [<ffffffff8a109237>] tick_nohz_restart+0x17/0x90
> [<ffffffff8a10a213>] __tick_nohz_full_check+0xc3/0x100
> [<ffffffff8a10a25e>] nohz_full_kick_work_func+0xe/0x10
> [<ffffffff8a17c884>] irq_work_run_list+0x44/0x70
> [<ffffffff8a17c8da>] irq_work_run+0x2a/0x50
> [<ffffffff8a0f700b>] update_process_times+0x5b/0x70
> [<ffffffff8a109005>] tick_sched_handle.isra.21+0x25/0x60
> [<ffffffff8a109b81>] tick_sched_timer+0x41/0x60
> [<ffffffff8a0f7aa2>] __run_hrtimer+0x72/0x470
> [<ffffffff8a109b40>] ? tick_sched_do_timer+0xb0/0xb0
> [<ffffffff8a0f8707>] hrtimer_interrupt+0x117/0x270
> [<ffffffff8a034357>] local_apic_timer_interrupt+0x37/0x60
> [<ffffffff8a80010f>] smp_apic_timer_interrupt+0x3f/0x50
> [<ffffffff8a7fe52f>] apic_timer_interrupt+0x6f/0x80

And that looks like someone trying to cancel a timer from a timer, I
guess that won't work, seeing how cancel will wait for the timer handler
completion etc.

This is because of the fallback irq_work_run() in the tick
(update_process_times).

Attachment: pgpCiGBfNMLJW.pgp
Description: PGP signature