Re: frequent lockups in 3.18rc4

From: Frederic Weisbecker
Date: Thu Nov 20 2014 - 10:04:22 EST


On Wed, Nov 19, 2014 at 09:59:02AM -0500, Dave Jones wrote:
> On Tue, Nov 18, 2014 at 08:40:55PM -0800, Linus Torvalds wrote:
> > On Tue, Nov 18, 2014 at 6:19 PM, Dave Jones <davej@xxxxxxxxxx> wrote:
> > >
> > > NMI watchdog: BUG: soft lockup - CPU#2 stuck for 21s! [trinity-c42:31480]
> > > CPU: 2 PID: 31480 Comm: trinity-c42 Not tainted 3.18.0-rc5+ #91 [loadavg: 174.61 150.35 148.64 9/411 32140]
> > > RIP: 0010:[<ffffffff8a1798b4>] [<ffffffff8a1798b4>] context_tracking_user_enter+0xa4/0x190
> > > Call Trace:
> > > [<ffffffff8a012fc5>] syscall_trace_leave+0xa5/0x160
> > > [<ffffffff8a7d8624>] int_check_syscall_exit_work+0x34/0x3d
> >
> > Hmm, if we are getting soft-lockups here, maybe it suggests too much exit-work.
> >
> > Some TIF_NOHZ loop, perhaps? You have nohz on, don't you?
> >
> > That makes me wonder: does the problem go away if you disable NOHZ?
>
> Apparently not.
>
> NMI watchdog: BUG: soft lockup - CPU#3 stuck for 22s! [trinity-c75:25175]
> CPU: 3 PID: 25175 Comm: trinity-c75 Not tainted 3.18.0-rc5+ #92 [loadavg: 168.72 151.72 150.38 9/410 27945]
> task: ffff8800364e44d0 ti: ffff880192d2c000 task.ti: ffff880192d2c000
> RIP: 0010:[<ffffffff94175be7>] [<ffffffff94175be7>] context_tracking_user_exit+0x57/0x120
> RSP: 0018:ffff880192d2fee8 EFLAGS: 00000246
> RAX: 0000000000000000 RBX: 0000000100000046 RCX: 000000336ee35b47
> RDX: 0000000000000001 RSI: ffffffff94ac1e84 RDI: ffffffff94a93725
> RBP: ffff880192d2fef8 R08: 00007f9b74d0b740 R09: 0000000000000000
> R10: 0000000000000000 R11: 0000000000000246 R12: ffffffff940d8503
> R13: ffff880192d2fe98 R14: ffffffff943884e7 R15: ffff880192d2fe48
> FS: 00007f9b74d0b740(0000) GS:ffff880244600000(0000) knlGS:0000000000000000
> CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 000000336f1b7740 CR3: 0000000229a95000 CR4: 00000000001407e0
> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000600
> Stack:
> ffff880192d30000 0000000000080000 ffff880192d2ff78 ffffffff94012c25
> 00007f9b747a5000 00007f9b747a5068 0000000000000000 0000000000000000
> 0000000000000000 ffffffff9437b3be 0000000000000000 0000000000000000
> Call Trace:
> [<ffffffff94012c25>] syscall_trace_enter_phase1+0x125/0x1a0
> [<ffffffff9437b3be>] ? trace_hardirqs_on_thunk+0x3a/0x3f
> [<ffffffff947d41bf>] tracesys+0x14/0x4a
> Code: 42 fd ff 48 c7 c7 7a 1e ac 94 e8 25 29 21 00 65 8b 04 25 34 f7 1c 00 83 f8 01 74 28 f6 c7 02 74 13 0f 1f 00 e8 bb 43 fd ff 53 9d <5b> 41 5c 5d c3 0f 1f 40 00 53 9d e8 89 42 fd ff eb ee 0f 1f 80
> sending NMI to other CPUs:
> NMI backtrace for cpu 1
> CPU: 1 PID: 25164 Comm: trinity-c64 Not tainted 3.18.0-rc5+ #92 [loadavg: 168.72 151.72 150.38 9/410 27945]
> task: ffff88011600dbc0 ti: ffff8801a99a4000 task.ti: ffff8801a99a4000
> RIP: 0010:[<ffffffff940fb71e>] [<ffffffff940fb71e>] generic_exec_single+0xee/0x1a0
> RSP: 0018:ffff8801a99a7d18 EFLAGS: 00000202
> RAX: 0000000000000000 RBX: ffff8801a99a7d20 RCX: 0000000000000038
> RDX: 00000000000000ff RSI: 0000000000000008 RDI: 0000000000000000
> RBP: ffff8801a99a7d78 R08: ffff880242b57ce0 R09: 0000000000000000
> R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000003
> R13: 0000000000000001 R14: ffff880083c28948 R15: ffffffff94166aa0
> FS: 00007f9b74d0b740(0000) GS:ffff880244200000(0000) knlGS:0000000000000000
> CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 0000000000000001 CR3: 00000001d8611000 CR4: 00000000001407e0
> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000600
> Stack:
> ffff8801a99a7d28 0000000000000000 ffffffff94166aa0 ffff880083c28948
> 0000000000000003 00000000e38f9aac ffff880083c28948 00000000ffffffff
> 0000000000000003 ffffffff94166aa0 ffff880083c28948 0000000000000001
> Call Trace:
> [<ffffffff94166aa0>] ? perf_swevent_add+0x120/0x120
> [<ffffffff94166aa0>] ? perf_swevent_add+0x120/0x120
> [<ffffffff940fb89a>] smp_call_function_single+0x6a/0xe0

One thing that happens a lot in your crashes is a CPU sending IPIs. Maybe
it's stuck polling on csd->lock or something. But it's not the CPU that
soft locks up. At least not the first one that gets reported.
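
To illustrate the kind of polling I mean, here is a minimal user-space
sketch, not the kernel code. It is loosely modeled on how the sender of a
synchronous cross-CPU call waits for the target to release the csd. All
the *_model names, the pthread standing in for the target CPU, and the
bump() callback are assumptions made up for this example.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>

#define CSD_FLAG_LOCK 0x01u

/* Hypothetical, simplified stand-in for the kernel's call_single_data. */
struct csd_model {
	atomic_uint flags;
	void (*func)(void *info);
	void *info;
};

/*
 * Sender side: spin until the target clears the lock bit.
 * If the target never runs the callback, this loop never exits,
 * which is the "stuck polling" case. Returns the number of spins.
 */
static unsigned long csd_lock_wait_model(struct csd_model *csd)
{
	unsigned long spins = 0;

	while (atomic_load_explicit(&csd->flags, memory_order_acquire) &
	       CSD_FLAG_LOCK)
		spins++;
	return spins;
}

/* Target side: run the requested function, then release the csd. */
static void *target_cpu_model(void *arg)
{
	struct csd_model *csd = arg;

	csd->func(csd->info);
	atomic_fetch_and_explicit(&csd->flags, ~CSD_FLAG_LOCK,
				  memory_order_release);
	return NULL;
}

/* Example callback: increment the int that info points to. */
static void bump(void *info)
{
	(*(int *)info)++;
}

/*
 * Model of a synchronous single-CPU call: lock the csd, hand it to the
 * "target CPU" (a thread here, in place of an IPI), then poll until it
 * is unlocked. Returns the counter value after the call completes.
 */
static int call_function_single_model(void)
{
	int counter = 0;
	struct csd_model csd = { .func = bump, .info = &counter };
	pthread_t target;
	int rc;

	atomic_init(&csd.flags, CSD_FLAG_LOCK);
	rc = pthread_create(&target, NULL, target_cpu_model, &csd);
	assert(rc == 0);

	csd_lock_wait_model(&csd);	/* sender polls here */
	pthread_join(target, NULL);
	return counter;
}
```

Build with something like "cc -pthread" and call
call_function_single_model() from a main(). The point is only that the
sender makes no progress until the target handles the request, so a CPU
seen in such a wait loop may be a victim of the stall rather than its
cause.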

> [<ffffffff940a172b>] ? preempt_count_sub+0x7b/0x100
> [<ffffffff941671aa>] perf_event_read+0xca/0xd0
> [<ffffffff94167240>] perf_event_read_value+0x90/0xe0
> [<ffffffff941689c6>] perf_read+0x226/0x370
> [<ffffffff942fbfb7>] ? security_file_permission+0x87/0xa0
> [<ffffffff941eafff>] vfs_read+0x9f/0x180
> [<ffffffff941ebbd8>] SyS_read+0x58/0xd0
> [<ffffffff947d42c9>] tracesys_phase2+0xd4/0xd9