Re: frequent lockups in 3.18rc4

From: Andy Lutomirski
Date: Wed Nov 19 2014 - 16:01:49 EST


On 11/19/2014 06:59 AM, Dave Jones wrote:
> On Tue, Nov 18, 2014 at 08:40:55PM -0800, Linus Torvalds wrote:
> > On Tue, Nov 18, 2014 at 6:19 PM, Dave Jones <davej@xxxxxxxxxx> wrote:
> > >
> > > NMI watchdog: BUG: soft lockup - CPU#2 stuck for 21s! [trinity-c42:31480]
> > > CPU: 2 PID: 31480 Comm: trinity-c42 Not tainted 3.18.0-rc5+ #91 [loadavg: 174.61 150.35 148.64 9/411 32140]
> > > RIP: 0010:[<ffffffff8a1798b4>] [<ffffffff8a1798b4>] context_tracking_user_enter+0xa4/0x190
> > > Call Trace:
> > > [<ffffffff8a012fc5>] syscall_trace_leave+0xa5/0x160
> > > [<ffffffff8a7d8624>] int_check_syscall_exit_work+0x34/0x3d
> >
> > Hmm, if we are getting soft-lockups here, maybe it suggest too much exit-work.
> >
> > Some TIF_NOHZ loop, perhaps? You have nohz on, don't you?
> >
> > That makes me wonder: does the problem go away if you disable NOHZ?
>
> Aparently not.

TIF_NOHZ is not the same thing as NOHZ. Can you try a kernel with
CONFIG_CONTEXT_TRACKING=n? Doing that may involve fiddling with RCU
settings a bit. The normal no HZ idle stuff has nothing to do with
TIF_NOHZ, and you either have TIF_NOHZ set or you have some kind of
thread_info corruption going on here.

Hmm. This isn't a stack overflow, is it? That could cause all of these
problems quite easily, although I'd expect other symptoms, too.

>
> NMI watchdog: BUG: soft lockup - CPU#3 stuck for 22s! [trinity-c75:25175]
> CPU: 3 PID: 25175 Comm: trinity-c75 Not tainted 3.18.0-rc5+ #92 [loadavg: 168.72 151.72 150.38 9/410 27945]
> task: ffff8800364e44d0 ti: ffff880192d2c000 task.ti: ffff880192d2c000
> RIP: 0010:[<ffffffff94175be7>] [<ffffffff94175be7>] context_tracking_user_exit+0x57/0x120

This RIP should be impossible if context tracking is off.

> RSP: 0018:ffff880192d2fee8 EFLAGS: 00000246
> RAX: 0000000000000000 RBX: 0000000100000046 RCX: 000000336ee35b47

^^^^^^^^^

That is a strange coincidence. Where did 0x46 | (1<<32) come from?
That's a sensible interrupts-disabled flags value with the high part set
to 0x1. Those high bits are undefined, but they ought to all be zero.

> RDX: 0000000000000001 RSI: ffffffff94ac1e84 RDI: ffffffff94a93725
> RBP: ffff880192d2fef8 R08: 00007f9b74d0b740 R09: 0000000000000000
> R10: 0000000000000000 R11: 0000000000000246 R12: ffffffff940d8503
> R13: ffff880192d2fe98 R14: ffffffff943884e7 R15: ffff880192d2fe48
> FS: 00007f9b74d0b740(0000) GS:ffff880244600000(0000) knlGS:0000000000000000
> CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 000000336f1b7740 CR3: 0000000229a95000 CR4: 00000000001407e0
> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000600
> Stack:
> ffff880192d30000 0000000000080000 ffff880192d2ff78 ffffffff94012c25
> 00007f9b747a5000 00007f9b747a5068 0000000000000000 0000000000000000
> 0000000000000000 ffffffff9437b3be 0000000000000000 0000000000000000
> Call Trace:
> [<ffffffff94012c25>] syscall_trace_enter_phase1+0x125/0x1a0
> [<ffffffff9437b3be>] ? trace_hardirqs_on_thunk+0x3a/0x3f
> [<ffffffff947d41bf>] tracesys+0x14/0x4a
> Code: 42 fd ff 48 c7 c7 7a 1e ac 94 e8 25 29 21 00 65 8b 04 25 34 f7 1c 00 83 f8 01 74 28 f6 c7 02 74 13 0f 1f 00 e8 bb 43 fd ff 53 9d <5b> 41 5c 5d c3 0f 1f 40 00 53 9d e8 89 42 fd ff eb ee 0f 1f 80
> sending NMI to other CPUs:
> NMI backtrace for cpu 1
> CPU: 1 PID: 25164 Comm: trinity-c64 Not tainted 3.18.0-rc5+ #92 [loadavg: 168.72 151.72 150.38 9/410 27945]
> task: ffff88011600dbc0 ti: ffff8801a99a4000 task.ti: ffff8801a99a4000
> RIP: 0010:[<ffffffff940fb71e>] [<ffffffff940fb71e>] generic_exec_single+0xee/0x1a0
> RSP: 0018:ffff8801a99a7d18 EFLAGS: 00000202
> RAX: 0000000000000000 RBX: ffff8801a99a7d20 RCX: 0000000000000038
> RDX: 00000000000000ff RSI: 0000000000000008 RDI: 0000000000000000
> RBP: ffff8801a99a7d78 R08: ffff880242b57ce0 R09: 0000000000000000
> R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000003
> R13: 0000000000000001 R14: ffff880083c28948 R15: ffffffff94166aa0
> FS: 00007f9b74d0b740(0000) GS:ffff880244200000(0000) knlGS:0000000000000000
> CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 0000000000000001 CR3: 00000001d8611000 CR4: 00000000001407e0
> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000600
> Stack:
> ffff8801a99a7d28 0000000000000000 ffffffff94166aa0 ffff880083c28948
> 0000000000000003 00000000e38f9aac ffff880083c28948 00000000ffffffff
> 0000000000000003 ffffffff94166aa0 ffff880083c28948 0000000000000001
> Call Trace:
> [<ffffffff94166aa0>] ? perf_swevent_add+0x120/0x120
> [<ffffffff94166aa0>] ? perf_swevent_add+0x120/0x120
> [<ffffffff940fb89a>] smp_call_function_single+0x6a/0xe0
> [<ffffffff940a172b>] ? preempt_count_sub+0x7b/0x100
> [<ffffffff941671aa>] perf_event_read+0xca/0xd0
> [<ffffffff94167240>] perf_event_read_value+0x90/0xe0
> [<ffffffff941689c6>] perf_read+0x226/0x370
> [<ffffffff942fbfb7>] ? security_file_permission+0x87/0xa0
> [<ffffffff941eafff>] vfs_read+0x9f/0x180
> [<ffffffff941ebbd8>] SyS_read+0x58/0xd0
> [<ffffffff947d42c9>] tracesys_phase2+0xd4/0xd9

Riddle me this: what are we doing in tracesys_phase2? This is a full
slow-path syscall. TIF_NOHZ doesn't cause that, I think. I'd love to
see the value of ti->flags here. Is trinity using ptrace?

Um. There's a bug. Patch coming after lunch. No clue whether it will
help here.

--Andy

> Code: 48 89 de 48 03 14 c5 20 65 d1 94 48 89 df e8 8a 4b 28 00 84 c0 75 46 45 85 ed 74 11 f6 43 18 01 74 0b 0f 1f 00 f3 90 f6 43 18 01 <75> f8 31 c0 48 8b 4d c8 65 48 33 0c 25 28 00 00 00 0f 85 8e 00
> NMI backtrace for cpu 0
> INFO: NMI handler (arch_trigger_all_cpu_backtrace_handler) took too long to run: 35.055 msecs
> CPU: 0 PID: 0 Comm: swapper/0 Not tainted 3.18.0-rc5+ #92 [loadavg: 168.72 151.72 150.38 11/410 27945]
> task: ffffffff94c164c0 ti: ffffffff94c00000 task.ti: ffffffff94c00000
> RIP: 0010:[<ffffffff943dd415>] [<ffffffff943dd415>] intel_idle+0xd5/0x180
> RSP: 0018:ffffffff94c03e28 EFLAGS: 00000046
> RAX: 0000000000000032 RBX: 0000000000000010 RCX: 0000000000000001
> RDX: 0000000000000000 RSI: ffffffff94c03fd8 RDI: 0000000000000000
> RBP: ffffffff94c03e58 R08: 000000008baf8b86 R09: 0000000000000000
> R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000005
> R13: 0000000000000032 R14: 0000000000000004 R15: ffffffff94c00000
> FS: 0000000000000000(0000) GS:ffff880244000000(0000) knlGS:0000000000000000
> CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 00007f162e060000 CR3: 0000000014c11000 CR4: 00000000001407f0
> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000600
> Stack:
> 0000000094c03e58 5955c5b31ad5e8cf ffffe8fffee031a8 0000000000000005
> ffffffff94ca9dc0 0000000000000000 ffffffff94c03ea8 ffffffff94661f05
> 00001cb7dcf6fd93 ffffffff94ca9f90 ffffffff94c00000 ffffffff94d18870
> Call[31557.908912] NMI backtrace for cpu 2
> INFO: NMI handler (arch_trigger_all_cpu_backtrace_handler) took too long to run: 68.178 msecs
> CPU: 2 PID: 0 Comm: swapper/2 Not tainted 3.18.0-rc5+ #92 [loadavg: 168.72 151.72 150.38 10/410 27945]
> task: ffff880242b596f0 ti: ffff880242b6c000 task.ti: ffff880242b6c000
> RIP: 0010:[<ffffffff943dd415>] [<ffffffff943dd415>] intel_idle+0xd5/0x180
> RSP: 0018:ffff880242b6fdf8 EFLAGS: 00000046
> RAX: 0000000000000032 RBX: 0000000000000010 RCX: 0000000000000001
> RDX: 0000000000000000 RSI: ffff880242b6ffd8 RDI: 0000000000000002
> RBP: ffff880242b6fe28 R08: 000000008baf8b86 R09: 0000000000000000
> R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000005
> R13: 0000000000000032 R14: 0000000000000004 R15: ffff880242b6c000
> FS: 0000000000000000(0000) GS:ffff880244400000(0000) knlGS:0000000000000000
> CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 0000000000000000 CR3: 0000000014c11000 CR4: 00000000001407e0
> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000600
> Stack:
> 0000000242b6fe28 da97aa9b9f42090a ffffe8ffff2031a8 0000000000000005
> ffffffff94ca9dc0 0000000000000002 ffff880242b6fe78 ffffffff94661f05
> 00001cb7dcdd1af6 ffffffff94ca9f90 ffff880242b6c000 ffffffff94d18870
> Call Trace:
> [<ffffffff94661f05>] cpuidle_enter_state+0x55/0x1c0
> [<ffffffff94662127>] cpuidle_enter+0x17/0x20
> [<ffffffff940b94a3>] cpu_startup_entry+0x423/0x4d0
> [<ffffffff9402b763>] start_secondary+0x1a3/0x220
> Code: 31 d2 65 48 8b 34 25 08 ba 00 00 48 8d 86 38 c0 ff ff 48 89 d1 0f 01 c8 48 8b 86 38 c0 ff ff a8 08 75 08 b1 01 4c 89 e8 0f 01 c9 <65> 48 8b 0c 25 08 ba 00 00 f0 80 a1 3a c0 ff ff df 0f ae f0 48
> INFO: NMI handler (arch_trigger_all_cpu_backtrace_handler) took too long to run: 95.994 msecs
>
>
> <tangent>
>
> Also, today I learned we can reach the perf_event_read code from
> read(). Given I had /proc/sys/kernel/perf_event_paranoid set to 1,
> I'm not sure how this is even possible. The only user of perf_fops
> is perf_event_open syscall _after_ it's checked that sysctl.
>
> Oh, there's an ioctl path to perf too. Though trinity
> doesn't know anything about it, so I find it surprising if it
> managed to pull the right combination of entropy to make that
> do the right thing. Still, that ioctl path probably needs
> to also be checking that sysctl shouldn't it ?
>
> Dave
>

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/