Re: more hangs in the tty layer

From: Peter Hurley
Date: Thu Nov 03 2016 - 20:40:12 EST


On Tue, Oct 4, 2016 at 2:10 AM, Nikolay Borisov <kernel@xxxxxxxx> wrote:
> Hello Peter,
>
> I've emailed you before re. spurious hangs in the TTY layer, but
> at that time I was running a rather old (but LTS) 3.12 kernel. Now,
> I'm running a 4.4.10 and I still observe the following lock-ups.
> I have multiple processes which hang with the following callstack:
>
> [4592064.967445] sulogin D ffff8839bca53b28 0 38279 38274 0x00000006
> [4592064.967448] ffff8839bca53b28 ffff881ff2ed44c0 ffff883748d85280 ffff8839bca54000
> [4592064.967452] ffff883748d85280 ffff881ff07c988c 00000000ffffffff ffff881ff07c9890
> [4592064.967454] ffff8839bca53b40 ffffffff8161020c ffff881ff07c9888 ffff8839bca53b50
> [4592064.967458] Call Trace:
> [4592064.967460] [<ffffffff8161020c>] schedule+0x3c/0x90
> [4592064.967463] [<ffffffff81610505>] schedule_preempt_disabled+0x15/0x20
> [4592064.967466] [<ffffffff816121e0>] __mutex_lock_slowpath+0xb0/0x130
> [4592064.967468] [<ffffffff8161228d>] mutex_lock+0x2d/0x40
> [4592064.967471] [<ffffffff81614388>] tty_lock+0x38/0x90
> [4592064.967474] [<ffffffff8138e6a5>] tty_release+0x55/0x590
> [4592064.967478] [<ffffffff8119a8ea>] __fput+0xea/0x1f0
> [4592064.967481] [<ffffffff8119aa2e>] ____fput+0xe/0x10
> [4592064.967485] [<ffffffff8106fc6f>] task_work_run+0x7f/0xb0
> [4592064.967488] [<ffffffff810554d3>] do_exit+0x373/0xb70
> [4592064.967490] [<ffffffff816120c2>] ? __mutex_unlock_slowpath+0x32/0x40
> [4592064.967493] [<ffffffff816120eb>] ? mutex_unlock+0x1b/0x20
> [4592064.967495] [<ffffffff81390e5b>] ? n_tty_read+0x31b/0x9e0
> [4592064.967498] [<ffffffff81055d53>] do_group_exit+0x43/0xd0
> [4592064.967500] [<ffffffff81061bd1>] get_signal+0x2e1/0x710
> [4592064.967503] [<ffffffff810053c8>] do_signal+0x28/0x770
> [4592064.967505] [<ffffffff81394d36>] ? tty_ldisc_deref+0x16/0x20
> [4592064.967508] [<ffffffff8138cfbc>] ? tty_read+0x9c/0xf0
> [4592064.967511] [<ffffffff811987a8>] ? __vfs_read+0x28/0xd0
> [4592064.967514] [<ffffffff812a7d53>] ? security_file_permission+0xa3/0xc0
> [4592064.967517] [<ffffffff810021da>] exit_to_usermode_loop+0x8a/0xd0
> [4592064.967520] [<ffffffff81002a90>] syscall_return_slowpath+0x80/0xf0
> [4592064.967523] [<ffffffff816145f4>] int_ret_from_sys_call+0x25/0x9f
>
> After debugging the memory dump of the server it turned out the
> following task is holding the mutex on which the sulogin is stuck:
>
> PID: 38274 TASK: ffff88395a109b80 CPU: 13 COMMAND: "sh"
> #0 [ffff8837aba27ac8] __schedule at ffffffff8160fa9a
> #1 [ffff8837aba27b10] schedule at ffffffff8161020c
> #2 [ffff8837aba27b28] schedule_timeout at ffffffff81613472
> #3 [ffff8837aba27ba8] ldsem_down_write at ffffffff81613cc5
> #4 [ffff8837aba27c10] tty_ldisc_lock at ffffffff816142e6
> #5 [ffff8837aba27c28] tty_ldisc_hangup at ffffffff8139561a
> #6 [ffff8837aba27c50] __tty_hangup at ffffffff8138cd03
> #7 [ffff8837aba27ca0] disassociate_ctty at ffffffff8138d2b9
> #8 [ffff8837aba27cd0] disassociate_ctty at ffffffff8138d5d9
> #9 [ffff8837aba27ce0] do_exit at ffffffff81055908
> #10 [ffff8837aba27d48] do_group_exit at ffffffff81055d53
> #11 [ffff8837aba27d70] get_signal at ffffffff81061bd1
> #12 [ffff8837aba27de8] do_signal at ffffffff810053c8
> #13 [ffff8837aba27ee0] exit_to_usermode_loop at ffffffff810021da
> #14 [ffff8837aba27f18] syscall_return_slowpath at ffffffff81002a90
> #15 [ffff8837aba27f50] int_ret_from_sys_call at ffffffff816145f4
> RIP: 00007f18fde88aba RSP: 00007ffc47a1fa78 RFLAGS: 00000246
> RAX: fffffffffffffe00 RBX: 0000561242611d20 RCX: 00007f18fde88aba
> RDX: 0000000000000000 RSI: 00007ffc47a1fa9c RDI: 00000000ffffffff
> RBP: 000056124260f948 R8: 0000000000000000 R9: 00007f18fe3a5740
> R10: 0000000000000000 R11: 0000000000000246 R12: 00007ffc47a1fb20
> R13: 00007ffc47a1faa0 R14: 0000000000000000 R15: 0000000000000001
> ORIG_RAX: 000000000000003d CS: 0033 SS: 002b
>
> But I also observed also the following hung-up task,
> waiting for read access:
>
> [4592064.968193] index.pl D ffff881430387ce8 0 47168 1 0x00000004
> [4592064.968196] ffff881430387ce8 ffff881ff2ed0dc0 ffff8808a0960000 ffff881430388000
> [4592064.968200] ffff881ff07c9828 ffff881ff07c9830 ffff8808a0960000 7fffffffffffffff
> [4592064.968203] ffff881430387d00 ffffffff8161020c ffffffff00000004 ffff881430387d80
> [4592064.968205] Call Trace:
> [4592064.968208] [<ffffffff8161020c>] schedule+0x3c/0x90
> [4592064.968211] [<ffffffff81613472>] schedule_timeout+0x202/0x260
> [4592064.968216] [<ffffffff812a6a48>] ? security_capable+0x48/0x60
> [4592064.968219] [<ffffffff81613ab2>] ldsem_down_read+0x122/0x240
> [4592064.968222] [<ffffffff813992cc>] ? vt_ioctl+0x6c/0x1460
> [4592064.968225] [<ffffffff81394c90>] tty_ldisc_ref_wait+0x20/0x50
> [4592064.968228] [<ffffffff8138da37>] tty_ioctl+0x417/0xd40
> [4592064.968232] [<ffffffff8115a976>] ? handle_mm_fault+0x19b6/0x19c0
> [4592064.968236] [<ffffffff811abe8f>] do_vfs_ioctl+0x30f/0x560
> [4592064.968239] [<ffffffff811af8f2>] ? dput+0x192/0x230
> [4592064.968242] [<ffffffff811ac159>] SyS_ioctl+0x79/0x90
> [4592064.968245] [<ffffffff8161449b>] entry_SYSCALL_64_fastpath+0x16/0x6e
>
> I checked and all the tasks are working on the same tty_struct. Here
> is the state of the struct: http://paste.ubuntu.com/23273973/
>
> To me this seems like a lost wakeup event, especially that I've
> seen patches flying around fixing such issues. Unfortunately I
> cannot reliably reproduce this and booting a production machine
> with lockdep would be problematic.

Too bad because lockdep would definitively tell you what process is
holding the ldsem for this tty.
If you paste *all the tasks*, I can try to see if I recognize the task
stack that is holding the ldsem.

Regards,
Peter Hurley