Re: INFO: task hung in sync_blockdev

From: Jan Kara
Date: Thu Feb 08 2018 - 09:08:41 EST


On Thu 08-02-18 14:28:08, Dmitry Vyukov wrote:
> On Thu, Feb 8, 2018 at 10:28 AM, Jan Kara <jack@xxxxxxx> wrote:
> > On Wed 07-02-18 07:52:29, Andi Kleen wrote:
> >> > #0: (&bdev->bd_mutex){+.+.}, at: [<0000000040269370>]
> >> > __blkdev_put+0xbc/0x7f0 fs/block_dev.c:1757
> >> > 1 lock held by blkid/19199:
> >> > #0: (&bdev->bd_mutex){+.+.}, at: [<00000000b4dcaa18>]
> >> > __blkdev_get+0x158/0x10e0 fs/block_dev.c:1439
> >> > #1: (&ldata->atomic_read_lock){+.+.}, at: [<0000000033edf9f2>]
> >> > n_tty_read+0x2ef/0x1a00 drivers/tty/n_tty.c:2131
> >> > 1 lock held by syz-executor5/19330:
> >> > #0: (&bdev->bd_mutex){+.+.}, at: [<00000000b4dcaa18>]
> >> > __blkdev_get+0x158/0x10e0 fs/block_dev.c:1439
> >> > 1 lock held by syz-executor5/19331:
> >> > #0: (&bdev->bd_mutex){+.+.}, at: [<00000000b4dcaa18>]
> >> > __blkdev_get+0x158/0x10e0 fs/block_dev.c:1439
> >>
> >> It seems multiple processes deadlocked on the bd_mutex.
> >> Unfortunately there's no backtrace for the lock acquisitions,
> >> so it's hard to see the exact sequence.
> >
> > Well, all in the report points to a situation where some IO was submitted
> > to the block device and never completed (more exactly it took longer than
> > those 120s to complete that IO). It would need more digging into the
> > syzkaller program to find out what kind of device that was and possibly why
> > the IO took so long to complete...
>
>
> Would a traceback of all task stacks help in this case?
> What I've seen in several "task hung" reports is that the CPU
> traceback is not showing anything useful. So perhaps it should be
> changed to task traceback? Or it would not help either?

Task stack traceback for all tasks (usually only tasks in D state - i.e.
sysrq-w - are enough actually) would definitely help for debugging
deadlocks on sleeping locks. For this particular case I'm not sure if it
would help or not since it is quite possible the IO is just sitting in some
queue never getting processed due to some racing syzkaller process tearing
down the device in the wrong moment or something like that... Such case is
very difficult to debug without full kernel crashdump of the hung kernel
(or a reproducer for that matter) and even with that it is usually rather
time consuming. But for the deadlocks which do occur more frequently it
would be probably worth the time so it would be nice if such option was
eventually available.

Honza
--
Jan Kara <jack@xxxxxxxx>
SUSE Labs, CR