Linux-next parallel cp workload hang

From: Xiong Zhou
Date: Tue May 17 2016 - 21:46:24 EST


Hi,

Parallel cp workload (xfstests generic/273) hangs like blow.
It's reproducible with a small chance, less the 1/100 i think.

Have hit this in linux-next 20160504 0506 0510 trees, testing on
xfs with loop or block device. Ext4 survived several rounds
of testing.

Linux next 20160510 tree hangs within 500 rounds testing several
times. The same tree with vfs parallel lookup patchset reverted
survived 900 rounds testing. Reverted commits are attached.

Bisecting in this patchset ided this commit:

3b0a3c1ac1598722fc289da19219d14f2a37b31f is the first bad commit
commit 3b0a3c1ac1598722fc289da19219d14f2a37b31f
Author: Al Viro <viro@xxxxxxxxxxxxxxxxxx>
Date: Wed Apr 20 23:42:46 2016 -0400

simple local filesystems: switch to ->iterate_shared()

no changes needed (XFS isn't simple, but it has the same parallelism
in the interesting parts exercised from CXFS).

With this commit reverted on top of Linux next 0510 tree, 5000+ rounds
of testing passed.

Although 2000 rounds testing had been conducted before good/bad
verdict, i'm not 100 percent sure about all this, since it's
so hard to hit, and i am not that lucky..

Bisect log and full blocked state process dump log are also attached.

Furthermore, this was first hit when testing fs dax on nvdimm,
however it's reproducible without dax mount option, and also
reproducible on loop device, just seems harder to hit.

Thanks,
Xiong

[0.771475] INFO: task cp:49033 blocked for more than 120 seconds.
[0.794263] Not tainted 4.6.0-rc6-next-20160504 #5
[0.812515] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables
this message.
[0.841801] cp D ffff880b4e977928 0 49033 49014
0x00000080
[0.868923] ffff880b4e977928 ffff880ba275d380 ffff880b8d712b80
ffff880b4e978000
[0.897504] 7fffffffffffffff 0000000000000002 0000000000000000
ffff880b8d712b80
[0.925234] ffff880b4e977940 ffffffff816cbc25 ffff88035a1dabb0
ffff880b4e9779e8
[0.953237] Call Trace:
[0.958314] [<ffffffff816cbc25>] schedule+0x35/0x80
[0.974854] [<ffffffff816ceb71>] schedule_timeout+0x231/0x2d0
[0.995728] [<ffffffff810d2e2d>] ? down_trylock+0x2d/0x40
[1.015351] [<ffffffffa03fa6c2>] ? xfs_iext_bno_to_ext+0xa2/0x190 [xfs]
[1.040182] [<ffffffff816cdb89>] __down_common+0xaa/0x104
[1.059021] [<ffffffffa04076d2>] ? _xfs_buf_find+0x162/0x340 [xfs]
[1.081357] [<ffffffff816cdc00>] __down+0x1d/0x1f
[1.097166] [<ffffffff810d2e81>] down+0x41/0x50
[1.112869] [<ffffffffa04074bc>] xfs_buf_lock+0x3c/0xf0 [xfs]
[1.134504] [<ffffffffa04076d2>] _xfs_buf_find+0x162/0x340 [xfs]
[1.156871] [<ffffffffa04078da>] xfs_buf_get_map+0x2a/0x270 [xfs]
[1.180010] [<ffffffffa04084ed>] xfs_buf_read_map+0x2d/0x180 [xfs]
[1.203538] [<ffffffffa0435f21>] xfs_trans_read_buf_map+0xf1/0x300 [xfs]
[1.229194] [<ffffffffa03e75b1>] xfs_da_read_buf+0xd1/0x100 [xfs]
[1.251948] [<ffffffffa03ed5e6>] xfs_dir3_data_read+0x26/0x60 [xfs]
[1.275736] [<ffffffffa040923e>]
xfs_dir2_leaf_readbuf.isra.12+0x1be/0x4a0 [xfs]
[1.305094] [<ffffffff816cdc92>] ? down_read+0x12/0x30
[1.323787] [<ffffffffa04187a4>] ? xfs_ilock+0xe4/0x110 [xfs]
[1.345114] [<ffffffffa040965b>] xfs_dir2_leaf_getdents+0x13b/0x3d0
[xfs]
[1.371818] [<ffffffffa0409f26>] xfs_readdir+0x1a6/0x1c0 [xfs]
[1.393471] [<ffffffffa040bcbb>] xfs_file_readdir+0x2b/0x30 [xfs]
[1.416874] [<ffffffff8122cec3>] iterate_dir+0x173/0x190
[1.436709] [<ffffffff81003276>] ? do_audit_syscall_entry+0x66/0x70
[1.460951] [<ffffffff8122d338>] SyS_getdents+0x98/0x120
[1.480566] [<ffffffff8122cee0>] ? iterate_dir+0x190/0x190
[1.500909] [<ffffffff81003b12>] do_syscall_64+0x62/0x110
[1.520847] [<ffffffff816cf9a1>] entry_SYSCALL64_slow_path+0x25/0x25
[1.545372] INFO: task cp:49040 blocked for more than 120 seconds.
[1.568933] Not tainted 4.6.0-rc6-next-20160504 #5
[1.587943] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables
this message.
[1.618544] cp D ffff880b91463b00 0 49040 49016
0x00000080
[1.645502] ffff880b91463b00 ffff880464d5c140 ffff88029b475700
ffff880b91464000
[1.674145] ffff880411c42610 ffffffffffffffff ffff880411c42628
ffff8802c10bc610
[1.702834] ffff880b91463b18 ffffffff816cbc25 ffff88029b475700
ffff880b91463b88
[1.731501] Call Trace:
[1.736866] [<ffffffff816cbc25>] schedule+0x35/0x80
[1.754119] [<ffffffff816ce3c2>] rwsem_down_read_failed+0xf2/0x140
[1.777411] [<ffffffffa0418800>] ? xfs_ilock_data_map_shared+0x30/0x40
[xfs]
[1.805090] [<ffffffff81355e08>] call_rwsem_down_read_failed+0x18/0x30
[1.830482] [<ffffffff816cdca0>] down_read+0x20/0x30
[1.848505] [<ffffffffa04187a4>] xfs_ilock+0xe4/0x110 [xfs]
[1.869293] [<ffffffffa0418800>] xfs_ilock_data_map_shared+0x30/0x40
[xfs]
[1.896775] [<ffffffffa040bc60>] xfs_dir_open+0x30/0x60 [xfs]
[1.917882] [<ffffffff8121603f>] do_dentry_open+0x20f/0x320
[1.938919] [<ffffffffa040bc30>] ? xfs_file_mmap+0x50/0x50 [xfs]
[1.961532] [<ffffffff81217487>] vfs_open+0x57/0x60
[1.978945] [<ffffffff81226f65>] path_openat+0x325/0x14e0
[1.999273] [<ffffffff81228533>] ? putname+0x53/0x60
[2.017695] [<ffffffff81229581>] do_filp_open+0x91/0x100
[2.036893] [<ffffffff81237866>] ? __alloc_fd+0x46/0x180
[2.055479] [<ffffffff812177e4>] do_sys_open+0x124/0x210
[2.073783] [<ffffffff8113110b>] ? __audit_syscall_exit+0x1db/0x260
[2.096426] [<ffffffff81217904>] SyS_openat+0x14/0x20
[2.113690] [<ffffffff81003b12>] do_syscall_64+0x62/0x110
[2.132417] [<ffffffff816cf9a1>] entry_SYSCALL64_slow_path+0x25/0x25

Attachment: g273-block-dumps.tar.gz
Description: application/gzip