Re: Linux-next parallel cp workload hang

From: Xiong Zhou
Date: Wed May 18 2016 - 07:46:29 EST


Hi,

On Wed, May 18, 2016 at 07:54:09PM +1000, Dave Chinner wrote:
> On Wed, May 18, 2016 at 04:31:50PM +0800, Xiong Zhou wrote:
> > Hi,
> >
> > On Wed, May 18, 2016 at 03:56:34PM +1000, Dave Chinner wrote:
> > > On Wed, May 18, 2016 at 09:46:15AM +0800, Xiong Zhou wrote:
> > > > Hi,
> > > >
> > > > Parallel cp workload (xfstests generic/273) hangs like below.
> > > > It's reproducible with a small chance, less than 1/100 I think.
> > > >
> > > > Have hit this in linux-next 20160504 0506 0510 trees, testing on
> > > > xfs with loop or block device. Ext4 survived several rounds
> > > > of testing.
> > > >
> > > > Linux next 20160510 tree hangs within 500 rounds testing several
> > > > times. The same tree with vfs parallel lookup patchset reverted
> > > > survived 900 rounds testing. Reverted commits are attached.
> > >
> > > What hardware?
> >
> > A HP prototype host.
>
> description? cpus, memory, etc? I want to have some idea of what
> hardware I need to reproduce this...

#lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 48
On-line CPU(s) list: 0-47
Thread(s) per core: 2
Core(s) per socket: 12
Socket(s): 2
NUMA node(s): 2
Vendor ID: GenuineIntel
CPU family: 6
Model: 63
Model name: Intel(R) Xeon(R) CPU E5-2690 v3 @ 2.60GHz
Stepping: 2
CPU MHz: 2596.918
BogoMIPS: 5208.33
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 30720K
NUMA node0 CPU(s): 0-11,24-35
NUMA node1 CPU(s): 12-23,36-47

#free -m
              total        used        free      shared  buff/cache   available
Mem:          31782         623       27907           9        3251       30491
Swap:         10239           0       10239

>
> xfs_info from the scratch filesystem would also be handy.

meta-data=/dev/pmem1             isize=256    agcount=4, agsize=131072 blks
         =                       sectsz=4096  attr=2, projid32bit=1
         =                       crc=0        finobt=0
data     =                       bsize=4096   blocks=524288, imaxpct=25
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0 ftype=0
log      =internal               bsize=4096   blocks=2560, version=2
         =                       sectsz=4096  sunit=1 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0
>
> > > Can you reproduce this with CONFIG_XFS_DEBUG=y set? if you can, and
> > > it doesn't trigger any warnings or asserts, can you then try to
> > > reproduce it while tracing the following events:
> > >
> > > xfs_buf_lock
> > > xfs_buf_lock_done
> > > xfs_buf_trylock
> > > xfs_buf_unlock
> > >
> > > So we might be able to see if there's an unexpected buffer
> > > locking/state pattern occurring when the hang occurs?
> >
> > Yes, I've reproduced this with both CONFIG_XFS_DEBUG=y and the tracers
> > on. There is some trace output after it has hung for a while.
>
> I'm not actually interested in the trace after the hang - I'm
> interested in what happened leading up to the hang. The output
> you've given me tells me that the directory block at offset is locked
> but nothing in the trace tells me what locked it.
>
> Can I suggest using trace-cmd to record the events, then when the
> test hangs kill the check process so that trace-cmd terminates and
> gathers the events. Then dump the report to a text file and attach
> that?

Sure. Trace report, dmesg, ps axjf after Ctrl+C are attached.
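
For reference, the procedure suggested above could be sketched roughly as
follows. This is only a sketch, not necessarily the exact commands used for
the attached report: the four event names are the ones listed in this
thread, while the "xfs:" event-system prefix, the ./check invocation from
the xfstests directory, and the output file name are assumptions. The
script prints the commands instead of running them so they can be reviewed
first.

```shell
#!/bin/sh
# Sketch only: event names come from the thread; the xfstests invocation
# and the report file name below are assumptions.
EVENTS="xfs_buf_lock xfs_buf_lock_done xfs_buf_trylock xfs_buf_unlock"

# Build the trace-cmd argument list: one "-e xfs:<event>" per event.
build_event_args() {
    args=""
    for ev in $EVENTS; do
        args="$args -e xfs:$ev"
    done
    # Strip the leading space before printing.
    echo "${args# }"
}

# Print the commands rather than executing them.
echo "trace-cmd record $(build_event_args) ./check generic/273"
echo "trace-cmd report > g273-trace-report.txt"
```

When the test hangs, killing the check process lets trace-cmd record
finish and write its trace.dat file, which trace-cmd report reads by
default.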

Thanks for the instructions and patience.
Xiong
>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> david@xxxxxxxxxxxxx

Attachment: g273-trace-report.tar.gz
Description: application/gzip