Re: 2.6.31+2.6.31.4: XFS - All I/O locks up to D-state after 24-48hours (sysrq-t+w available)

From: Dave Chinner
Date: Sun Oct 18 2009 - 23:05:21 EST


On Sun, Oct 18, 2009 at 04:17:42PM -0400, Justin Piszcz wrote:
> It has happened again, all sysrq-X output was saved this time.
>
> wget http://home.comcast.net/~jpiszcz/20091018/crash.txt
> wget http://home.comcast.net/~jpiszcz/20091018/dmesg.txt
> wget http://home.comcast.net/~jpiszcz/20091018/interrupts.txt
> wget http://home.comcast.net/~jpiszcz/20091018/sysrq-l.txt
> wget http://home.comcast.net/~jpiszcz/20091018/sysrq-m.txt
> wget http://home.comcast.net/~jpiszcz/20091018/sysrq-p.txt
> wget http://home.comcast.net/~jpiszcz/20091018/sysrq-q.txt
> wget http://home.comcast.net/~jpiszcz/20091018/sysrq-t.txt
> wget http://home.comcast.net/~jpiszcz/20091018/sysrq-w.txt
.....
> Again, some more D-state processes:
>
> [76325.608073] pdflush D 0000000000000001 0 362 2 0x00000000
> [76325.608087] Call Trace:
> [76325.608095] [<ffffffff811ea1c0>] ? xfs_trans_brelse+0x30/0x130
> [76325.608099] [<ffffffff811dc44c>] ? xlog_state_sync+0x26c/0x2a0
> [76325.608103] [<ffffffff810513e0>] ? default_wake_function+0x0/0x10
> [76325.608106] [<ffffffff811dc4d1>] ? _xfs_log_force+0x51/0x80
> [76325.608108] [<ffffffff811dc50b>] ? xfs_log_force+0xb/0x40
>
> [76325.608202] xfssyncd D 0000000000000000 0 831 2 0x00000000
> [76325.608214] Call Trace:
> [76325.608216] [<ffffffff811dc229>] ? xlog_state_sync+0x49/0x2a0
> [76325.608220] [<ffffffff811d3485>] ? __xfs_iunpin_wait+0x95/0xe0
> [76325.608222] [<ffffffff81069c20>] ? autoremove_wake_function+0x0/0x30
> [76325.608225] [<ffffffff811d566d>] ? xfs_iflush+0xdd/0x2f0
> [76325.608228] [<ffffffff811fbe28>] ? xfs_reclaim_inode+0x148/0x190
> [76325.608231] [<ffffffff811fbe70>] ? xfs_reclaim_inode_now+0x0/0xa0
> [76325.608233] [<ffffffff811fc8dc>] ? xfs_inode_ag_walk+0x6c/0xc0
> [76325.608236] [<ffffffff811fbe70>] ? xfs_reclaim_inode_now+0x0/0xa0
>
> All of the D-state processes:

All pointing to log IO not completing.

That is, all of the D state processes are backed up on locks or
waiting for IO completion processing. A lot of the processes are
waiting for _xfs_log_force to complete, others are waiting for
inodes to be unpinned or are backed up behind locked inodes that are
waiting on log IO to complete before they can complete the
transaction and unlock the inode, and so on.

Unfortunately, the xfslogd and xfsdatad kernel threads are not
present in any of the output given, so I can't tell if these have
deadlocked themselves and caused the problem. However, my experience
with such pile-ups is that an I/O completion has not been run for
some reason and that is the cause of the problem. I don't know if
you can provide enough information to tell us if this happened or
not. Instead, do you have a test case that you can share?

Cheers,

Dave.
--
Dave Chinner
david@xxxxxxxxxxxxx
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/