Re: Hung task - sync - 2.6.33-rc7 w/md6 multicore rebuild in process

From: Michael Breuer
Date: Fri Feb 19 2010 - 00:31:15 EST


On 2/18/2010 11:02 PM, Dave Chinner wrote:
> On Thu, Feb 18, 2010 at 09:31:41PM -0500, Michael Breuer wrote:
>> On 2/18/2010 8:43 PM, Dave Chinner wrote:

>>> This is probably where the barrier IOs are coming from. With a RAID
>>> resync going on (so all IO is going to be slow to begin with) and
>>> writeback causing barriers to be issued (which are really slow on
>>> software RAID5/6), having sync take so long is not out of the
>>> question if you have lots of dirty inodes to write back. A kernel
>>> compile will generate lots of dirty inodes.

>>> Even taking the barrier IOs out of the question, I've seen reports
>>> of sync or unmount taking over 10 hours to complete on software
>>> RAID5 because there were hundreds of thousands of dirty inodes to
>>> write back and each inode being written back caused a synchronous
>>> RAID5 RMW cycle to occur. Hence writeback could only clean 50
>>> inodes/sec, because as soon as RAID5/6 devices start doing RMW
>>> cycles they go slower than single spindle devices. This sounds very
>>> similar to what you are seeing here, i.e. the reports don't indicate
>>> to me that there is a bug in the writeback code, just that your disk
>>> subsystem has very, very low throughput in these conditions....
>> Probably true... and the system does recover. The only thing I'd
>> point out is that the subsystem isn't (or perhaps shouldn't be) this
>> sluggish. I hypothesize that the low throughput under these
>> conditions is a result of:
>> 1) multicore raid support (pushing the resync at higher rates)
> Possibly, though barrier support for RAID5/6 is shiny new as well.
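One way I can try to separate the resync pressure from the writeback
cost is to throttle the resync and rerun the test, something like the
below (these are the stock md speed-limit knobs; 5000 is just a guess
at "slow enough to matter"):

  # check resync progress and current rate
  cat /proc/mdstat
  # clamp the resync rate (KB/s) while testing, then restore the default
  echo 5000 > /proc/sys/dev/raid/speed_limit_max
  # ... repeat the grep + kernel build + sync sequence ...
  echo 200000 > /proc/sys/dev/raid/speed_limit_max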

>> 2) time spent in fs cache reclaim. The sync slowdown only occurs
>> when fs cache is in heavy (10GB) use.
> Not surprising ;)
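For reference, this is roughly how I'm gauging "heavy fs cache use",
nothing fancy, just the usual proc files (the inode slab names vary by
filesystem, so the grep pattern is only approximate):

  grep -e "^Cached:" -e "^Dirty:" -e "^Writeback:" /proc/meminfo
  grep -e dentry -e inode /proc/slabinfo
  slabtop -o | head -20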

>> I actually could not recreate the issue until I did a
>> "grep -R foo /usr/ > /dev/null" to force high fs cache utilization.
>> For what it's worth, two kernel rebuilds (many dirty inodes) and then
>> a sync with about 12MB dirty (/proc/meminfo) didn't cause an issue.
>> The issue only happens when fs cache is heavily used. I also never
>> saw this before enabling multicore raid.
"grep -R foo /usr/" will dirty every inode that touchs (atime) and
they have to be written back out. That's almost certainly creating
more dirty inodes than a kernel build - there are about 400,000
inodes under /usr on my system. That would be enough to trigger very
long sync times if inode writeback is slow.
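That squares with the arithmetic, for what it's worth (plain shell
arithmetic, plugging in your ~50 inodes/sec and ~400,000 inode figures):

  $ echo $(( 400000 / 50 ))      # seconds to write back 400k inodes at 50/sec
  8000
  $ echo $(( 400000 / 50 / 60 )) # i.e. a bit over two hours
  133

So the 10+ hour reports would imply well over a million dirty inodes
at that rate.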

> Cheers,
>
> Dave.

My filesystems are mounted relatime. Just confirmed that dirty pages
don't climb all that much with the grep -R foo /usr > /dev/null. The
only apparent impact is to fs cache.
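For reference, this is roughly how I checked (paths are just my setup,
and the find itself churns the cache a little):

  grep relatime /proc/mounts   # everything here is mounted relatime
  find /usr | wc -l            # rough inode count under /usr (paths, so approximate)
  # and watching Dirty:/Writeback: in /proc/meminfo while the grep -R runs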