[regression,bisected] 2.6.32.12: find(1) on xfs causes OOM

From: Peter Palfrader
Date: Mon May 03 2010 - 08:01:28 EST


Hi,

I have an XFS filesystem in a KVM domain with 512 MB of memory and 2 GB of
swap.

The filesystem is 750 GB in size, of which some 500 GB are in use in about 6
million files. (This XFS filesystem is exported via NFSv4. I haven't tested
whether this makes any difference.)

Starting with 2.6.32.12, running something like "find | wc -l" on this
filesystem's mountpoint causes the OOM killer to kill off most of the
system. (See kern.log[1]; the kernel config is at [2].)
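
One way to check whether cached XFS inodes are what fills memory during the
find run would be to watch the inode slab cache in /proc/slabinfo. The
following is only a minimal sketch: it assumes the slabinfo 2.x column layout
(name, active_objs, num_objs, objsize, ...) and that the cache is called
"xfs_inode". The struct and function names are made up for illustration.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper types; not part of any kernel or library API. */
struct slab_stat {
	char name[64];
	unsigned long active_objs;	/* objects currently in use */
	unsigned long num_objs;		/* objects allocated in total */
	unsigned long objsize;		/* size of one object in bytes */
};

/*
 * Parse one data line of /proc/slabinfo (assumed layout:
 * "name active_objs num_objs objsize ...").
 * Returns 1 on success, 0 for header or malformed lines.
 */
int parse_slab_line(const char *line, struct slab_stat *st)
{
	assert(line != NULL && st != NULL);
	return sscanf(line, "%63s %lu %lu %lu", st->name,
		      &st->active_objs, &st->num_objs, &st->objsize) == 4;
}

/*
 * Scan the file at 'path' for the cache named 'cache'.
 * Returns 1 and fills *st if found, 0 otherwise.
 * Reading /proc/slabinfo usually requires root.
 */
int read_slab_stat(const char *path, const char *cache, struct slab_stat *st)
{
	char buf[512];
	int found = 0;
	FILE *f = fopen(path, "r");

	if (f == NULL)
		return 0;
	while (fgets(buf, sizeof(buf), f) != NULL) {
		if (parse_slab_line(buf, st) && strcmp(st->name, cache) == 0) {
			found = 1;
			break;
		}
	}
	fclose(f);
	return found;
}
```

Calling read_slab_stat("/proc/slabinfo", "xfs_inode", &st) once a second while
find runs and printing st.num_objs * st.objsize would give a rough upper bound
on the memory held by that cache.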

With 2.6.32.11 the system does not behave like this.

Bisecting between the two releases turned up the following commit. Reverting
it on top of 2.6.32.12 also results in a system that works.

| 9e1e9675fb29c0e94a7c87146138aa2135feba2f is first bad commit
| commit 9e1e9675fb29c0e94a7c87146138aa2135feba2f
| Author: Dave Chinner <david@xxxxxxxxxxxxx>
| Date: Fri Mar 12 09:42:10 2010 +1100
|
| xfs: reclaim all inodes by background tree walks
|
| commit 57817c68229984818fea9e614d6f95249c3fb098 upstream
|
| We cannot do direct inode reclaim without taking the flush lock to
| ensure that we do not reclaim an inode under IO. We check the inode
| is clean before doing direct reclaim, but this is not good enough
| because the inode flush code marks the inode clean once it has
| copied the in-core dirty state to the backing buffer.
|
| It is the flush lock that determines whether the inode is still
| under IO, even though it is marked clean, and the inode is still
| required at IO completion so we can't reclaim it even though it is
| clean in core. Hence the requirement that we need to take the flush
| lock even on clean inodes because this guarantees that the inode
| writeback IO has completed and it is safe to reclaim the inode.
|
| With delayed write inode flushing, we could end up waiting a long
| time on the flush lock even for a clean inode. The background
| reclaim already handles this efficiently, so avoid all the problems
| by killing the direct reclaim path altogether.
|
| Signed-off-by: Dave Chinner <david@xxxxxxxxxxxxx>
| Reviewed-by: Christoph Hellwig <hch@xxxxxx>
| Signed-off-by: Alex Elder <aelder@xxxxxxx>
| Signed-off-by: Greg Kroah-Hartman <gregkh@xxxxxxx>
|
| diff --git a/fs/xfs/linux-2.6/xfs_super.c b/fs/xfs/linux-2.6/xfs_super.c
| index a82a93d..ea7a59a 100644
| --- a/fs/xfs/linux-2.6/xfs_super.c
| +++ b/fs/xfs/linux-2.6/xfs_super.c
| @@ -953,16 +953,14 @@ xfs_fs_destroy_inode(
|  	ASSERT_ALWAYS(!xfs_iflags_test(ip, XFS_IRECLAIM));
|
|  	/*
| -	 * If we have nothing to flush with this inode then complete the
| -	 * teardown now, otherwise delay the flush operation.
| +	 * We always use background reclaim here because even if the
| +	 * inode is clean, it still may be under IO and hence we have
| +	 * to take the flush lock. The background reclaim path handles
| +	 * this more efficiently than we can here, so simply let background
| +	 * reclaim tear down all inodes.
|  	 */
| -	if (!xfs_inode_clean(ip)) {
| -		xfs_inode_set_reclaim_tag(ip);
| -		return;
| -	}
| -
|  out_reclaim:
| -	xfs_ireclaim(ip);
| +	xfs_inode_set_reclaim_tag(ip);
|  }
|
|  /*
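
To make the behavioural difference in the hunk above concrete, here is a toy
userspace model of the two destroy paths. It is not kernel code and all names
in it are hypothetical; it only mirrors the bookkeeping visible in the diff
(free clean inodes at once vs. always tag for background reclaim). It makes
no claim about whether this is what actually causes the OOM.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for the set of in-core inodes. Hypothetical, not kernel code. */
struct toy_cache {
	size_t in_memory;	/* inode structures still allocated */
	size_t tagged;		/* of those, tagged for background reclaim */
};

/*
 * Old path, per the removed lines: a dirty inode is tagged for later,
 * a clean inode is freed immediately.
 */
void destroy_old(struct toy_cache *c, int clean)
{
	assert(c->tagged < c->in_memory);
	if (!clean) {
		c->tagged++;
		return;
	}
	c->in_memory--;
}

/* New path, per the added line: every inode is tagged, none freed here. */
void destroy_new(struct toy_cache *c)
{
	assert(c->tagged < c->in_memory);
	c->tagged++;
}

/*
 * One pass of a background walker that frees at most 'batch' tagged
 * inodes. Returns the number actually freed.
 */
size_t background_reclaim(struct toy_cache *c, size_t batch)
{
	size_t n = c->tagged < batch ? c->tagged : batch;

	c->tagged -= n;
	c->in_memory -= n;
	return n;
}
```

In this model, destroying 1000 clean inodes through destroy_old() leaves
nothing allocated, while the same through destroy_new() leaves all 1000
allocated until background_reclaim() has run often enough.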


Cheers,
Peter

1. http://asteria.noreply.org/~weasel/volatile/2010-05-03-Aju29kSrm0A/kern.log
2. http://asteria.noreply.org/~weasel/volatile/2010-05-03-Aju29kSrm0A/config-2.6.32.12-dsa-amd64
--
                           |  .''`.  ** Debian GNU/Linux **
      Peter Palfrader      | : :' :      The  universal
 http://www.palfrader.org/ | `. `'      Operating System
                           |   `-    http://www.debian.org/