Re: [PATCH] mm: disallow direct reclaim page writeback

From: tytso
Date: Thu Apr 15 2010 - 13:51:58 EST


On Thu, Apr 15, 2010 at 09:42:17AM -0400, Chris Mason wrote:
> I'd like to add one more:
>
> 5. Don't dive into filesystem locks during reclaim.
>
> This is different from splicing code paths together, but
> the filesystem writepage code has become the center of our attempts at
> doing big fat contiguous writes on disk. We push off work as late as we
> can until just before the pages go down to disk.
>
> I'll pick on ext4 and btrfs for a minute, just to broaden the scope
> outside of XFS. Writepage comes along and the filesystem needs to
> actually find blocks on disk for all the dirty pages it has promised to
> write.
>
> So, we start a transaction, we take various allocator locks, modify
> different metadata, log changed blocks, take a break (logging is hard
> work you know, need_resched() triggered a by now), stuff it
> all into the file's metadata, log that, and finally return.
>
> Each of the steps above can block for a long time. Ext4 solves
> this by not doing them. ext4_writepage only writes pages that
> are already fully allocated on disk.
>
> Btrfs is much more efficient at not doing them, it just returns right
> away for PF_MEMALLOC.

This is a real problem, BTW. One of the problems we've been fighting
inside Google is because ext4_writepage() refuses to write pages that
are subject to delayed allocation, it can cause the OOM killer to get
invoked.

I had thought this was because of some evil games we're playing for
container support that makes zones small, but just last night at the
LF Collaboration Summit reception, I ran into a technologist from a
major financial industry customer reported to me that when they tried
using ext4, they ran into the exact same problem because they were
running Oracle which was pinning down 3 gigs of memory, and then when
they tried writing a very big file using ext4, they had the same
problem of writepage() not being able to reclaim enough pages, so the
kernel fell back to invoking the OOM killer, and things got ugly in a
hurry...

One of the things I was proposing internally to try as a long-term
we-gotta-fix writeback is that we need some kind of signal so that we
can do the lumpy reclaim (a) in a separate process, to avoid a lock
inversion problem and the gee-its-going-to-take-a-long-time problem
which Chris Mentioned, and (b) to try to cluster I/O so that we're not
dribbling out writes to the disk in small, seeky, 4k writes, which is
really a disaster from a performance standpoint. Maybe the VM guys
don't care about this, but this sort of things tends to get us
filesystem guys all up in a lather not just because of the really
sucky performance, but also because it tends to mean that the system
can thrash itself to death in low memory situations.

- Ted
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/