Re: [PATCH] Improve buffered streaming write ordering

From: Dave Chinner
Date: Fri Oct 10 2008 - 01:13:58 EST


On Thu, Oct 09, 2008 at 11:11:20AM -0400, Chris Mason wrote:
> On Fri, 2008-10-03 at 09:43 +1000, Dave Chinner wrote:
> > On Thu, Oct 02, 2008 at 11:48:56PM +0530, Aneesh Kumar K.V wrote:
> > > On Thu, Oct 02, 2008 at 08:20:54AM -0400, Chris Mason wrote:
> > > > On Wed, 2008-10-01 at 21:52 -0700, Andrew Morton wrote:
> > > > For a 4.5GB streaming buffered write, this printk inside
> > > > ext4_da_writepage shows up 372,429 times in /var/log/messages.
> > > >
> > >
> > > Part of that can happen due to shrink_page_list -> pageout -> writepage
> > > call back with lots of unallocated buffer_heads (blocks).
> >
> > Quite frankly, a simple streaming buffered write should *never*
> > trigger writeback from the LRU in memory reclaim. That indicates
> > that some feedback loop has broken down and we are not cleaning
> > pages fast enough or perhaps in the correct order. Page reclaim in
> > this case should be reclaiming clean pages (those that have already
> > been written back), not writing back random dirty pages.
>
> Here are some go faster stripes for the XFS buffered writeback. This
> patch has a lot of debatable features to it, but the idea is to show
> which knobs are slowing us down today.
>
> The first change is to avoid calling balance_dirty_pages_ratelimited on
> every page. When we know we're doing a largeish write it makes more
> sense to balance things less often. This might just mean our
> ratelimit_pages magic value is too small.

Ok, so how about doing something like this to reduce the
number of balance calls on large writes, while still making at least
one balance call for every write that occurs:

int nr = 0;
.....
while (...) {
	....
	/* balance on the first page, then once every 256 pages */
	if (!(nr % 256)) {
		/* do balance */
	}
	nr++;
	....
}

That way you get a balance on the first page on every write,
but then hold off balancing on that write again for some
number of pages.

> The second change makes xfs bump wbc->nr_to_write (suggested by
> Christoph), which probably makes delalloc go in bigger chunks.

Hmmmm. Reasonable theory. We used to do gigantic delalloc extents -
we paid no attention to congestion and could allocate and write
several GB at a time. Latency was an issue, though, so it got
changed to be bound by nr_to_write.

I guess we need to be issuing larger allocations. Can you remove
your patches and see what effect using the allocsize mount
option has on throughput? This changes the default delalloc EOF
preallocation size, which means larger but fewer allocations. The
default is 64k and it can go as high as 1GB, IIRC.
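For reference, the larger EOF preallocation would be requested at mount
time something like this (the device and mountpoint are placeholders;
allocsize takes power-of-two sizes):

```shell
# Hypothetical example: mount an XFS filesystem with 1GB speculative
# EOF preallocation for delayed allocation. Device and mountpoint
# are placeholders for your setup.
mount -t xfs -o allocsize=1g /dev/sdX1 /mnt/scratch
```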

Cheers,

Dave.
--
Dave Chinner
david@xxxxxxxxxxxxx
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/