Re: regression in page writeback
From: Wu Fengguang
Date: Tue Oct 06 2009 - 09:20:19 EST
On Tue, Oct 06, 2009 at 08:55:19PM +0800, Jan Kara wrote:
> On Fri 02-10-09 11:27:14, Wu Fengguang wrote:
> > On Fri, Oct 02, 2009 at 06:17:39AM +0800, Jan Kara wrote:
> > > On Wed 30-09-09 13:32:23, Wu Fengguang wrote:
> > > > writeback: bump up writeback chunk size to 128MB
> > > >
> > > > Adjust the writeback call stack to support larger writeback chunk size.
> > > >
> > > > - make wbc.nr_to_write a per-file parameter
> > > > - init wbc.nr_to_write with MAX_WRITEBACK_PAGES=128MB
> > > > (proposed by Ted)
> > > > - add wbc.nr_segments to limit seeks inside sparsely dirtied file
> > > > (proposed by Chris)
> > > > - add wbc.timeout which will be used to control IO submission time
> > > > either per-file or globally.
> > > >
> > > > The wbc.nr_segments is now determined purely by logical page index
> > > > distance: if two pages are 1MB apart, it makes a new segment.
> > > >
> > > > Filesystems could do this better with real extent knowledge.
> > > > One possible scheme is to record the previous page index in
> > > > wbc.writeback_index, and let ->writepage compare if the current and
> > > > previous pages lie in the same extent, and decrease wbc.nr_segments
> > > > accordingly. Care should be taken to avoid double decreases in writepage
> > > > and write_cache_pages.
> > > >
> > > > The wbc.timeout (when used per-file) is mainly a safeguard against slow
> > > > devices, which may take too long to sync 128MB of data.
> > > >
> > > > The wbc.timeout (when used globally) could be useful when we decide to
> > > > do two sync scans on dirty pages and dirty metadata. XFS could say:
> > > > please return to sync dirty metadata after 10s. Would need another
> > > > b_io_metadata queue, but that's possible.
> > > >
> > > > This work depends on the balance_dirty_pages() wait queue patch.
> > > I don't know, I think it gets too complicated... I'd either use the
> > > segments idea or the timeout idea but not both (unless you can find real
> > > world tests in which both help).
> I'm sorry for a delayed reply but I had to work on something else.
>
> > Maybe complicated, but nr_segments and timeout each have their target
> > application. nr_segments serves two major purposes:
> > - fairness between two large files, one is continuously dirtied,
> > another is sparsely dirtied. Given the same amount of dirty pages,
> > it could take vastly different time to sync them to the _same_
> > device. The nr_segments check helps to favor continuous data.
> > - avoid seeks/fragmentation. To give each file a fair chance of
> > writeback, we have to abort a file when some nr_to_write or timeout
> > is reached. However, neither is a good abort condition.
> > The best is for filesystem to abort earlier in seek boundaries,
> > and treat nr_to_write/timeout as large enough bottom lines.
> > timeout is mainly a safeguard in case nr_to_write is too large for
> > slow devices. It is not necessary if nr_to_write is auto-computed,
> > however timeout in itself serves as a simple throughput adapting
> > scheme.
> I understand why you have introduced both segments and timeout value
> and I completely agree with your reasons to introduce them. I just think
> that when the system gets too complex (there will be several independent
> methods of determining when writeback should be terminated, and even
> though each method is simple on its own, their interactions needn't be
> simple...) it will be hard to debug all the corner cases - even more
> because they will manifest "just" by slow or unfair writeback. So I'd
I definitely agree on the complications. There are some known issues
as well as possibly some corner cases to be discovered. One problem I
noticed now is, what if all the files are sparsely dirtied? Then
a small nr_segments can only hurt. Another problem is that the block
device file tends to have sparsely dirtied pages (with metadata on
them). Not sure how to detect/handle such conditions.
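To make the sparse-file concern concrete, here is a minimal user-space sketch of the index-distance rule from the patch description (two dirty pages 1MB apart start a new segment). It is not the kernel code: the function names, the 4KB page size, and the "first page opens segment 1" convention are all assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Assumption: 4KB pages, so a 1MB gap equals 256 page indices. */
#define SEGMENT_GAP_PAGES 256UL

/*
 * Count segments in an ascending list of dirty page indices.
 * A new segment starts whenever the distance to the previous dirty
 * page is at least gap_pages. (Hypothetical helper, not a kernel API.)
 */
static size_t count_segments(const unsigned long *idx, size_t n,
                             unsigned long gap_pages)
{
    size_t segments, i;

    if (n == 0)
        return 0;
    segments = 1;                        /* first page opens segment 1 */
    for (i = 1; i < n; i++) {
        assert(idx[i] > idx[i - 1]);     /* input must be sorted, unique */
        if (idx[i] - idx[i - 1] >= gap_pages)
            segments++;
    }
    return segments;
}

/*
 * Return how many pages get written before a budget of nr_segments
 * is exhausted, i.e. before a page would open segment nr_segments + 1.
 */
static size_t pages_before_abort(const unsigned long *idx, size_t n,
                                 unsigned long gap_pages, size_t nr_segments)
{
    size_t used, i;

    if (n == 0 || nr_segments == 0)
        return 0;
    used = 1;
    for (i = 1; i < n; i++) {
        if (idx[i] - idx[i - 1] >= gap_pages) {
            if (used == nr_segments)
                return i;                /* budget exhausted: abort here */
            used++;
        }
    }
    return n;
}
```

With a budget of 2 segments, four contiguous dirty pages are all written, while four dirty pages spaced 1000 indices apart are cut off after 2 pages. If every file looks like the second case, the small budget only adds inode switches without saving any seeks.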
> prefer a single metric to determine when to stop writeback of an inode
> even though it might be a bit more complicated.
> For example terminating on writeout does not really give a file a fair
> chance of writeback because it might have been blocked just because we were
> writing some heavily fragmented file just before. And your nr_segments
You mean timeout? I've dropped that idea in favor of an nr_to_write
adaptive to the bdi write speed :)
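One possible reading of "nr_to_write adaptive to the bdi write speed" is to size the chunk so that it takes roughly a fixed time to write at the measured bandwidth. The sketch below is only that reading, in user-space C: the function name, the target-time parameter, and the clamp bounds are assumptions, and nothing here reflects how the bandwidth would actually be estimated.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: 4KB pages. */
#define PAGE_SIZE_BYTES 4096ULL

/*
 * Hypothetical helper: pick a per-file page budget that should take
 * about target_ms to write at write_bw_bytes_per_sec, clamped to
 * [min_pages, max_pages].
 */
static uint64_t adaptive_nr_to_write(uint64_t write_bw_bytes_per_sec,
                                     uint64_t target_ms,
                                     uint64_t min_pages, uint64_t max_pages)
{
    uint64_t pages;

    assert(min_pages <= max_pages);
    /* pages per second, then scaled to the target time window */
    pages = write_bw_bytes_per_sec / PAGE_SIZE_BYTES * target_ms / 1000;
    if (pages < min_pages)
        pages = min_pages;
    if (pages > max_pages)
        pages = max_pages;
    return pages;
}
```

For example, with a 1s target, a 4MB floor (1024 pages) and the 128MB ceiling (32768 pages), a 100MB/s device gets 25600 pages, a 1MB/s device is held at the floor, and a 1GB/s device is capped at the ceiling. A slow device then never has to sync a full 128MB chunk, which is the case the per-file timeout was meant to guard against.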
> check is just a rough guess of whether a writeback is going to be
> fragmented or not.
It could be made accurate if btrfs decreases it in its own writepages,
based on the extent info. Should also be possible for ext4.
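A rough user-space sketch of what extent-based accounting could look like: the segment budget is charged only when the current page falls in a different extent than the previously written page. The struct extent layout, the helper names, and the rule that an unmapped page always costs a segment are assumptions for illustration, not btrfs or ext4 code.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical extent descriptor, in units of pages. */
struct extent {
    unsigned long start;   /* first logical page index in the extent */
    unsigned long len;     /* number of pages in the extent */
};

/* Return the index of the extent containing page, or -1 if unmapped. */
static long find_extent(const struct extent *ext, size_t n,
                        unsigned long page)
{
    size_t i;

    for (i = 0; i < n; i++)
        if (page >= ext[i].start && page - ext[i].start < ext[i].len)
            return (long)i;
    return -1;
}

/*
 * Return the segment budget left after writing page.
 * The first page written (have_prev == 0) opens a segment; later pages
 * cost a segment only if they are not in the same extent as prev.
 */
static long account_page(const struct extent *ext, size_t n,
                         int have_prev, unsigned long prev,
                         unsigned long page, long nr_segments)
{
    long cur = find_extent(ext, n, page);

    if (!have_prev)
        return nr_segments - 1;
    if (cur >= 0 && cur == find_extent(ext, n, prev))
        return nr_segments;          /* same extent: no new seek expected */
    return nr_segments - 1;          /* extent changed or page unmapped */
}
```

With one extent covering pages 0..999, pages 0 and 900 are more than 1MB apart by index yet cost only one segment, whereas the pure index-distance rule would count two. Whichever layer does this accounting should be the only one decrementing the budget for a given page, per the double-decrease caveat in the patch description.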
> So I'd rather implement in mpage_ functions a proper detection of how
> fragmented the writeback is and give each inode a limit on number of
> fragments which mpage_ functions would obey. We could even use a queue's
> NONROT flag (set for solid state disks) to detect whether we should expect
> higher or lower seek times.
Yes, mpage_* can also utilize nr_segments.
Anyway nr_segments is not perfect, I'll post a patch and let fs
developers decide whether it is convenient/useful :)
Thanks,
Fengguang