Re: regression in page writeback

From: Wu Fengguang
Date: Fri Oct 02 2009 - 04:21:17 EST


On Fri, Oct 02, 2009 at 10:55:02AM +0800, Wu Fengguang wrote:
> On Fri, Oct 02, 2009 at 05:54:38AM +0800, Theodore Ts'o wrote:
> > On Thu, Oct 01, 2009 at 11:14:29PM +0800, Wu Fengguang wrote:
> > > Yes and no. Yes if the queue was empty for the slow device. No if the
> > > queue was full, in which case IO submission speed = IO completion
> > > speed for previously queued requests.
> > >
> > > So wbc.timeout will be accurate for IO submission time, and mostly
> > > accurate for IO completion time. The transient queue fill-up phase
> > > should not be a big problem?
> >
> > So the problem is if we have a mixed workload where there are lots of
> > large contiguous writes, and lots of small writes which are fsync'ed()
> > --- for example, consider the workload of copying lots of big DVD
> > images combined with the infamous firefox-we-must-write-out-300-megs-of-
> > small-random-writes-and-then-fsync-them-on-every-single-url-click-so-
> > that-every-last-visited-page-is-preserved-for-history-bar-autocompletion
> > workload. The big writes, if they are contiguous, could take 1-2 seconds
> > on a very slow, ancient laptop disk, and that will hold up any kind of
> > small synchronous activity --- such as either a disk read or a firefox-
> > triggered fsync().
>
> Yes, that's a problem. The SYNC/ASYNC elevator queues can help here.
>
> In IO submission paths, fsync writes will not be blocked by non-sync
> writes because __filemap_fdatawrite_range() starts foreground sync
> for the inode.
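
For reference, that path looks roughly like this (simplified from
mm/filemap.c of that era; details vary across kernel versions):

	int __filemap_fdatawrite_range(struct address_space *mapping,
				       loff_t start, loff_t end, int sync_mode)
	{
		struct writeback_control wbc = {
			.sync_mode	= sync_mode,	/* WB_SYNC_ALL for fsync */
			.nr_to_write	= LONG_MAX,	/* write everything */
			.range_start	= start,
			.range_end	= end,
		};

		if (!mapping_cap_writeback_dirty(mapping))
			return 0;

		/* submit the inode's dirty pages here, in the caller's
		 * context, instead of waiting for background writeback */
		return do_writepages(mapping, &wbc);
	}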

> Without the congestion backoff, it will now have to compete for the
> request queue with bdi-flush. That should not be a big problem, though.

I'd like to correct this: get_request_wait() uses one request list for
SYNC rw and another for ASYNC rw. So fsync won't have to compete for
the request queue with background flush. That's perfect: when an fsync
comes in, CFQ will give it a green channel and hold back the background
flushes.
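
From memory, the relevant block layer bits look like this (simplified
from block/blk-core.c; exact names may differ between versions):

	struct request_list {
		int count[2];			/* [BLK_RW_ASYNC], [BLK_RW_SYNC] */
		int starved[2];
		mempool_t *rq_pool;
		wait_queue_head_t wait[2];	/* one waitqueue per rw class */
	};

	static struct request *get_request_wait(struct request_queue *q,
						int rw_flags, struct bio *bio)
	{
		const bool is_sync = rw_is_sync(rw_flags) != 0;
		struct request *rq;

		rq = get_request(q, rw_flags, bio, GFP_NOIO);
		while (!rq) {
			DEFINE_WAIT(wait);
			struct request_list *rl = &q->rq;

			/* sleep only on our own rw class, so an fsync (SYNC)
			 * never queues behind an exhausted ASYNC pool */
			prepare_to_wait_exclusive(&rl->wait[is_sync], &wait,
						  TASK_UNINTERRUPTIBLE);
			io_schedule();
			finish_wait(&rl->wait[is_sync], &wait);

			rq = get_request(q, rw_flags, bio, GFP_NOIO);
		}
		return rq;
	}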

> There's still the problem of IO submission time != IO completion time,
> due to fluctuations in randomness and other factors. However that's a
> general and unavoidable problem. Both the wbc.timeout scheme and the
> "wbc.nr_to_write based on estimated throughput" scheme are based on
> _past_ requests, and it's simply impossible to have a 100% accurate
> scheme. In principle, wbc.timeout will only be inferior at IO startup
> time. In the steady state of a 100% full queue, it is actually
> estimating the IO throughput implicitly :)

Another difference between wbc.timeout and adaptive wbc.nr_to_write is:
when many _read_ requests or fsyncs come in, these SYNC rw requests
will significantly lower the ASYNC writeback throughput, if not stall
it completely. So with timeout, the inode will be aborted with only a
few pages written; with nr_to_write, the inode will get a good number
of pages written, at the cost of taking a long time.
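
To make the contrast concrete, here is a sketch of the two abort
conditions inside the per-inode writeback loop (illustrative code only;
wbc.timeout is the field proposed in this thread, not in mainline, and
the helpers are hypothetical):

	unsigned long start = jiffies;

	while (mapping_has_dirty_pages(mapping)) {	/* hypothetical helper */
		write_one_dirty_page(mapping);		/* hypothetical helper */

		/* timeout scheme: abort on elapsed wall time, so when SYNC
		 * requests slow down ASYNC writeback we bail out with only
		 * a few pages submitted */
		if (wbc->timeout &&
		    time_after(jiffies, start + wbc->timeout))
			break;

		/* nr_to_write scheme: abort on pages submitted, so a full
		 * batch always goes out; it just takes longer to complete */
		if (--wbc->nr_to_write <= 0)
			break;
	}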

IMHO the nr_to_write behavior seems more efficient. What do you think?

Thanks,
Fengguang

> > That's why the IO completion time matters; it causes latency problems
> > for slow disks and mixed large and small write workloads. It was the
> > original reason for the 1024 MAX_WRITEBACK_PAGES, which might have
> > made sense 10 years ago back when disks were a lot slower. One of the
> > advantages of an auto-tuning algorithm, beyond auto-adjusting for
> > different types of hardware, is that we don't need to worry about
> > arbitrary and magic caps becoming obsolete due to technological
> > changes. :-)
>
> Yeah, I'm a big fan of auto-tuning :)
>
> Thanks,
> Fengguang