Re: [PATCH 5/5] writeback: introduce writeback_control.more_io to indicate more io

From: Fengguang Wu
Date: Thu Oct 04 2007 - 23:37:12 EST


On Thu, Oct 04, 2007 at 03:03:44PM +1000, David Chinner wrote:
> On Thu, Oct 04, 2007 at 10:21:33AM +0800, Fengguang Wu wrote:
> > On Wed, Oct 03, 2007 at 12:41:19PM +1000, David Chinner wrote:
> > > On Wed, Oct 03, 2007 at 09:34:39AM +0800, Fengguang Wu wrote:
> > > > On Wed, Oct 03, 2007 at 07:47:45AM +1000, David Chinner wrote:
> > > > > On Tue, Oct 02, 2007 at 04:41:48PM +0800, Fengguang Wu wrote:
> > > > > > 		wbc.pages_skipped = 0;
> > > > > > @@ -560,8 +561,9 @@ static void background_writeout(unsigned
> > > > > > 		min_pages -= MAX_WRITEBACK_PAGES - wbc.nr_to_write;
> > > > > > 		if (wbc.nr_to_write > 0 || wbc.pages_skipped > 0) {
> > > > > > 			/* Wrote less than expected */
> > > > > > -			congestion_wait(WRITE, HZ/10);
> > > > > > -			if (!wbc.encountered_congestion)
> > > > > > +			if (wbc.encountered_congestion || wbc.more_io)
> > > > > > +				congestion_wait(WRITE, HZ/10);
> > > > > > +			else
> > > > > > 				break;
> > > > > > 		}
> > > > >
> > > > > Why do you call congestion_wait() if there is more I/O to issue? If
> > > > > we have a fast filesystem, this might cause the device queues to
> > > > > fill, then drain on congestion_wait(), then fill again, etc. i.e. we
> > > > > will have trouble keeping the queues full, right?
> > > >
> > > > You mean slow writers and fast RAID? That would be exactly the case
> > > > these patches try to improve.
> > >
> > > I mean any writers and a fast block device (raid or otherwise).
> > >
> > > > This patchset makes kupdate/background writeback more responsible,
> > > > so that if (avg-write-speed < device-capabilities), the dirty data
> > > > are synced in a timely manner and we don't have to fall back to
> > > > balance_dirty_pages().
> > >
> > > Sure, but I'm asking about the effect of the patches on the
> > > (avg-write-speed == device-capabilities) case. I agree that
> > > they are necessary for timely syncing of data but I'm trying
> > > to understand what effect they have on the normal write case
> > > (i.e. keeping the disk at full write throughput).
> >
> > OK, I guess this is the focus of all your questions: why should we
> > sleep in congestion_wait() and possibly hurt the write throughput?
> > I'll try to summarize it:
> >
> > - congestion_wait() is necessary
> > Besides device congestion, there may be other blockades we have to
> > wait on, e.g. temporary page locks, NFS/journal issues (I guess).
>
> We skip locked pages in writeback, and if some filesystems have
> blocking issues that require non-blocking writeback to wait for some
> I/O to complete before re-entering writeback, then perhaps they
> should be setting wbc->encountered_congestion to tell writeback to
> back off.

We have wbc->pages_skipped for that :-)

> The question I'm asking is: if more_io tells us we have more
> work to do, why do we have to sleep first if the block dev is
> able to take more I/O?

See below.

> >
> > - congestion_wait() is called only when necessary
> > congestion_wait() will only be called when we have seen blockades:
> >         if (wbc.nr_to_write > 0 || wbc.pages_skipped > 0) {
> >                 congestion_wait(WRITE, HZ/10);
> >         }
> > So in the normal case, it may well write 128MB of data without any waiting.
>
> Sure, but wbc.more_io doesn't indicate a blockade - just that there
> is more work to do, right?

It's not wbc.more_io but the context (wbc.pages_skipped > 0) that
indicates a blockade:

        if (wbc.nr_to_write > 0 || wbc.pages_skipped > 0) {    /* all-written or blockade... */
                if (wbc.encountered_congestion || wbc.more_io) /* blockade! */
                        congestion_wait(WRITE, HZ/10);
                else                                           /* all-written! */
                        break;
        }

We can also read the whole background_writeout() logic as

        while (!done) {
                /* sync _all_ sync-able data */
                congestion_wait(WRITE, HZ/10);  /* i.e. sleep up to 100ms */
        }

And an example run could be:

sync 1000MB, skipped 100MB
congestion_wait(100ms);
sync 100MB, skipped 10MB
congestion_wait(100ms);
sync 10MB, all done

Note that it's far from "wait 100ms for every 4MB" (which is merely
the worst possible case).
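
To make that concrete, here is a toy userspace model (my own sketch;
the 10%-skipped-per-pass rate is an arbitrary assumption, not a
measurement):

        /* Model of the background_writeout() retry loop: each pass syncs
         * whatever is not blocked, and we sleep only between passes, so
         * the total sleep grows with the number of passes, not with the
         * amount of dirty data.
         */
        #include <stdio.h>

        int main(void)
        {
                int dirty_mb = 1000;    /* initial dirty data */
                int passes = 0, slept_ms = 0;

                while (dirty_mb > 0) {
                        dirty_mb /= 10; /* ~90% synced, ~10% skipped */
                        passes++;
                        if (dirty_mb)   /* more to do: back off once */
                                slept_ms += 100; /* congestion_wait(WRITE, HZ/10) */
                }
                printf("%d passes, %d ms total sleep\n", passes, slept_ms);
                return 0;
        }

It prints "4 passes, 300 ms total sleep", versus the 250 sleeps that
"wait 100ms for every 4MB" would imply for 1000MB.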

> > - congestion_wait() won't hurt write throughput
> > When not congested, congestion_wait() will be woken up on each write
> > completion.
>
> What happens if the I/O we issued has already completed before we
> got back up to the congestion_wait() call? We'll spend 100ms
> sleeping when we shouldn't have and throughput goes down by 10% on
> every occurrence....

Ah, I had not thought of that case. Maybe we could do:

        if (wbc.more_io)
                congestion_wait(WRITE, 1);

It's at least 10 times better.
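
In context, the tail of background_writeout() might then look like
(just a sketch combining the hunk quoted above with the shorter sleep;
untested):

        if (wbc.nr_to_write > 0 || wbc.pages_skipped > 0) {
                /* Wrote less than expected */
                if (wbc.encountered_congestion)
                        congestion_wait(WRITE, HZ/10); /* congested: back off ~100ms */
                else if (wbc.more_io)
                        congestion_wait(WRITE, 1);     /* more work, no congestion: 1 jiffy */
                else
                        break;                         /* all written */
        }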

> If we've got more work to do, then we should do it without an
> arbitrary, non-deterministic delay being inserted. If the delay is
> needed to prevent the system from "going mad" (whatever that means),
> then what's the explanation for the system "going mad"?

"going mad" means "busy waiting".

> > Note that MAX_WRITEBACK_PAGES=1024 and
> > /sys/block/sda/queue/max_sectors_kb=512 (for me),
> > which means we are given the chance to sync 4MB for every 512KB
> > written, i.e. we are able to submit write I/Os 8 times faster than the
> > device capability. congestion_wait() is a magical timer :-)
>
> So, with Jens Axboe's sglist chaining, that single I/O could now
> be up to 32MB on some hardware. IOWs, we push 1024 pages, and that
> could end up as a single I/O being issued to disk.
>
> Your magic just broke. :/

Hmm, congestion_wait(WRITE, 1) could re-establish the balance ;-)
That's a wait of at most ~10ms at HZ=100.
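
(In general congestion_wait(WRITE, n) sleeps for at most n jiffies,
i.e. n/HZ seconds: one jiffy is 10ms at HZ=100, 4ms at HZ=250, 1ms at
HZ=1000. And when the device is not congested the sleeper is woken
early on write completion, as noted above, so the full timeout is only
the worst case.)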

> > > > So for your question of queue depth, the answer is: the queue length
> > > > will not build up in the first place.
> > >
> > > Which queue are you talking about here? The block device queue?
> >
> > Yes, the elevator's queues.
>
> I think this is the wrong thing to be doing and is detrimental
> to I/O performance because it will reduce elevator efficiency.
>
> The elevator can only work efficiently if we allow the queues to
> build up. The deeper the queue, the better the elevator can sort the
> I/O requests and keep the device at maximum efficiency. If we don't
> push enough I/O into the queues then we miss opportunities to combine
> adjacent I/Os and reduce the seek load of writeback. Also, a shallow
> queue will run dry if we don't get back to it in time, which is
> possible if we wait for I/O to complete before we go and flush
> more....

Sure, the queues should be filled as fast as possible.
How fast can we fill the queue? Let's measure it:

// generated by the debug patch below

[ 871.430700] mm/page-writeback.c 668 wb_kupdate: pdflush(202) 54289 global 29911 0 0 wc _M tw -12 sk 0
[ 871.444718] mm/page-writeback.c 668 wb_kupdate: pdflush(202) 53253 global 28857 0 0 wc _M tw -12 sk 0
[ 871.458764] mm/page-writeback.c 668 wb_kupdate: pdflush(202) 52217 global 27834 0 0 wc _M tw -12 sk 0
[ 871.472797] mm/page-writeback.c 668 wb_kupdate: pdflush(202) 51181 global 26780 0 0 wc _M tw -12 sk 0
[ 871.486825] mm/page-writeback.c 668 wb_kupdate: pdflush(202) 50145 global 25757 0 0 wc _M tw -12 sk 0
[ 871.500857] mm/page-writeback.c 668 wb_kupdate: pdflush(202) 49109 global 24734 0 0 wc _M tw -12 sk 0
[ 871.514864] mm/page-writeback.c 668 wb_kupdate: pdflush(202) 48073 global 23680 0 0 wc _M tw -12 sk 0
[ 871.528889] mm/page-writeback.c 668 wb_kupdate: pdflush(202) 47037 global 22657 0 0 wc _M tw -12 sk 0
[ 871.542894] mm/page-writeback.c 668 wb_kupdate: pdflush(202) 46001 global 21603 0 0 wc _M tw -12 sk 0
[ 871.556927] mm/page-writeback.c 668 wb_kupdate: pdflush(202) 44965 global 20580 0 0 wc _M tw -12 sk 0
[ 871.570961] mm/page-writeback.c 668 wb_kupdate: pdflush(202) 43929 global 19557 0 0 wc _M tw -12 sk 0
[ 871.584992] mm/page-writeback.c 668 wb_kupdate: pdflush(202) 42893 global 18503 0 0 wc _M tw -12 sk 0
[ 871.599005] mm/page-writeback.c 668 wb_kupdate: pdflush(202) 41857 global 17480 0 0 wc _M tw -12 sk 0
[ 871.613027] mm/page-writeback.c 668 wb_kupdate: pdflush(202) 40821 global 16426 0 0 wc _M tw -12 sk 0
[ 871.628626] mm/page-writeback.c 668 wb_kupdate: pdflush(202) 39785 global 15403 961 0 wc _M tw -12 sk 0
[ 871.644439] mm/page-writeback.c 668 wb_kupdate: pdflush(202) 38749 global 14380 1550 0 wc _M tw -12 sk 0
[ 871.660267] mm/page-writeback.c 668 wb_kupdate: pdflush(202) 37713 global 13326 2573 0 wc _M tw -12 sk 0
[ 871.676236] mm/page-writeback.c 668 wb_kupdate: pdflush(202) 36677 global 12303 3224 0 wc _M tw -12 sk 0
[ 871.692021] mm/page-writeback.c 668 wb_kupdate: pdflush(202) 35641 global 11280 4154 0 wc _M tw -12 sk 0
[ 871.707824] mm/page-writeback.c 668 wb_kupdate: pdflush(202) 34605 global 10226 4929 0 wc _M tw -12 sk 0
[ 871.723638] mm/page-writeback.c 668 wb_kupdate: pdflush(202) 33569 global 9203 5735 0 wc _M tw -12 sk 0
[ 871.739708] mm/page-writeback.c 668 wb_kupdate: pdflush(202) 32533 global 8149 6603 0 wc _M tw -12 sk 0
[ 871.756407] mm/page-writeback.c 668 wb_kupdate: pdflush(202) 31497 global 7126 7409 0 wc _M tw -12 sk 0
[ 871.772165] mm/page-writeback.c 668 wb_kupdate: pdflush(202) 30461 global 6103 8246 0 wc _M tw -12 sk 0
[ 871.788035] mm/page-writeback.c 668 wb_kupdate: pdflush(202) 29425 global 5049 9052 0 wc _M tw -12 sk 0
[ 871.803896] mm/page-writeback.c 668 wb_kupdate: pdflush(202) 28389 global 4026 9982 0 wc _M tw -12 sk 0
[ 871.820427] mm/page-writeback.c 668 wb_kupdate: pdflush(202) 27353 global 2972 10757 0 wc _M tw -12 sk 0
[ 871.836728] mm/page-writeback.c 668 wb_kupdate: pdflush(202) 26317 global 1949 11656 0 wc _M tw -12 sk 0
[ 871.853286] mm/page-writeback.c 668 wb_kupdate: pdflush(202) 25281 global 895 12431 0 wc _M tw -12 sk 0
[ 871.868762] mm/page-writeback.c 668 wb_kupdate: pdflush(202) 24245 global 58 13051 0 wc __ tw 168 sk 0

It's an Intel Core 2 2.93GHz CPU and a SATA disk. (Per the debug patch
below, each trace line shows a countdown of pages left to write, then
the global dirty/writeback/unstable page counts, the
congestion(C)/more_io(M) flags under "wc", wbc.nr_to_write under "tw"
and wbc.pages_skipped under "sk".)

The trace shows that:
- no congestion_wait() is ever reached in wb_kupdate()
- it takes wb_kupdate() ~15ms to sync every 4MB

So I guess congestion_wait(WRITE, 1) will be more than enough.

However, wb_kupdate() is syncing the data a bit slowly (4*1000/15 = 266MB/s);
could it be because of a lot of cond_resched() calls?
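
For the record, the ~15ms per 4MB figure is read straight off the
trace deltas above:

        871.444718s - 871.430700s ~= 14ms per iteration
        54289 - 53253 = 1036 pages ~= 4MB per iteration
        => roughly 4MB every 14-15ms, i.e. ~266MB/s of submission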


Fengguang
---
mm/page-writeback.c | 23 +++++++++++++++++++++++
1 file changed, 23 insertions(+)

--- linux-2.6.23-rc8-mm2.orig/mm/page-writeback.c
+++ linux-2.6.23-rc8-mm2/mm/page-writeback.c
@@ -98,6 +98,26 @@ EXPORT_SYMBOL(laptop_mode);

/* End of sysctl-exported parameters */

+#define writeback_debug_report(n, wbc) do { \
+        __writeback_debug_report(n, wbc, __FILE__, __LINE__, __FUNCTION__); \
+} while (0)
+
+void __writeback_debug_report(long n, struct writeback_control *wbc,
+                const char *file, int line, const char *func)
+{
+        printk("%s %d %s: %s(%d) %ld "
+                "global %lu %lu %lu "
+                "wc %c%c tw %ld sk %ld\n",
+                file, line, func,
+                current->comm, current->pid, n,
+                global_page_state(NR_FILE_DIRTY),
+                global_page_state(NR_WRITEBACK),
+                global_page_state(NR_UNSTABLE_NFS),
+                wbc->encountered_congestion ? 'C':'_',
+                wbc->more_io ? 'M':'_',
+                wbc->nr_to_write,
+                wbc->pages_skipped);
+}

static void background_writeout(unsigned long _min_pages);

@@ -404,6 +424,7 @@ static void balance_dirty_pages(struct a
                pages_written += write_chunk - wbc.nr_to_write;
                get_dirty_limits(&background_thresh, &dirty_thresh,
                                &bdi_thresh, bdi);
+               writeback_debug_report(pages_written, &wbc);
        }

/*
@@ -568,6 +589,7 @@ static void background_writeout(unsigned
                wbc.pages_skipped = 0;
                writeback_inodes(&wbc);
                min_pages -= MAX_WRITEBACK_PAGES - wbc.nr_to_write;
+               writeback_debug_report(min_pages, &wbc);
                if (wbc.nr_to_write > 0 || wbc.pages_skipped > 0) {
                        /* Wrote less than expected */
                        if (wbc.encountered_congestion)
@@ -643,6 +665,7 @@ static void wb_kupdate(unsigned long arg
                wbc.encountered_congestion = 0;
                wbc.nr_to_write = MAX_WRITEBACK_PAGES;
                writeback_inodes(&wbc);
+               writeback_debug_report(nr_to_write, &wbc);
                if (wbc.nr_to_write > 0) {
                        if (wbc.encountered_congestion)
                                congestion_wait(WRITE, HZ/10);
