Re: [PATCH][5/?] count writeback pages in nr_scanned

From: Andrew Morton
Date: Thu Jan 06 2005 - 00:34:54 EST


Nick Piggin <nickpiggin@xxxxxxxxxxxx> wrote:
>
> Andrea Arcangeli wrote:
> > On Wed, Jan 05, 2005 at 09:05:39PM -0800, Andrew Morton wrote:
> >
> >>Andrea Arcangeli <andrea@xxxxxxx> wrote:
> >>
> >>>The fix is very simple and it is to call wait_on_page_writeback on one
> >>> of the pages under writeback.
> >>
> >>eek, no. That was causing waits of five seconds or more. Fixing this
> >>caused the single greatest improvement in page allocator latency in early
> >>2.5. We're totally at the mercy of the elevator algorithm this way.
> >>
> >>If we're to improve things in there we want to wait on _any_ eligible page
> >>becoming reclaimable, not on a particular page.
> >
> >
> > I told you one way to fix it. I didn't guarantee it was the most
> > efficient one.
> >

And I've already described the efficient way to "fix" it. Twice.

> > I sure agree waiting on any page to complete writeback is going to fix
> > it too. Exactly because this page was a "random" page anyway.
> >
> > Still my point is that this is a bug, and I prefer to be slow and safe
> > like 2.4 rather than fast and unreliable like 2.6.
> >
> > The slight improvement you suggested of waiting on _any_ random
> > PG_writeback to go away (instead of one particular one as I did in 2.4)

It's a HUGE improvement.

"Example: with `mem=512m', running 4 instances of `dbench 100', 2.5.34
took 35 minutes to compile a kernel. With this patch, it took three
minutes, 45 seconds."

Plus this was the change which precipitated all the I/O scheduler
development, because it caused us to keep the queues full all the time and
the old I/O scheduler collapsed.

> > is going to fix the write throttling just as well as the 2.4 logic did,
> > but without introducing the slowdown that 2.4 had.
> >
> > It's easy to demonstrate: exactly because the page we pick is random
> > anyway, we can pick the first random one that has seen PG_writeback
> > transitioning from 1 to 0. The guarantee we get is the same in terms of
> > safety of the write throttling, but we also guarantee the best possible
> > latency this way. And the HZ/x hacks to avoid deadlocks will magically
> > go away too.
> >
>
> This is practically what blk_congestion_wait does when the queue
> isn't congested though, isn't it?

Pretty much. Except:

- Doing a wakeup when a write request is retired corresponds to releasing
a batch of pages, not a single page. Usually.

- direct-io writes could confuse it.
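
For reference, the throttling in question is roughly this in today's
try_to_free_pages() (from memory, so treat the exact test as approximate):

	/*
	 * mm/vmscan.c, approximately: after scanning at each priority level,
	 * throttle by sleeping on the request queue's congestion waitqueue
	 * rather than on any particular page.
	 */
	if (total_scanned && priority < DEF_PRIORITY - 2)
		blk_congestion_wait(WRITE, HZ/10);

So the sleeper neither knows nor cares which zone it is actually short of.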

For the third time: "fixing" this involves delivering a wakeup to all zones
in the page's classzone in end_page_writeback(), and passing the zone* into
blk_congestion_wait(). Only deliver the wakeup on every Nth page to get a
bit of batching and to reduce CPU consumption. Then demonstrating that the
change actually improves something.
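
Roughly the shape that takes, as a sketch only (the zone fields, the batch
size and the next_zone_up() helper below are invented for illustration and
don't exist in the tree):

	#define WB_WAKEUP_BATCH	32	/* wake throttled tasks every Nth page */

	/* called from end_page_writeback() once PG_writeback has been cleared */
	static void writeback_release_throttle(struct page *page)
	{
		struct zone *zone = page_zone(page);

		/* batch the wakeups: most completions return immediately */
		if (atomic_inc_return(&zone->writeback_done) % WB_WAKEUP_BATCH)
			return;

		/*
		 * A page in this zone is reclaimable for any classzone which
		 * includes this zone, so deliver the wakeup to all of them.
		 */
		for (; zone; zone = next_zone_up(zone))	/* hypothetical helper */
			if (waitqueue_active(&zone->writeback_wait))
				wake_up(&zone->writeback_wait);
	}

	/*
	 * Zone-aware replacement for blk_congestion_wait(): sleep until a
	 * batch of pages usable by this classzone completes writeback, or
	 * the timeout expires, instead of waiting on one particular page.
	 */
	long zone_writeback_wait(struct zone *zone, long timeout)
	{
		DEFINE_WAIT(wait);

		prepare_to_wait(&zone->writeback_wait, &wait, TASK_UNINTERRUPTIBLE);
		timeout = io_schedule_timeout(timeout);
		finish_wait(&zone->writeback_wait, &wait);
		return timeout;
	}

The waiter side is just what blk_congestion_wait() does now, only keyed on
the waitqueue of the zone the caller is actually reclaiming for rather than
on the request queue.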