Re: [RFC PATCH 0/3] Avoid the use of congestion_wait under zonepressure

From: Mel Gorman
Date: Tue Mar 16 2010 - 06:12:13 EST


On Mon, Mar 15, 2010 at 01:09:35PM -0700, Andrew Morton wrote:
> On Mon, 15 Mar 2010 13:34:50 +0100
> Christian Ehrhardt <ehrhardt@xxxxxxxxxxxxxxxxxx> wrote:
>
> > c) If direct reclaim did reasonable progress in try_to_free but did not
> > get a page, AND there is no write in flight at all then let it try again
> > to free up something.
> > This could be extended by some kind of max retry to avoid some weird
> > looping cases as well.
> >
> > d) Another way might be as easy as letting congestion_wait return
> > immediately if there are no outstanding writes - this would keep the
> > behavior for cases with write and avoid the "running always in full
> > timeout" issue without writes.
>
> They're pretty much equivalent and would work. But there are two
> things I still don't understand:
>

Unfortunately, this regression is very poorly understood. I haven't been able
to reproduce it locally and while Christian has provided various debugging
information, it still isn't clear why the problem occurs now.

> 1: Why is direct reclaim calling congestion_wait() at all? If no
> writes are going on there's lots of clean pagecache around so reclaim
> should trivially succeed. What's preventing it from doing so?
>

Memory pressure I think. The workload involves 16 processes (see
http://lkml.org/lkml/2009/12/7/237). I suspect they are all direct reclaimers
and some processes are getting their pages stolen before they have a
chance to allocate them. It's knowing that adding a small amount of
memory "fixes" this problem.

> 2: This is, I think, new behaviour. A regression. What caused it?
>

Short answer, I don't know.

Longer answer. Initially, this was reported as being caused by commit e084b2d:
page-allocator: preserve PFN ordering when __GFP_COLD is set but it was never
established why and reverting it was unpalatable because it fixed another
performance problem. According to Christian, the controller does nothing
with the merging of IO requests and he was very sure about this. As all the
patch does is change the order that pages are returned in and the timing
slightly due to differences in cache hotness, although the fact that such
a small change could make a big difference in reclaim later was surprising.
There were other bugs that might have complicated this such as errors in free
page counters but they were fixed up and the problem still did not go away.

It was after much debugging that it was found that direct reclaim was
returning, the subsequent allocation attempt failed and congestion_wait()
was called but without dirty pages, congestion or writes, it waits for
the full timeout. congestion_wait() was also being called a lot more
frequently so something was causing reclaim to fail more frequently
(http://lkml.org/lkml/2009/12/18/150). Again, I couldn't figure out why
e084b2d would make a difference.

Later, it got even worse because patches e084b2d and 5f8dcc21 had to be
reverted in 2.6.33 to "resolve" the problem. 5f8dcc21 was more plausible as it
affected how many pages were on the per-cpu lists but making it behave like
2.6.32 did not help the situation. Again, it looked like a very small timing
problem but it could not be isolated exactly why reclaim would fail. Again,
other bugs were found and fixed but made no difference.

What lead to this patch was recognising we could enter congestion_wait()
and wait the entire timeout because no writes were in progress or dirty
pages to be cleaned. As what was really of interest was watermarks in
this path, the patch intended to make the page allocator care about
watermarks instead of congestion. We know it was treating symptoms
rather than understanding the underlying problem but I was somewhat at a
loss to explain why small changes in timing made such a large
difference.

Any new insight is welcome.

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/