Re: [patch 0/5] mm: per-zone dirty limiting
From: Johannes Weiner
Date: Tue Sep 20 2011 - 08:20:21 EST
Hi, sorry for the long delay,
On Wed, Aug 03, 2011 at 02:18:11PM +0100, Mel Gorman wrote:
> On Tue, Aug 02, 2011 at 02:17:33PM +0200, Johannes Weiner wrote:
> > My theory is that the closer (file_pages - dirty_pages) is to the high
> > watermark which kswapd tries to balance to, the more likely it is to
> > run into dirty pages. And to my knowledge, these tests are run with a
> > non-standard 40% dirty ratio, which lowers the threshold at which
> > perzonedirty falls apart. Per-zone dirty limits should probably take
> > the high watermark into account.
> >
>
> That would appear sensible. The choice of 40% dirty ratio is deliberate.
> My understanding is a number of servers that are IO intensive will have
> dirty ratio tuned to this value. On bug reports I've seen for distro
> kernels related to IO slowdowns, it seemed to be a common choice. I
> suspect it's tuned to this because it used to be the old default. Of
> course, 40% also made the writeback problem worse so the effect of the
> patches is easier to see.
Agreed.
It was just meant as an observation/possible explanation for why this
might exacerbate adverse effects, no blaming, rest assured :)
I added a patch that excludes reserved pages from dirtyable memory and
file writes are now down to the occasional hundred pages once in ten
runs, even with a dirty ratio of 40%. I even ran a test with 40%
background and 80% foreground limit for giggles and still no writeouts
from reclaim with this patch, so this was probably it.
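
Roughly, the idea of that patch is the below (just a sketch to
illustrate, not the actual diff; zone_dirtyable_memory() is a made-up
name here and the exact accounting may differ):

        static unsigned long zone_dirtyable_memory(struct zone *zone)
        {
                unsigned long nr_pages;

                nr_pages = zone_page_state(zone, NR_FREE_PAGES) +
                           zone_page_state(zone, NR_FILE_PAGES);

                /*
                 * kswapd balances the zone back up to the high watermark,
                 * so the pages below it are recycled continuously.  Do not
                 * count them as dirtyable, or the zone's dirty limit can
                 * be reached right where reclaim is scanning and kswapd
                 * runs into dirty pages.
                 */
                if (nr_pages <= high_wmark_pages(zone))
                        return 0;

                return nr_pages - high_wmark_pages(zone);
        }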
> > What makes me wonder, is that in addition, something in perzonedirty
> > makes kswapd less efficient in the 4G tests, which is the opposite
> > effect it had in all other setups. This increases direct reclaim
> > invocations against the preferred Normal zone. The higher pressure
> > could also explain why reclaim rushes through the clean pages and runs
> > into dirty pages quicker.
> >
> > Does anyone have a theory about what might be going on here?
> >
>
> This is tenuous at best and I confess I have not thought deeply
> about it but it could be due to the relative age of the pages in the
> highest zone.
>
> In the vanilla kernel, the Normal zone gets filled with dirty pages
> first and then the lower zones get used up until dirty ratio when
> flusher threads get woken. Because the highest zone also has the
> oldest pages and presumably the oldest inodes, the zone gets fully
> cleaned by the flusher. The pattern is "fill zone with dirty pages,
> use lower zones, highest zone gets fully cleaned, reclaimed, and refilled
> with dirty pages, repeat"
>
> In the patched kernel, lower zones are used when the dirty limits of a
> zone are met and the flusher threads are woken to clean a small number
> of pages but not the full zone. Reclaim takes the clean pages and they
> get replaced with younger dirty pages. Over time, the highest zone
> becomes a mix of old and young dirty pages. The flusher threads run
> but instead of cleaning the highest zone first, it is cleaning a mix
> of pages across all the zones. If this were the case, kswapd would end
> up writing more pages from the higher zone and stalling as a result.
>
> A further problem could be that direct reclaimers are hitting that new
> congestion_wait(). Unfortunately, I was not running with stats enabled
> to see what the congestion figures looked like.
The throttling could indeed uselessly force a NOFS allocation to wait
a bit without making progress, and kswapd could in turn get stuck
behind that allocating task when calling into the fs.
I dropped the throttling completely for now; the zone dirty limits
are only applied in the allocator fast path to distribute allocations,
not to throttle or write back anything.  The direct reclaim
invocations are no longer increased.
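
To illustrate, the check in the allocator fast path looks roughly like
this (a sketch, not the final patch; assume __GFP_WRITE marks page
cache write allocations and zone_dirty_ok() is the per-zone dirty
check, the names and exact placement may still change):

        /* in get_page_from_freelist(), during the zonelist walk */
        for_each_zone_zonelist_nodemask(zone, z, zonelist,
                                        high_zoneidx, nodemask) {
                /* watermark and cpuset checks omitted */

                /*
                 * If a page cache write allocation would push this zone
                 * over its share of the dirty limit, skip it and fall
                 * back to a lower zone -- no throttling, no writeback
                 * issued from here.
                 */
                if ((gfp_mask & __GFP_WRITE) && !zone_dirty_ok(zone))
                        continue;

                /* otherwise go ahead and try to allocate from this zone */
        }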
This still leaves the problem of allocations whose allowable zones are in
sum not big enough to trigger the global limit, but the series is
still useful without it and we can handle such situations in later
patches.
Thanks for your input, Mel, I'll shortly send out the latest revision.