Re: readahead and oom

From: Wu Fengguang
Date: Thu Apr 28 2011 - 00:20:04 EST


On Wed, Apr 27, 2011 at 03:47:43AM +0800, Andrew Morton wrote:
> On Tue, 26 Apr 2011 17:20:29 +0800
> Wu Fengguang <fengguang.wu@xxxxxxxxx> wrote:
>
> > Pass __GFP_NORETRY|__GFP_NOWARN for readahead page allocations.
> >
> > readahead page allocations are completely optional. They are OK to
> > fail and in particular shall not trigger OOM on themselves.
>
> I have distinct recollections of trying this many years ago, finding
> that it caused problems then deciding not to do it. But I can't find
> an email trail and I don't remember the reasons :(

The most likely reason is that page allocations can fail even when
there are plenty of _global_ reclaimable pages.

> If the system is so stressed for memory that the oom-killer might get
> involved then the readahead pages may well be getting reclaimed before
> the application actually gets to use them. But that's just an aside.

Yes, when direct reclaim is working as expected, readahead thrashing
should happen long before NORETRY page allocation failures and OOM.

With that assumption I think it's OK to do this patch: for readahead,
sporadic allocation failures are acceptable. But there is a problem,
see below.
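
To make "sporadic allocation failures are acceptable" concrete, here is
a minimal user-space sketch (not the kernel's readahead code; the names
model_readahead, page_alloc_fn and budget_alloc are made up for
illustration). It assumes the readahead loop simply stops at the first
failed allocation and leaves the rest of the window to be read later on
demand:

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical allocator hook: returns NULL when the attempt fails. */
typedef void *(*page_alloc_fn)(void *ctx);

/*
 * Try to allocate up to nr_to_read pages for one readahead window.
 * On the first failure, stop and return the number of pages obtained
 * so far. There is no retry and no error: readahead is optional work.
 */
static size_t model_readahead(page_alloc_fn alloc, void *ctx,
			      void **pages, size_t nr_to_read)
{
	size_t n;

	for (n = 0; n < nr_to_read; n++) {
		void *page = alloc(ctx);

		if (!page)
			break;		/* give up quietly, keep what we have */
		pages[n] = page;
	}
	return n;
}

/* Example allocator: succeeds until a budget of pages is used up. */
static void *budget_alloc(void *ctx)
{
	size_t *budget = ctx;

	if (*budget == 0)
		return NULL;
	(*budget)--;
	return malloc(4096);
}
```

With a budget of 3 pages and a window of 8, model_readahead() returns 3
and the caller carries on with a shorter window.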

> Ho hum. The patch *seems* good (as it did 5-10 years ago ;)) but there
> may be surprising side-effects which could be exposed under heavy
> testing. Testing which I'm sure hasn't been performed...

The NORETRY direct reclaim does tend to fail a lot more under
concurrent reclaim, where the pages one task has reclaimed can be
stolen by others before it is able to allocate them.

	__alloc_pages_direct_reclaim()
	{
		did_some_progress = try_to_free_pages();

		// pages stolen by others

		page = get_page_from_freelist();
	}
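
The window above can be illustrated with a toy user-space model. This
is only a sketch under simplifying assumptions, not kernel code: the
free list is a plain counter, the competing tasks have a fixed number
of pages they still want, and every name (zone_model, model_alloc, ...)
is hypothetical.

```c
/* Toy user-space model, not kernel code. All names are hypothetical. */
struct zone_model {
	long nr_free;		/* pages currently on the free list */
	long pending_demand;	/* pages that competing tasks still want */
};

/* Stand-in for try_to_free_pages(): put 'batch' pages on the free list. */
static long model_reclaim(struct zone_model *z, long batch)
{
	z->nr_free += batch;
	return batch;
}

/* Competing tasks run before we do and take whatever they still want. */
static void model_steal(struct zone_model *z)
{
	long take = z->pending_demand < z->nr_free ?
		    z->pending_demand : z->nr_free;

	z->nr_free -= take;
	z->pending_demand -= take;
}

/* Stand-in for get_page_from_freelist(): 1 on success, 0 if empty. */
static int model_get_page(struct zone_model *z)
{
	if (z->nr_free == 0)
		return 0;
	z->nr_free--;
	return 1;
}

/*
 * Slow-path allocation. max_passes == 1 models a NORETRY caller that
 * gives up after one reclaim pass; larger values model a caller that
 * retries. Returns 1 if a page was obtained, 0 otherwise.
 */
static int model_alloc(struct zone_model *z, long batch, int max_passes)
{
	for (int pass = 0; pass < max_passes; pass++) {
		model_reclaim(z, batch);
		model_steal(z);		/* pages stolen by others */
		if (model_get_page(z))
			return 1;
	}
	return 0;
}
```

In this model, with a reclaim batch of 32 and competitors that still
want 40 pages, a single-pass (NORETRY-like) caller fails even though it
made progress, while a second pass succeeds because the competitors'
demand has been satisfied by then.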

Here are the tests to demonstrate this problem.

Out of 1000GB of reads and the corresponding page allocations:

test-ra-thrash.sh: read 1000 1G files interleaved in a single task:

nr_alloc_fail 733

test-dd-sparse.sh: read 1000 1G files concurrently in 1000 tasks:

nr_alloc_fail 11799
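
To put the two counts in proportion, here is a small sketch of the
arithmetic. It assumes 4KB pages, 1G = 2^30 bytes, and one page
allocation per page read; none of these is stated above, and the helper
names are made up for illustration.

```c
#include <stdint.h>

#define PAGE_SIZE_ASSUMED 4096ULL	/* assumption: 4KB pages */

/* Number of page-sized reads for nr_files files of file_bytes each. */
static uint64_t pages_read(uint64_t nr_files, uint64_t file_bytes)
{
	return nr_files * (file_bytes / PAGE_SIZE_ASSUMED);
}

/* Allocation failures per million page allocations. */
static double failures_per_million(uint64_t nr_fail, uint64_t nr_pages)
{
	return (double)nr_fail * 1e6 / (double)nr_pages;
}
```

Under those assumptions, pages_read(1000, 1ULL << 30) is 262144000
pages, so 733 failures is about 2.8 per million allocations and 11799
is about 45 per million, roughly 16 times more in the concurrent case.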


Thanks,
Fengguang
---

--- linux-next.orig/include/linux/mmzone.h 2011-04-27 21:58:27.000000000 +0800
+++ linux-next/include/linux/mmzone.h 2011-04-27 21:58:39.000000000 +0800
@@ -106,6 +106,7 @@ enum zone_stat_item {
 	NR_SHMEM, /* shmem pages (included tmpfs/GEM pages) */
 	NR_DIRTIED, /* page dirtyings since bootup */
 	NR_WRITTEN, /* page writings since bootup */
+	NR_ALLOC_FAIL,
 #ifdef CONFIG_NUMA
 	NUMA_HIT, /* allocated in intended node */
 	NUMA_MISS, /* allocated in non intended node */
--- linux-next.orig/mm/page_alloc.c 2011-04-27 21:58:27.000000000 +0800
+++ linux-next/mm/page_alloc.c 2011-04-27 21:58:39.000000000 +0800
@@ -2176,6 +2176,8 @@ rebalance:
 	}
 
 nopage:
+	inc_zone_state(preferred_zone, NR_ALLOC_FAIL);
+	/* count_zone_vm_events(PGALLOCFAIL, preferred_zone, 1 << order); */
 	if (!(gfp_mask & __GFP_NOWARN) && printk_ratelimit()) {
 		unsigned int filter = SHOW_MEM_FILTER_NODES;

--- linux-next.orig/mm/vmstat.c 2011-04-27 21:58:27.000000000 +0800
+++ linux-next/mm/vmstat.c 2011-04-27 21:58:53.000000000 +0800
@@ -879,6 +879,7 @@ static const char * const vmstat_text[]
 	"nr_shmem",
 	"nr_dirtied",
 	"nr_written",
+	"nr_alloc_fail",
 
 #ifdef CONFIG_NUMA
 	"numa_hit",

Attachment: test-dd-sparse.sh
Description: Bourne shell script

Attachment: test-ra-thrash.sh
Description: Bourne shell script
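
With the debugging patch above applied, the new counter shows up as an
nr_alloc_fail line in /proc/vmstat, which uses one "name value" pair
per line. A minimal sketch of reading it from C follows. The helper
names vmstat_lookup and read_vmstat are made up for illustration, and
the counter only exists on a kernel carrying this patch.

```c
#include <stdio.h>
#include <string.h>

/*
 * Return the value of counter 'name' in vmstat-formatted text
 * ("name value" per line), or -1 if the counter is not present.
 */
static long long vmstat_lookup(const char *text, const char *name)
{
	size_t len = strlen(name);
	const char *line = text;

	while (line && *line) {
		/* require a full-name match followed by a space */
		if (strncmp(line, name, len) == 0 && line[len] == ' ') {
			long long val;

			if (sscanf(line + len, "%lld", &val) == 1)
				return val;
			return -1;
		}
		line = strchr(line, '\n');
		if (line)
			line++;		/* step past the newline */
	}
	return -1;
}

/* Read /proc/vmstat into buf (at most size - 1 bytes). 0 on success. */
static int read_vmstat(char *buf, size_t size)
{
	FILE *f = fopen("/proc/vmstat", "r");
	size_t n;

	if (!f || size == 0) {
		if (f)
			fclose(f);
		return -1;
	}
	n = fread(buf, 1, size - 1, f);
	buf[n] = '\0';
	fclose(f);
	return 0;
}
```

Sampling vmstat_lookup(buf, "nr_alloc_fail") before and after a test
run and subtracting gives the number of failures for that run.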