[PATCH] Remove OOM killer from try_to_free_pages / all_unreclaimable braindamage

From: Marcelo Tosatti
Date: Fri Nov 05 2004 - 18:20:15 EST


Hi,

As you know the OOM is very problematic in 2.6 right now - so I went
to investigate it.

Currently the oom killer is invoked from the task reclaim
code (try_to_free_pages), which IMO is fundamentally broken,
because its non deterministic - the chance the OOM killer
will be triggered increases as the number of tasks inside
reclaiming increases. And kswapd is freeing pages in parallel,
which is completly ignored by this approach.

In my opinion the correct approach is to trigger the OOM killer
when kswapd is unable to free pages. Once that is done, the number
of tasks inside page reclaim is irrelevant.

So the following patch moves the out_of_memory() call to
balance_pgdat(), and makes it dependant on success reclaiming pages
when work has actually been done, and on priority reaching zero.

It also removes the "If it's been a long time since last failure dont
OOM kill" logic which for me just tries to paper over a bigger issue.

Relying on this information (kswapd failure after DEF_PRIORITY passes)
to trigger the OOM killer seems to be very reliable - it needs some
more testing though.

With this in place I can't see spurious OOM kills - just need to guarantee
that it reliably OOM kills when we are really out of memory.

While doing this, I noticed that kswapd will happily go to sleep
if all zones have all_unreclaimable set. I bet this is the reason
for the page allocation failures we are seeing. So the patch
also makes balance_pgdat() NOT return and go to "loop_again"
instead in case of page shortage - even if all_unreclaimable is set.

Basically the "loop_again" logic IS NOT WORKING!

Comments?

My wife is almost killing me, its Friday night and I've been telling her
"just another minute" for hours. Have to run.

diff -Nur --show-c-function --exclude='*.orig' linux-2.6.10-rc1-mm2.orig/mm/oom_kill.c linux-2.6.10-rc1-mm2/mm/oom_kill.c
--- linux-2.6.10-rc1-mm2.orig/mm/oom_kill.c 2004-11-04 22:50:50.000000000 -0200
+++ linux-2.6.10-rc1-mm2/mm/oom_kill.c 2004-11-05 18:33:29.918459072 -0200
@@ -240,23 +240,23 @@ void out_of_memory(int gfp_mask)
* If it's been a long time since last failure,
* we're not oom.
*/
- if (since > 5*HZ)
- goto reset;
+ //if (since > 5*HZ)
+ // goto reset;

/*
* If we haven't tried for at least one second,
* we're not really oom.
*/
- since = now - first;
- if (since < HZ)
- goto out_unlock;
+ //since = now - first;
+ //if (since < HZ)
+ // goto out_unlock;

/*
* If we have gotten only a few failures,
* we're not really oom.
*/
- if (++count < 10)
- goto out_unlock;
+// if (++count < 10)
+// goto out_unlock;

/*
* If we just killed a process, wait a while
diff -Nur --show-c-function --exclude='*.orig' linux-2.6.10-rc1-mm2.orig/mm/vmscan.c linux-2.6.10-rc1-mm2/mm/vmscan.c
--- linux-2.6.10-rc1-mm2.orig/mm/vmscan.c 2004-11-04 22:50:50.000000000 -0200
+++ linux-2.6.10-rc1-mm2/mm/vmscan.c 2004-11-05 19:53:35.915836416 -0200
@@ -952,8 +952,6 @@ int try_to_free_pages(struct zone **zone
if (sc.nr_scanned && priority < DEF_PRIORITY - 2)
blk_congestion_wait(WRITE, HZ/10);
}
- if ((gfp_mask & __GFP_FS) && !(gfp_mask & __GFP_NORETRY))
- out_of_memory(gfp_mask);
out:
for (i = 0; zones[i] != 0; i++) {
struct zone *zone = zones[i];
@@ -997,13 +995,15 @@ static int balance_pgdat(pg_data_t *pgda
int all_zones_ok;
int priority;
int i;
- int total_scanned, total_reclaimed;
+ int total_scanned, total_reclaimed, worked;
struct reclaim_state *reclaim_state = current->reclaim_state;
struct scan_control sc;

+
loop_again:
total_scanned = 0;
total_reclaimed = 0;
+ worked = 0;
sc.gfp_mask = GFP_KERNEL;
sc.may_writepage = 0;
sc.nr_mapped = read_page_state(nr_mapped);
@@ -1033,6 +1033,10 @@ loop_again:
if (zone->present_pages == 0)
continue;

+ if (!zone_watermark_ok(zone, order,
+ zone->pages_high, 0, 0, 0))
+ all_zones_ok = 0;
+
if (zone->all_unreclaimable &&
priority != DEF_PRIORITY)
continue;
@@ -1072,6 +1076,9 @@ scan:
if (zone->all_unreclaimable && priority != DEF_PRIORITY)
continue;

+ if (priority == 0)
+ worked = 1;
+
if (nr_pages == 0) { /* Not software suspend */
if (!zone_watermark_ok(zone, order,
zone->pages_high, end_zone, 0, 0))
@@ -1129,6 +1136,9 @@ out:
zone->prev_priority = zone->temp_priority;
}
if (!all_zones_ok) {
+ if (priority == 0 && !total_reclaimed && worked)
+ out_of_memory(GFP_KERNEL);
+
cond_resched();
goto loop_again;
}
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/