On Wed, 15 Jul 2009 22:38:53 -0400 Rik van Riel <riel@xxxxxxxxxx> wrote:
> When way too many processes go into direct reclaim, it is possible
> for all of the pages to be taken off the LRU.  One result of this
> is that the next process in the page reclaim code thinks there are
> no reclaimable pages left and triggers an out of memory kill.
>
> One solution to this problem is to never let so many processes into
> the page reclaim path that the entire LRU is emptied.  Limiting the
> system to only having half of each inactive list isolated for
> reclaim should be safe.
Since when?  Linux page reclaim has a billion machine-years of testing
and now stuff like this turns up.  Did we break it, or is this a
never-before-discovered workload?
> @@ -1049,6 +1070,10 @@ static unsigned long shrink_inactive_lis
>  	struct zone_reclaim_stat *reclaim_stat = get_reclaim_stat(zone, sc);
>  	int lumpy_reclaim = 0;
>  
> +	while (unlikely(too_many_isolated(zone, file))) {
> +		schedule_timeout_interruptible(HZ/10);
> +	}
This (incorrectly-laid-out) code is a no-op if signal_pending().