On Wed, 15 Jul 2009 23:10:43 -0400 Rik van Riel <riel@xxxxxxxxxx> wrote:

> Andrew Morton wrote:
> > On Wed, 15 Jul 2009 22:38:53 -0400 Rik van Riel <riel@xxxxxxxxxx> wrote:
> >
> > > When way too many processes go into direct reclaim, it is possible
> > > for all of the pages to be taken off the LRU. One result of this
> > > is that the next process in the page reclaim code thinks there are
> > > no reclaimable pages left and triggers an out of memory kill.
> > >
> > > One solution to this problem is to never let so many processes into
> > > the page reclaim path that the entire LRU is emptied. Limiting the
> > > system to only having half of each inactive list isolated for
> > > reclaim should be safe.
> >
> > Since when? Linux page reclaim has a billion machine years of testing and
> > now stuff like this turns up. Did we break it or is this a
> > never-before-discovered workload?
>
> It's been there for years, in various forms. It hardly ever
> shows up, but Kosaki's patch series gives us a nice chance to
> fix it for good.

OK.
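
For reference, the too_many_isolated() check that the hunk below relies on
is not quoted in this thread. A minimal sketch of the "half of each inactive
list" limit described above could look something like the following, assuming
the per-zone NR_ISOLATED_ANON/NR_ISOLATED_FILE counters that the rest of the
series introduces:

static int too_many_isolated(struct zone *zone, int file)
{
	unsigned long inactive, isolated;

	/* kswapd does the bulk of reclaim and should not be throttled here */
	if (current_is_kswapd())
		return 0;

	if (file) {
		inactive = zone_page_state(zone, NR_INACTIVE_FILE);
		isolated = zone_page_state(zone, NR_ISOLATED_FILE);
	} else {
		inactive = zone_page_state(zone, NR_INACTIVE_ANON);
		isolated = zone_page_state(zone, NR_ISOLATED_ANON);
	}

	/*
	 * Once more pages are isolated than remain on the list, over half
	 * of (inactive + isolated) has been pulled off the LRU, so hold
	 * further direct reclaimers back.
	 */
	return isolated > inactive;
}

The isolated > inactive comparison is what caps isolation at roughly half of
the list, which is what the loop in the quoted hunk then waits on.
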
> > > @@ -1049,6 +1070,10 @@ static unsigned long shrink_inactive_lis
> > >  	struct zone_reclaim_stat *reclaim_stat = get_reclaim_stat(zone, sc);
> > >  	int lumpy_reclaim = 0;
> > >
> > > +	while (unlikely(too_many_isolated(zone, file))) {
> > > +		schedule_timeout_interruptible(HZ/10);
> > > +	}
> >
> > This (incorrectly-laid-out) code is a no-op if signal_pending().
>
> Good point, I should add some code to break out of page reclaim
> if a fatal signal is pending.

We can't just return NULL from __alloc_pages(), and if we can't
get a page from the freelists then we're just going to have to keep
reclaiming. So I'm not sure how we can do this.
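
One way to reconcile these two points might be for the throttling loop itself
to notice a fatal signal and return a nonzero "progress" value, so the dying
task escapes reclaim quickly while __alloc_pages() keeps its normal retry
behaviour. A rough sketch, where the SWAP_CLUSTER_MAX return value and the
switch from schedule_timeout_interruptible() to congestion_wait() are
assumptions rather than anything posted in this thread:

	while (unlikely(too_many_isolated(zone, file))) {
		/*
		 * A task with a fatal signal pending is about to exit and
		 * free its memory; pretend a batch was reclaimed so the
		 * allocator retries instead of declaring OOM.
		 */
		if (fatal_signal_pending(current))
			return SWAP_CLUSTER_MAX;

		/* Otherwise wait for the other reclaimers to make progress */
		congestion_wait(WRITE, HZ/10);
	}

Returning SWAP_CLUSTER_MAX makes the caller believe progress was made, which
avoids the spurious OOM kill without teaching __alloc_pages() to return NULL,
and congestion_wait() does not return early just because a signal is pending.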