Re: upcoming kerneloops.org item: get_page_from_freelist

From: David Rientjes
Date: Tue Jun 30 2009 - 04:13:48 EST


On Tue, 30 Jun 2009, Nick Piggin wrote:

> > That's not the expected behavior for TIF_MEMDIE, although your patch
> > certainly changes that.
> >
> > Your patch is simply doing
> >
> > if (test_thread_flag(TIF_MEMDIE))
> > gfp_mask |= __GFP_NORETRY;
> >
> > in the slowpath.
> >
> > TIF_MEMDIE is supposed to allow allocations to succeed, not automatically
> > fail, so that it can quickly handle its SIGKILL without getting blocked in
> > the exit path seeking more memory.
>
> Yes, it need to just ignore all watermarks, do not reclaim (we've
> already decided reclaim will not work at this point), and return a
> page if we have one otherwise NULL (unless GFP_NOFAIL is set).
>

Right, there's no sense in looping endlessly for ~__GFP_NOFAIL if
allocations continue to fail for a thread with TIF_MEMDIE set.

TIF_MEMDIE doesn't check any watermarks as opposed to GFP_ATOMIC, which
only reduces the min watermark by half, so we can access more memory
reserves with TIF_MEMDIE. Instead of immediately failing an oom killed
task's allocation as in Mel's patch, there is a higher liklihood that it
will succeed on the next attempt.

I'd agree with Mel's added check for TIF_MEMDIE upon returning from the
oom killer, but only for __GFP_NOMEMALLOC.

> > All __GFP_NOFAIL allocations should ensure that alloc_pages() never
> > returns NULL. Although it's unfortunate, that's the requirement that
> > callers have been guaranteed and until they are fixed, the page allocator
> > should respect it.
>
> Yes.
>
> Interesting thing is what to do when we have 0 pages left, we are
> TIF_MEMDIE, and GFP_NOFAIL is set. Looping will most likely just
> deadlock the system. Returning NULL will probably oops caller with
> various locks held and then deadlock the system. It really needs to
> punt back to the OOM killer so it can select another task. Until
> then, maybe a simple panic would be reasonable? (it's *never* going
> to hit anyone in practice I'd say, but if it does then a panic
> would be better than lockup at least we know what the problem was).
>

The oom killer currently is a no-op if any eligible task has TIF_MEMDIE,
so this would require adding an oom killer timeout so that if a task fails
to exit after a predefined period, TIF_MEMDIE is cleared and the task is
marked to no longer be selected (which would require an addition to
task_struct) although it may have already completely depleted memory
reserves.

The best alternative is just to increase min_free_kbytes to ensure that
adequate memory reserves (and its partial exemptions allowed by
GFP_ATOMIC, ALLOC_HARDER, and PF_MEMALLOC) are sustained for an oom killed
task to exit and that we try hard to avoid getting stuck in
TASK_UNINTERRUPTIBLE.

> > I disagree with this change because it unconditionally fails allocations
> > when a task has been oom killed, a scenario which should be the _highest_
> > priority for allocations to succeed since it leads to future memory
>
> That's another interesting point. I do agree with you because that
> would restore previous behaviour which got broken. But I wonder if
> possibly it would be a better idea to fail all allocations? That
> would a) protect reserves more, and b) probably quite likely to
> exit the syscall *sooner* than if we try to satisfy all allocations.
>

You could only fail the single allocation where you triggered the oom
killer and you were the task chosen to die, which is what Mel's patch
implemented in the first half. I agree that would protect the memory
reserves more.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/