Re: upcoming kerneloops.org item: get_page_from_freelist
From: Mel Gorman
Date: Mon Jun 29 2009 - 11:30:47 EST
On Wed, Jun 24, 2009 at 02:56:15PM -0700, Andrew Morton wrote:
> On Wed, 24 Jun 2009 13:13:48 -0700 (PDT)
> Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx> wrote:
> >
> > On Wed, 24 Jun 2009, Andrew Morton wrote:
> > >
> > > If the caller gets oom-killed, the allocation attempt fails. Callers need
> > > to handle that.
> >
> > I actually disagree. I think we should just admit that we can always free
> > up enough space to get a few pages, in order to then oom-kill things.
>
> I'm unclear on precisely what you're proposing here?
>
As order <= PAGE_ALLOC_COSTLY_ORDER implies __GFP_NOFAIL, prehaps it
makes sense to change the check to
WARN_ON_ONCE(order > PAGE_ALLOC_COSTLY_ORDER)
? The temptation might be there to remove __GFP_NOFAIL for smaller orders but
it makes sense to have it available in case CONFIG_FAULT_INJECTION_DEBUG_FS
is set and randomly failing allocations that have serious consequences even
if handled.
> > This is not a new concept. oom has never been "immediately kill".
>
> Well, it has been immediate for a long time. A couple of reasons which
> I can recall:
>
> - A page-allocating process will oom-kill another process in the
> expectation that the killing will free up some memory. If the
> oom-killed process remains stuck in the page allocator, that doesn't
> work.
>
> - The oom-killed process might be holding locks (typically fs locks).
> This can cause an arbitrary number of other processes to be blocked.
> So to get the system unstuck we need the oom-killed process to
> immediately exit the page allocator, to handle the NULL return and to
> drop those locks.
>
> There may be other reasons - it was all a long time ago, and I've never
> personally hacked on the oom-killer much and I never get oom-killed.
> But given the amount of development work which goes on in there, some
> people must be getting massacred.
>
>
> A long time ago, the Suse kernel shipped with a largely (or
> completely?) disabled oom-killer. It removed the
> retry-small-allocations-for-ever logic and simply returned NULL to the
> caller. I never really understood what problem/thinking led Andrea to
> do that.
>
>
> But it's all a bit moot at present, as we seem to have removed the
> return-NULL-if-TIF_MEMDIE logic in Mel's post-2.6.30 merges. I think
> that was an accident:
>
> - /* This allocation should allow future memory freeing. */
> -
> rebalance:
> - if (((p->flags & PF_MEMALLOC) || unlikely(test_thread_flag(TIF_MEMDIE)))
> - && !in_interrupt()) {
> - if (!(gfp_mask & __GFP_NOMEMALLOC)) {
> -nofail_alloc:
> - /* go through the zonelist yet again, ignoring mins */
> - page = get_page_from_freelist(gfp_mask, nodemask, order,
> - zonelist, high_zoneidx, ALLOC_NO_WATERMARKS);
> - if (page)
> - goto got_pg;
> - if (gfp_mask & __GFP_NOFAIL) {
> - congestion_wait(WRITE, HZ/50);
> - goto nofail_alloc;
> - }
> - }
> - goto nopage;
> + /* Allocate without watermarks if the context allows */
> + if (alloc_flags & ALLOC_NO_WATERMARKS) {
> + page = __alloc_pages_high_priority(gfp_mask, order,
> + zonelist, high_zoneidx, nodemask,
> + preferred_zone, migratetype);
> + if (page)
> + goto got_pg;
> }
>
> Offending commit 341ce06 handled the PF_MEMALLOC case but forgot about
> the TIF_MEMDIE case.
>
> Mel is having a bit of downtime at present.
I'm getting back online now and playing catch-up. You're right in that
TIF_MEMDIE returning NULL has been broken and it's possible in theory for an
OOM-killed process to loop forever. But maybe TIF_MEMDIE looping potentially
forever is expected in the case __GFP_NOFAIL is specified. Fixing this to allow
an OOM-killed process to exit does mean that callers using __GFP_NOFAIL must
still handle NULL being returned which might be very unexpected to the caller.
In 2.6.30, TIF_MEMDIE could cause a request to exit without ever entering
direct reclaim but chances are this didn't happen as it would have been
looping during an OOM-kill. To duplicate this, a check for TIF_MEMDIE would
happen after
/* Avoid recursion of direct reclaim */
if (p->flags & PF_MEMALLOC)
goto nopage;
But as failing __GFP_NOFAIL is potentially serious, even for processes that
have been OOM killed, I think it makes more sense to check for TIF_MEMDIE
after direct reclaim and OOM killing have already been considered as options
with a patch such as the following?
==== CUT HERE ====
page-allocator: Ensure that processes that have been OOM killed exit the page allocator
Processes that have been OOM killed set the thread flag TIF_MEMDIE. A
process such as this is expected to exit the page allocator but in the
event it happens to have set __GFP_NOFAIL, it potentially loops forever.
This patch checks TIF_MEMDIE when deciding whether to loop again in the
page allocator. Such a process will now return NULL after direct reclaim
and OOM killing have both been considered as options. The potential
problem is that a __GFP_NOFAIL allocation can still return failure so
callers must still handle getting returned NULL.
Signed-off-by: Mel Gorman <mel@xxxxxxxxx>
---
mm/page_alloc.c | 8 ++++++++
1 file changed, 8 insertions(+)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 5d714f8..8449cf9 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1539,6 +1539,10 @@ should_alloc_retry(gfp_t gfp_mask, unsigned int order,
if (gfp_mask & __GFP_NORETRY)
return 0;
+ /* Do not loop if this process has been OOM-killed */
+ if (test_thread_flag(TIF_MEMDIE))
+ return 0;
+
/*
* In this implementation, order <= PAGE_ALLOC_COSTLY_ORDER
* means __GFP_NOFAIL, but that may not be true in other
@@ -1823,6 +1827,10 @@ rebalance:
!(gfp_mask & __GFP_NOFAIL))
goto nopage;
+ /* Do not loop if this process has been OOM-killed */
+ if (test_thread_flag(TIF_MEMDIE))
+ goto nopage;
+
goto restart;
}
}
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/