Re: [PATCH] fix spurious OOM kills

From: Andrea Arcangeli
Date: Thu Nov 11 2004 - 11:52:43 EST

On Thu, Nov 11, 2004 at 10:38:50AM -0200, Marcelo Tosatti wrote:
> Hi!
> On Thu, Nov 11, 2004 at 04:42:38PM +0100, Andrea Arcangeli wrote:
> > On Thu, Nov 11, 2004 at 09:29:22AM -0200, Marcelo Tosatti wrote:
> > > Hi,
> > >
> > > This is an improved version of OOM-kill-from-kswapd patch.
> > >
> > > I believe triggering the OOM killer from task reclaim context
> > > is broken because the chances that it happens increases as the amount
> > > of tasks inside reclaim increases - and that approach ignores efforts
> > > being done by kswapd, who is the main entity responsible for
> > > freeing pages.
> > >
> > > There have been a few problems pointed out by others (Andrea, Nick) on the
> > > last patch - this one solves them.
> >
> > I disagree about the design of killing anything from kswapd. kswapd is
> > an async helper like pdflush and it has no knowledge on the caller (it
> > cannot know if the caller is ok with the memory currently available in
> > the freelists, before triggering the oom).
> If zone_dma / zone_normal are below pages_min no caller is "OK with
> memory currently available" except GFP_ATOMIC/realtime callers.

If the GFP_DMA zone is filled, and nobody allocates with GFP_DMA,
nothing should be killed and everything should run fine, how can you
get this right from kswapd?

> > I'm just about to move the
> > oom killing away from vmscan.c to page_alloc.c which is basically the
> > opposite of moving the oom invocation from the task context to kswapd.
> > page_alloc.c in the task context is the only one who can know if
> > something has to be killed, vmscan.c cannot know. vmscan.c can only know
> > if something is still freeable, but if something isn't freeable it
> > doesn't mean that we've to kill anything
> Well Andrea, its not about "if something isnt freeable", its about
> "the VM is unable to make progress reclaiming pages".

"VM is unable to reclaim pages" == "nothing is freeable"

> > (for example if a task exited
> > or some dma or normal-zone or highmem memory was released by another
> > task while we were paging waiting for I/O).
> My last patch checks for pages_min before OOM killing, have you read it?

checking pages_min isn't correct anyways, the lowmem_reserve must taken
into account or you may not kill tasks when you should really kill

Plus you're checking for all zones, but kswapd cannot know that it
doesn't matter if the zone dma is under pages_min, as far as there's no

> > Every allocation is different and page_alloc.c is the only one who
> > knows what has to be done for every single allocation.
> OK, what do you propose? Its the third time I ask you this and got no
> concrete answer yet.

I want to move it to page_alloc.c (and up to the caller) and not in
kswapd, I mention this a few times.

> Sure, allocators should receive -ENOMEM whenever possible, but this
> is not the issue here.

it is the issue, because only the context of the task can choose if to
return -ENOMEM or to invoke the oom killer and try again.

> Triggering OOM killer on __alloc_pages() failure ?

yes, ideally I'd put the oom killer _outside_ alloc_pages, but just
moving it into alloc_pages should make things better than they are right
now in vmscan.c.

> Show us the code, please :)

I'm supposedly listening to a meeting right now, then I've a bad kernel
crash to debug with random mem corruption that I just managed to
reproduce deterministcally inside uml by emulating numa inside uml and
I'll be busy until next week at the very least. So I doubt I'll be able
to write any oom-related code until next week, sorry.
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at
Please read the FAQ at