Re: [PATCH] fix spurious OOM kills

From: Andrea Arcangeli
Date: Thu Nov 11 2004 - 16:48:57 EST


On Thu, Nov 11, 2004 at 11:56:14AM -0200, Marcelo Tosatti wrote:
> It does get it right. It only triggers OOM killer if _both_
> GFP_DMA / GFP_KERNEL are full _and_ there is a task failing
> to allocate/free memory.

you should check for highmem too before killing the task.

> I think you missed the "task_looping_oom" variable in the patch, which is
> set as soon as a task is unable to allocate/free memory. This variable
> is set where the code used to call the OOM killer.

why do you use a global variable to pass a message from the task context
to kswapd, when you can do the real thing in the task context since the
first place?

> But still, the main idea is valid here.

Putting the oom killer in kswapd is just flawed. I recall somebody
already put it in kswapd in the old days, but that can't work right in
all scenarios.

The single fact you had to implement the task_looping_oom variable,
means you also felt the need to send some bit of info to kswapd from the
task context, then why don't you do the real thing from the task
context where you've a load of additional information to know exactly
what to do in the first place instead of adding global variables?

> I'll say this again just in case: If ZONE_DMA and ZONE_NORMAL reclaiming
> efforts are in vain, and there is task which is looping on try_to_free_pages()
> unable to succeed, _and_ both DMA/normal are below pages_min, the OOM
> killer will be triggered.

the oom killer must be triggered even if only ZONE_DMA is under page_min
if somebody allocates with __GFP_NOFAIL|GFP_DMA. Those special cases
don't make much sense to me. The only one that can know what to do is
the task context, never kswapd. And I don't see the point of using the
task_looping_oom variable when you can invoke the oom killer from
page_alloc _after_ checking the watermarks again with lowmem_reserve
into the equation (something that sure kswapd can't do, unless you pass
more bits of info than just a 0 or 1 via task_looping_oom).

> (should be pages_min + higher protection).

higher protection requires you to define the classzone where the task is
allocating from, only the task context knows it.

> > Plus you're checking for all zones, but kswapd cannot know that it
> > doesn't matter if the zone dma is under pages_min, as far as there's no
> > GFP_DMA.
>
> You mean boxes with no DMA zone?

I mean GFP_DMA as an allocation from the DMA classzone. If nobody
allocates ram passing GFP_DMA to alloc pages, nobody should worry or
notice if the GFP_DMA is under pages_min, because no allocation risk to
fail.

The oom killing must be strictly connected with a classzone allocation
failing (the single zone doesn't matter) and if nobody uses GFP_DMA it
doesn't matter if the dma zone is under pages_min.

2.4 gets all of this perfectly right and for sure it's not even remotely
attempting at killing anything from kswapd (and infact there's not a
single bugreport outstanding). all it can happen in 2.4 is some wasted
cpu while kswapd tries to do the paging, but exactly because kswapd is
purerly an helper (like 2.6 right now too and I don't want to change
this bit, since kswapd in 2.6 looks sane, much saner than the task
context itself which is the real problematic part that needs fixing),
nothing risk to go wrong (you can only argue about performance issues
when the dma zone gets pinned and under watermark[].min ;).

> If the task chooses to return -ENOMEM it wont set "task_looping_oom"
> flag.
>
> OK - you are right to say that "only the context of the task can choose
> to return -ENOMEM or invoke the oom killer".
>
> This allocator-context-only information is passed to kswapd via
> "task_looping_oom".

what's the point of passing info to kswapd? why don't you schedule a
callback instead, why don't you use keventd instead of kswapd? I just
don't get what benefit you get from that except form complexity and
overhead. And to do it right right like this you'd need to pass more
than 1 bit of info.

I mean, one could even change the code to send the whole task
information to an userspace daemon, that will open a device driver and
issue an ioctl that eventually calls the oom killer on a certain pid, in
order to invoke the oom killer. I just don't see why to waste any effort
with non trivial message passing when the oom killer itself can be
invoked by page_alloc.c where all the watermark checks are already
implemented, without passing down information to some random kernel
daemon.

> Good luck!

thanks! ;)
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/