Re: can't oom-kill zap the victim's memory?

From: Michal Hocko
Date: Mon Oct 05 2015 - 10:44:13 EST


On Fri 02-10-15 15:01:06, Linus Torvalds wrote:
> On Fri, Oct 2, 2015 at 8:36 AM, Michal Hocko <mhocko@xxxxxxxxxx> wrote:
> >
> > Have they been reported/fixed? All kernel paths doing an allocation are
> > _supposed_ to check and handle ENOMEM. If they are not then they are
> > buggy and should be fixed.
>
> No. Stop this theoretical idiocy.
>
> We've tried it. I objected before people tried it, and it turns out
> that it was a horrible idea.
>
> Small kernel allocations should basically never fail, because we end
> up needing memory for random things, and if a kmalloc() fails it's
> because some application is using too much memory, and the application
> should be killed. Never should the kernel allocation fail. It really
> is that simple. If we are out of memory, that does not mean that we
> should start failing random kernel things.

But you do realize that killing a task as a memory reclaim technique is
not 100% reliable, right?

Any task might be blocked in an uninterruptible context (e.g. on a mutex)
waiting for a completion which itself depends on an allocation
succeeding. The page allocator (and the OOM killer) is not aware of
these dependencies and I am really skeptical it ever will be, because
dependency tracking is way too expensive. So killing a task doesn't
guarantee forward progress.

So I can see basically only a few ways out of this deadlock situation.
Either we face reality and allow small allocations (without
__GFP_NOFAIL) to fail after all attempts to reclaim memory have failed
(so even after the OOM killer hasn't made any progress).
Or we can start killing other tasks but this might end up in the same
state and the time to resolve the problem might be basically unbounded
(it is trivial to construct loads where hundreds of tasks are bashing
against a single i_mutex and all of them depending on an allocation...).
Or we can panic/reboot the system if the OOM situation cannot be solved
within a selected timeout.
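
To make the three options above concrete, here is a toy userspace model
of the decision taken when an allocation attempt sees no progress. This
is only a sketch, not kernel code: every toy_* name is made up for
illustration, and the nofail flag merely stands in for __GFP_NOFAIL.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* The three ways out listed above (hypothetical names). */
enum toy_policy {
	TOY_FAIL_ALLOC,		/* let small allocations fail */
	TOY_KILL_MORE,		/* keep selecting further OOM victims */
	TOY_PANIC_ON_TIMEOUT,	/* panic/reboot after a timeout */
};

enum toy_outcome { TOY_RETRY, TOY_RETURN_NULL, TOY_KILL_NEXT, TOY_PANIC };

struct toy_state {
	bool nofail;		/* stands in for __GFP_NOFAIL */
	bool reclaim_progress;	/* did reclaim free anything? */
	bool oom_kill_progress;	/* did the last OOM kill free anything? */
	unsigned int waited_ms;	/* time spent in the OOM situation */
	unsigned int timeout_ms;	/* the "selected timeout" */
};

/* Decide what to do after one allocation attempt. */
static enum toy_outcome toy_no_progress(enum toy_policy policy,
					const struct toy_state *s)
{
	assert(s != NULL);

	/* As long as something made progress, just try again. */
	if (s->reclaim_progress || s->oom_kill_progress)
		return TOY_RETRY;

	switch (policy) {
	case TOY_FAIL_ALLOC:
		/* Only callers that did not ask for "no fail" see NULL. */
		return s->nofail ? TOY_RETRY : TOY_RETURN_NULL;
	case TOY_KILL_MORE:
		/* May loop for an unbounded time, as noted above. */
		return TOY_KILL_NEXT;
	case TOY_PANIC_ON_TIMEOUT:
		return s->waited_ms >= s->timeout_ms ? TOY_PANIC : TOY_RETRY;
	}
	return TOY_RETRY;
}
```

Note that the model only bounds the time to resolution in the first and
the third case; the second one can keep returning TOY_KILL_NEXT forever.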

There are other ways to micro-optimize the current implementation by
playing with memory reserves but all that is just postponing the final
disaster and there is still a point of no further progress that we have
to deal with somehow.

> So this "people should check for allocation failures" is bullshit.
> It's a computer science myth. It's simply not true in all cases.

Sure it is not true in _all_ cases. If some paths cannot fail they can
use __GFP_NOFAIL for that purpose. The point is that most allocations
_can_ handle the failure. People are taught to check for allocation
failures. We even have scripts/coccinelle/null/kmerr.cocci which helps,
to some degree, to detect slab allocator users that do not check the
result.
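
For illustration, the check-and-unwind pattern I mean looks roughly like
the following. This is a userspace sketch, not kernel code: toy_alloc()
and toy_free() are hypothetical stand-ins for kmalloc()/kfree() with a
crude failure injection knob, and struct toy_ctx is made up.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/* Failure injection: the fail_at-th toy_alloc() call fails (0 = never). */
static int fail_at;
static int calls;
static int live;	/* outstanding allocations, to spot leaks */

static void *toy_alloc(size_t size)
{
	void *p;

	if (fail_at && ++calls == fail_at)
		return NULL;
	p = malloc(size);
	if (p)
		live++;
	return p;
}

static void toy_free(void *p)
{
	if (!p)
		return;
	assert(live > 0);
	live--;
	free(p);
}

struct toy_ctx {
	char *name;
	int *table;
};

/* Returns 0 on success or -ENOMEM, leaving nothing allocated on failure. */
static int toy_ctx_init(struct toy_ctx *ctx)
{
	ctx->name = toy_alloc(32);
	if (!ctx->name)
		return -ENOMEM;

	ctx->table = toy_alloc(16 * sizeof(*ctx->table));
	if (!ctx->table) {
		/* Unwind the first allocation before reporting failure. */
		toy_free(ctx->name);
		ctx->name = NULL;
		return -ENOMEM;
	}
	return 0;
}

static void toy_ctx_destroy(struct toy_ctx *ctx)
{
	toy_free(ctx->table);
	toy_free(ctx->name);
	ctx->table = NULL;
	ctx->name = NULL;
}
```

The caller gets -ENOMEM and can propagate it; no matter which of the two
allocations fails, nothing is leaked.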

> Kernel allocators that know that they do large allocations (ie bigger
> than a few pages) need to be able to handle the failure, but not the
> general case. Also, kernel allocators that know they have a good
> fallback (eg they try a large allocation first but can fall back to a
> smaller one) should use __GFP_NORETRY, but again, that does *not* in
> any way mean that general kernel allocations should randomly fail.
>
> So no. The answer is ABSOLUTELY NOT "everybody should check allocation
> failure". Get over it. I refuse to go through that circus again. It's
> stupid.
>
> Linus

--
Michal Hocko
SUSE Labs