Re: [PATCH 2/3] oom, oom_reaper: Try to reap tasks which skip regular OOM killer path

From: Tetsuo Handa
Date: Mon Apr 11 2016 - 09:26:35 EST


Michal Hocko wrote:
> On Sat 09-04-16 13:39:30, Tetsuo Handa wrote:
> > Michal Hocko wrote:
> > > On Fri 08-04-16 20:19:28, Tetsuo Handa wrote:
> > > > I looked at next-20160408 but I again came to think that we should remove
> > > > these shortcuts (something like a patch shown bottom).
> > >
> > > feel free to send the patch with the full description. But I would
> > > really encourage you to check the history to learn why those have been
> > > added and describe why those concerns are not valid/important anymore.
> >
> > I believe that past discussions and decisions about current code are too
> > optimistic because they did not take 'The "too small to fail" memory-
> > allocation rule' problem into account.
>
> In most cases they were driven by _real_ usecases though. And that
> is what matters. Theoretically possible issues which happen under
> crazy workloads which are DoSing the machine already are not something
> to optimize for. Sure we should try to cope with them as gracefully
> as possible, no questions about that, but we should try hard not to
> reintroduce previous issues during _sensible_ workloads.

I'm not asking you to optimize for crazy workloads. None of my
customers intentionally run crazy workloads, but they are getting silent
hangups, and I suspect that something went wrong in memory management.
But there is no evidence, because the memory management subsystem remains
silent. You call my testcases DoS, but there is no evidence that my
customers are not hitting the same problem my testcases found.

I'm suggesting that you at least emit diagnostic messages when something
goes wrong. That is what kmallocwd is for. And if you do not want to emit
diagnostic messages, I'm fine with a timeout-based approach.