Re: [PATCH 2/3] oom, oom_reaper: Try to reap tasks which skip regular OOM killer path

From: Michal Hocko
Date: Mon Apr 11 2016 - 08:02:45 EST


On Sat 09-04-16 13:39:30, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > On Fri 08-04-16 20:19:28, Tetsuo Handa wrote:
> > > I looked at next-20160408 but I again came to think that we should remove
> > > these shortcuts (something like a patch shown bottom).
> >
> > feel free to send the patch with the full description. But I would
> > really encourage you to check the history to learn why those have been
> > added and describe why those concerns are not valid/important anymore.
>
> I believe that past discussions and decisions about current code are too
> optimistic because they did not take 'The "too small to fail" memory-
> allocation rule' problem into account.

In most cases they were driven by _real_ use cases though. And that
is what matters. Theoretically possible issues which happen under
crazy workloads which are DoSing the machine already are not something
to optimize for. Sure, we should try to cope with them as gracefully
as possible, no questions about that, but we should try hard not to
reintroduce previous issues during _sensible_ workloads.

> If you ignore me with "check the history to learn why those have been added
> and describe why those concerns are not valid/important anymore", I can do
> nothing. What are valid/important concerns that have higher priority than
> keeping 'The "too small to fail" memory-allocation rule' problem and continue
> telling a lie to end users? Please enumerate such concerns.

I feel like we are going around in circles and I do not want to waste my
time repeating arguments which were already mentioned several times.
I have already told you that you have to justify potentially disruptive
changes properly. So far you are more focused on extreme cases while
you do not seem to care all that much about those which happen most of
the time. We surely do not want to regress there. If I am telling you
to study the history of our heuristics it is to _help_ you understand
why they have been introduced so that you can argue with the reasoning
and/or come up with improvements. Unless you start doing this, chances
are that your patches will not see an overly warm welcome.

> > Your way of throwing a large patch based on an extreme load which is
> > basically DoSing the machine is not the ideal one.
>
> You are not paying attention to real world's limitations I'm facing.

So far I haven't seen any _real_world_ example from you, to be honest.
All I can see is hammering the system with some DoS scenarios which
triggered different corner cases in the behavior. Those are good to make
us think about our limitations and about long-term solutions.

> I have to waste my resource trying to identify and fix on behalf of
> customers before they determine the kernel version to use for their
> systems, for your way of thinking is that "We don't need to worry about
> it because I have never received such report"

No, I am not saying that. I am saying that I have never seen a _properly_
configured system blow up in a way that would trigger the pathological
cases you are mentioning. And that is a big difference. You can
misconfigure your system in so many ways and put it on its knees without a
way out.

With all due respect I will not continue in this line of discussion
because it doesn't lead anywhere.
--
Michal Hocko
SUSE Labs