Re: [v7 5/5] mm, oom: cgroup v2 mount option to disable cgroup-aware OOM killer

From: Michal Hocko
Date: Wed Sep 06 2017 - 04:29:10 EST


On Tue 05-09-17 17:53:44, Johannes Weiner wrote:
> On Tue, Sep 05, 2017 at 03:44:12PM +0200, Michal Hocko wrote:
> > Why is this an opt out rather than opt-in? IMHO the original oom logic
> > should be preserved by default and specific workloads should opt in for
> > the cgroup aware logic. Changing the global behavior depending on
> > whether cgroup v2 interface is in use is more than unexpected and IMHO
> > wrong approach to take. I think we should instead go with
> > oom_strategy=[alloc_task,biggest_task,cgroup]
> >
> > we currently have alloc_task (via sysctl_oom_kill_allocating_task) and
> > biggest_task which is the default. You are adding cgroup and the more I
> > think about the more I agree that it doesn't really make sense to try to
> > fit thew new semantic into the existing one (compare tasks to kill-all
> > memcgs). Just introduce a new strategy and define a new semantic from
> > scratch. Memcg priority and kill-all are a natural extension of this new
> > strategy. This will make the life easier and easier to understand by
> > users.
>
> oom_kill_allocating_task is actually a really good example of why
> cgroup-awareness *should* be the new default.
>
> Before we had the oom killer victim selection, we simply killed the
> faulting/allocating task. While a valid answer to the problem, it's
> not very fair or representative of what the user wants or intends.
>
> Then we added code to kill the biggest offender instead, which should
> have been the case from the start and was hence made the new default.
> The oom_kill_allocating_task was added on the off-chance that there
> might be setups who, for historical reasons, rely on the old behavior.
> But our default was chosen based on what behavior is fair, expected,
> and most reflective of the user's intentions.

I am not sure this is how things evolved actually. This is way before
my time so my git log interpretation might be imprecise. We do have
oom_badness heuristic since out_of_memory has been introduced and
oom_kill_allocating_task has been introduced much later because of large
boxes with zillions of tasks (SGI I suspect) which took too long to
select a victim so David has added this heuristic.

> The cgroup-awareness in the OOM killer is exactly the same thing. It
> should have been the default from the beginning, because the user
> configures a group of tasks to be an interdependent, terminal unit of
> memory consumption, and it's undesirable for the OOM killer to ignore
> this intention and compare members across these boundaries.

I would agree if that was true in general. I can completely see how the
cgroup awareness is useful in e.g. containerized environments (especially
with kill-all enabled) but memcgs are used in a large variety of
usecases and I cannot really say all of them really demand the new
semantic. Say I have a workload which doesn't want to see reclaim
interference from others on the same machine. Why should I kill a
process from that particular memcg just because it is the largest one
when there is a memory hog/leak outside of this memcg?

>From my point of view the safest (in a sense of the least surprise)
way to go with opt-in for the new heuristic. I am pretty sure all who
would benefit from the new behavior will enable it while others will not
regress in unexpected way.

We can talk about the way _how_ to control these oom strategies, of
course. But I would be really reluctant to change the default which is
used for years and people got used to it.
--
Michal Hocko
SUSE Labs