Re: [RFC PATCH] mm, oom: cgroup-aware OOM-killer

From: Johannes Weiner
Date: Thu May 18 2017 - 15:23:00 EST


On Fri, May 19, 2017 at 04:37:27AM +1000, Balbir Singh wrote:
> On Fri, May 19, 2017 at 3:30 AM, Michal Hocko <mhocko@xxxxxxxxxx> wrote:
> > On Thu 18-05-17 17:28:04, Roman Gushchin wrote:
> >> Traditionally, the OOM killer is operating on a process level.
> >> Under oom conditions, it finds a process with the highest oom score
> >> and kills it.
> >>
> >> This behavior doesn't suit well the system with many running
> >> containers. There are two main issues:
> >>
> >> 1) There is no fairness between containers. A small container with
> >> a few large processes will be chosen over a large one with huge
> >> number of small processes.
> >>
> >> 2) Containers often do not expect that some random process inside
> >> will be killed. So, in general, a much safer behavior is
> >> to kill the whole cgroup. Traditionally, this was implemented
> >> in userspace, but doing it in the kernel has some advantages,
> >> especially in a case of a system-wide OOM.
> >>
> >> To address these issues, cgroup-aware OOM killer is introduced.
> >> Under OOM conditions, it looks for a memcg with highest oom score,
> >> and kills all processes inside.
> >>
> >> Memcg oom score is calculated as a size of active and inactive
> >> anon LRU lists, unevictable LRU list and swap size.
> >>
> >> For a cgroup-wide OOM, only cgroups belonging to the subtree of
> >> the OOMing cgroup are considered.
> >
> > While this might make sense for some workloads/setups it is not a
> > generally acceptable policy IMHO. We have discussed that different OOM
> > policies might be interesting few years back at LSFMM but there was no
> > real consensus on how to do that. One possibility was to allow bpf like
> > mechanisms. Could you explore that path?
>
> I agree, I think it needs more thought. I wonder if the real issue is something
> else. For example
>
> 1. Did we overcommit a particular container too much?
> 2. Do we need something like https://lwn.net/Articles/604212/ to solve
> the problem?

The occasional OOM kill is an unavoidable reality on our systems (and
I bet on most deployments). If we tried not to overcommit, we'd waste
a *lot* of memory.

The problem is when OOM happens, we really want the biggest *job* to
get killed. Before cgroups, we assumed jobs were processes. But with
cgroups, the user is able to define a group of processes as a job, and
then an individual process is no longer a first-class memory consumer.

Without a patch like this, the OOM killer will compare the sizes of
the random subparticles that the jobs in the system are composed of
and kill the single biggest particle, leaving behind the incoherent
remains of one of the jobs. That doesn't make a whole lot of sense.

If you want to determine the most expensive car in a parking lot, you
can't go off and compare the price of one car's muffler with the door
handle of another, then point to a wind shield and yell "This is it!"

You need to compare the cars as a whole with each other.

> 3. We have oom notifiers now, could those be used (assuming you are interested
> in non memcg related OOM's affecting a container

Right now, we watch for OOM notifications and then have userspace kill
the rest of a job. That works - somewhat. What remains is the problem
that I described above, that comparing individual process sizes is not
meaningful when the terminal memory consumer is a cgroup.

> 4. How do we determine limits for these containers? From a fariness
> perspective

How do you mean?