Re: [v3 2/6] mm, oom: cgroup-aware OOM killer

From: David Rientjes
Date: Tue Jul 11 2017 - 16:56:49 EST


On Tue, 11 Jul 2017, Roman Gushchin wrote:

> > Yes, the original motivation was to limit killing to a single process, if
> > possible. To do that, we kill the process with the largest rss to free
> > the most memory and rely on the user to configure /proc/pid/oom_score_adj
> > if something else should be prioritized.
> >
> > With containerization and overcommit of system memory, we concur that
> > killing the single largest process isn't always preferable and neglects
> > the priority of its memcg. Your motivation seems to be to provide
> > fairness between one memcg with a large process and one memcg with a large
> > number of small processes; I'm curious if you are concerned about the
> > priority of a memcg hierarchy (how important that "job" is) or whether you
> > are strictly concerned with "largeness" of memcgs relative to each other.
>
> I'm pretty sure we should provide some way to prioritize some cgroups
> over others (in terms of oom killer preferences), but I'm not 100% sure yet
> what's the best way to do it. I've suggested something similar to the existing
> oom_score_adj for tasks, mostly to follow the existing design.
>
> One of the questions to answer in a priority-based model is
> how to compare tasks in the root cgroup with cgroups?
>

We do this with an alternate scoring mechanism that is purely priority
based and tiebreaks on largest rss. An additional tunable is added for
each process, under /proc/pid, and also to the memcg hierarchy, and the
mechanism is enabled via a system-wide sysctl. A way to mesh the two
scoring mechanisms together would be helpful, but for our purposes we
don't use oom_score_adj at all, other than treating OOM_SCORE_ADJ_MIN as
oom disabled when it is written by third party apps.
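
As a rough illustration of that scoring, here is a minimal user-space
sketch in C. It is not the actual patch: struct task_model, the
oom_priority field and select_victim() are hypothetical names, and the
choice that a numerically lower priority is killed first is taken from the
"lowest priority" wording in this thread. Only OOM_SCORE_ADJ_MIN (-1000)
is an existing kernel constant.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define OOM_SCORE_ADJ_MIN (-1000)	/* existing kernel value: task is oom disabled */

/* Hypothetical user-space model of a task, not a kernel structure. */
struct task_model {
	int pid;
	unsigned long oom_priority;	/* assumed tunable; lower is killed first */
	unsigned long rss_pages;
	int oom_score_adj;
};

/* oom_score_adj is ignored except for the "oom disabled" value. */
static bool oom_disabled(const struct task_model *t)
{
	return t->oom_score_adj == OOM_SCORE_ADJ_MIN;
}

/* True if a is a better victim than b: lower priority, then larger rss. */
static bool prefer_victim(const struct task_model *a,
			  const struct task_model *b)
{
	if (a->oom_priority != b->oom_priority)
		return a->oom_priority < b->oom_priority;
	return a->rss_pages > b->rss_pages;
}

/* Returns the chosen victim, or NULL if every task is oom disabled. */
static const struct task_model *select_victim(const struct task_model *tasks,
					      size_t n)
{
	const struct task_model *victim = NULL;
	size_t i;

	assert(tasks != NULL || n == 0);
	for (i = 0; i < n; i++) {
		if (oom_disabled(&tasks[i]))
			continue;
		if (!victim || prefer_victim(&tasks[i], victim))
			victim = &tasks[i];
	}
	return victim;
}
```

In this sketch rss only matters between tasks of equal priority, so a
small low-priority task is chosen over a large high-priority one.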

For memcg oom conditions, iteration of the hierarchy begins at the oom
memcg. For system oom conditions, this is the root memcg.

Each process attached to the oom memcg has its own priority value, and
this is compared to the priority values of all child memcgs at that level.
If a process has the lowest priority, it is killed and we're done; we
could implement a "kill all" mechanism for this memcg that is checked
before the process is killed.

If a memcg has the lowest priority compared to attached processes, it is
iterated as well, and so on throughout the memcg hierarchy until we find
the lowest priority process in the lowest priority leaf memcg. This way,
we can fully control which process is killed for both system and memcg oom
conditions. I can easily post patches for this; we have used it for
years.
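
The walk described above could be sketched in user-space C roughly as
follows. This is only a model under stated assumptions, not the patches
themselves: struct proc_model, struct memcg_model and select_oom_victim()
are hypothetical, a tie between a process and a child memcg is resolved in
favor of killing the process (the description above does not specify
this), and the sketch does not backtrack if the chosen subtree turns out
to have no processes.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical models; lower prio means "kill first". */
struct proc_model {
	int pid;
	unsigned long prio;
	unsigned long rss;
};

struct memcg_model {
	unsigned long prio;
	const struct proc_model *procs;
	size_t nr_procs;
	const struct memcg_model *const *children;
	size_t nr_children;
};

/* Lowest-priority attached process, tiebreak on largest rss. */
static const struct proc_model *lowest_proc(const struct memcg_model *cg)
{
	const struct proc_model *best = NULL;
	size_t i;

	for (i = 0; i < cg->nr_procs; i++) {
		const struct proc_model *p = &cg->procs[i];

		if (!best || p->prio < best->prio ||
		    (p->prio == best->prio && p->rss > best->rss))
			best = p;
	}
	return best;
}

/* Lowest-priority child memcg; the first one wins a tie. */
static const struct memcg_model *lowest_child(const struct memcg_model *cg)
{
	const struct memcg_model *best = NULL;
	size_t i;

	for (i = 0; i < cg->nr_children; i++) {
		if (!best || cg->children[i]->prio < best->prio)
			best = cg->children[i];
	}
	return best;
}

/* Start at the oom memcg (the root memcg for a system oom). */
static const struct proc_model *
select_oom_victim(const struct memcg_model *oom_memcg)
{
	const struct memcg_model *cg = oom_memcg;

	assert(cg != NULL);
	for (;;) {
		const struct proc_model *p = lowest_proc(cg);
		const struct memcg_model *child = lowest_child(cg);

		/* Descend only if a child has strictly lower priority. */
		if (!child || (p && p->prio <= child->prio))
			return p;
		cg = child;
	}
}
```

Note that a process in a higher-priority sibling memcg is never
considered, even if its own priority value is the lowest on the system;
the decision at each level is made before descending.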

> > These are two different things, right? We can adjust how the system oom
> > killer chooses victims when memcg hierarchies overcommit the system to not
> > strictly prefer the single process with the largest rss without killing
> > everything attached to the memcg.
>
> They are different, and I thought about providing two independent knobs.
> But after all I haven't found enough real life examples, where it can be useful.
> Can you provide something here?
>

Yes, we have users to whom we chown a memcg hierarchy and who have full
control over setting up that hierarchy however they want. Our "Activity
Manager", using Documentation/cgroup-v1/memory.txt terminology, is only
aware of the top level memcg that was chown'd to the user. That user runs
a series of batch jobs that are submitted to it and each job is
represented as a subcontainer to enforce strict limits on the amount of
memory that job can use. When a job becomes oom, we have found that it is
preferable to oom kill the entire batch job rather than leave it in an
inconsistent state, so enabling such a knob here would be helpful.

Other top-level jobs are fine with individual processes being oom killed.
It can be a low priority process for which they have full control over
defining the priority through the new per-process and per-memcg value
described above. An easy example is scraping logs periodically or other
best-effort tasks like cleanup. They can happily be oom killed and
rescheduled without taking down the entire first-class job.
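
The difference between the two kinds of jobs can be sketched as a
per-memcg flag that is checked once a victim has been chosen. This is a
hypothetical user-space model only: struct job_memcg, its kill_all field
and build_kill_list() are invented for illustration, and the sketch only
covers tasks attached directly to the memcg, not its descendants.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of a job's memcg. */
struct job_memcg {
	bool kill_all;		/* assumed per-memcg "kill all" knob */
	const int *pids;	/* tasks attached to this memcg */
	size_t nr_pids;
};

/*
 * Fill out[] with the pids to kill once victim_pid has been selected
 * from cg. Returns the number of entries written, at most cap.
 */
static size_t build_kill_list(const struct job_memcg *cg, int victim_pid,
			      int *out, size_t cap)
{
	size_t n = 0, i;

	assert(cg != NULL && out != NULL);
	if (!cg->kill_all) {
		/* Best-effort style job: only the selected task dies. */
		if (cap > 0)
			out[n++] = victim_pid;
		return n;
	}
	/* Batch style job: take down every attached task together. */
	for (i = 0; i < cg->nr_pids && n < cap; i++)
		out[n++] = cg->pids[i];
	return n;
}
```

A batch job's subcontainer would set the flag, while a top-level job that
contains best-effort tasks would leave it clear.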

> Also, they are different only for non-leaf cgroups; leaf cgroups
> are always treated as indivisible memory consumers during victim selection.
>
> I assume, that containerized systems will always set oom_kill_all_tasks for
> top-level container memory cgroups. By default it's turned off
> to provide backward compatibility with current behavior and avoid
> excessive kills and support oom_score_adj==-1000 (I've added this to v4,
> will post soon).
>

We certainly would not be enabling it for top-level memcgs; we could not,
because they contain best-effort processes. We would, however, like to
enable it for small batch jobs that are run on behalf of a user in their
own subcontainer. We have had this usecase for ~3 years, solely because of
the problem that you pointed out earlier: it is often much more reliable
for the kernel to oom kill multiple processes than for userspace to do
so.

> > In our methodology, each memcg is assigned a priority value and the
> > iteration of the hierarchy simply compares and visits the memcg with the
> > lowest priority at each level and then selects the largest process to
> > kill. This could also support a "kill-all" knob.
> >
> > 	struct mem_cgroup *memcg = root_mem_cgroup;
> > 	struct mem_cgroup *child, *low_memcg;
> > 	unsigned long low_priority;
> >
> > next:
> > 	low_memcg = NULL;
> > 	low_priority = ULONG_MAX;
> > 	/* find the lowest priority child at this level */
> > 	for_each_child_of_memcg(child, memcg) {
> > 		unsigned long prio = memcg_oom_priority(child);
> >
> > 		if (prio < low_priority) {
> > 			low_memcg = child;
> > 			low_priority = prio;
> > 		}
> > 	}
> > 	if (low_memcg) {
> > 		/* descend into that child and repeat */
> > 		memcg = low_memcg;
> > 		goto next;
> > 	}
> > 	oom_kill_process_from_memcg(memcg);
> >
> > So this is a priority based model that is different than your aggregate
> > usage model but I think it allows userspace to define a more powerful
> > policy. We certainly may want to kill from a memcg with a single large
> > process, or we may want to kill from a memcg with several small processes,
> > it depends on the importance of that job.
>
> I believe, that both models have some advantages.
> Priority-based model is more powerful, but requires support from the userspace
> to set up these priorities (and, probably, adjust them dynamically).

It's a no-op if nobody sets up priorities or the system-wide sysctl is
disabled. Presumably, as in our model, the Activity Manager sets the
sysctl and is responsible for configuring the priorities if present. All
memcgs at the sibling level or subcontainer level keep the default
priority unless the chown'd user defines one, so this falls back to an rss
model for backwards compatibility.
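
The fallback can be illustrated with a short user-space sketch in C. The
names are hypothetical (struct candidate, pick_victim(), and a default
priority of 0 are assumptions made for the example, not values from the
patches): when the sysctl is off, or every candidate still has the default
priority, the priority comparison never distinguishes anything and the rss
tiebreak decides alone.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define DEFAULT_OOM_PRIORITY 0UL	/* hypothetical default value */

/* Hypothetical candidate: a process or memcg at one level. */
struct candidate {
	int id;
	unsigned long prio;
	unsigned long rss;
};

/* With the sysctl disabled, everything is treated as default priority. */
static unsigned long effective_prio(const struct candidate *c,
				    bool sysctl_enabled)
{
	return sysctl_enabled ? c->prio : DEFAULT_OOM_PRIORITY;
}

/* Lowest effective priority wins; largest rss breaks ties. */
static const struct candidate *pick_victim(const struct candidate *c,
					   size_t n, bool sysctl_enabled)
{
	const struct candidate *best = NULL;
	size_t i;

	assert(c != NULL || n == 0);
	for (i = 0; i < n; i++) {
		unsigned long p, bp;

		if (!best) {
			best = &c[i];
			continue;
		}
		p = effective_prio(&c[i], sysctl_enabled);
		bp = effective_prio(best, sysctl_enabled);
		if (p < bp || (p == bp && c[i].rss > best->rss))
			best = &c[i];
	}
	return best;
}
```

With all priorities at the default, the result is the largest rss whether
or not the sysctl is set; priorities only change the outcome once someone
has both enabled the sysctl and configured them.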