Re: [v4 4/4] mm, oom, docs: describe the cgroup-aware OOM killer

From: David Rientjes
Date: Tue Aug 08 2017 - 19:24:39 EST


On Wed, 26 Jul 2017, Roman Gushchin wrote:

> +Cgroup-aware OOM Killer
> +~~~~~~~~~~~~~~~~~~~~~~~
> +
> +Cgroup v2 memory controller implements a cgroup-aware OOM killer.
> +It means that it treats memory cgroups as first class OOM entities.
> +
> +Under OOM conditions the memory controller tries to make the best
> +choise of a victim, hierarchically looking for the largest memory
> +consumer. By default, it will look for the biggest task in the
> +biggest leaf cgroup.
> +
> +Be default, all cgroups have oom_priority 0, and OOM killer will
> +chose the largest cgroup recursively on each level. For non-root
> +cgroups it's possible to change the oom_priority, and it will cause
> +the OOM killer to look athe the priority value first, and compare
> +sizes only of cgroups with equal priority.
> +
> +But a user can change this behavior by enabling the per-cgroup
> +oom_kill_all_tasks option. If set, it causes the OOM killer treat
> +the whole cgroup as an indivisible memory consumer. In case if it's
> +selected as on OOM victim, all belonging tasks will be killed.
> +
> +Tasks in the root cgroup are treated as independent memory consumers,
> +and are compared with other memory consumers (e.g. leaf cgroups).
> +The root cgroup doesn't support the oom_kill_all_tasks feature.
> +
> +This affects both system- and cgroup-wide OOMs. For a cgroup-wide OOM
> +the memory controller considers only cgroups belonging to the sub-tree
> +of the OOM'ing cgroup.
> +
> IO
> --

Thanks very much for following through with this.

As described in http://marc.info/?l=linux-kernel&m=149980660611610 this is
very similar to what we do for priority based oom killing.

I'm wondering your comments on extending it one step further, however:
include process priority as part of the selection rather than simply memcg
priority.

memory.oom_priority will dictate which memcg the kill will originate from,
but processes have no ability to specify that they should actually be
killed as opposed to a leaf memcg. I'm not sure how important this is for
your usecase, but we have found it useful to be able to specify process
priority as part of the decisionmaking.

At each level of consideration, we simply kill a process with lower
/proc/pid/oom_priority if there are no memcgs with a lower
memory.oom_priority. This allows us to define the exact process that will
be oom killed, absent oom_kill_all_tasks, and not require that the process
be attached to leaf memcg. Most notably these are processes that are best
effort: stats collection, logging, etc.

Do you think it would be helpful to introduce per-process oom priority as
well?