Re: [v11 3/6] mm, oom: cgroup-aware OOM killer

From: David Rientjes
Date: Mon Oct 09 2017 - 17:53:02 EST


On Thu, 5 Oct 2017, Roman Gushchin wrote:

> Traditionally, the OOM killer is operating on a process level.
> Under oom conditions, it finds a process with the highest oom score
> and kills it.
>
> This behavior doesn't suit well the system with many running
> containers:
>
> 1) There is no fairness between containers. A small container with
> few large processes will be chosen over a large one with huge
> number of small processes.
>
> 2) Containers often do not expect that some random process inside
> will be killed. In many cases much safer behavior is to kill
> all tasks in the container. Traditionally, this was implemented
> in userspace, but doing it in the kernel has some advantages,
> especially in a case of a system-wide OOM.
>

I'd move the second point to the changelog for the next patch since this
patch doesn't implement any support for memory.oom_group.

> To address these issues, the cgroup-aware OOM killer is introduced.
>
> Under OOM conditions, it looks for the biggest leaf memory cgroup
> and kills the biggest task belonging to it. The following patches
> will extend this functionality to consider non-leaf memory cgroups
> as well, and also provide an ability to kill all tasks belonging
> to the victim cgroup.
>
> The root cgroup is treated as a leaf memory cgroup, so it's score
> is compared with leaf memory cgroups.
> Due to memcg statistics implementation a special algorithm
> is used for estimating it's oom_score: we define it as maximum
> oom_score of the belonging tasks.
>

This seems to unfairly bias the root mem cgroup depending on process size.
It isn't treated fairly as a leaf mem cgroup if they are being compared
based on different criteria: the root mem cgroup as (mostly) the largest
rss of a single process vs leaf mem cgroups as all anon, unevictable, and
unreclaimable slab pages charged to it by all processes.

I imagine a configuration where the root mem cgroup has 100 processes
attached each with rss of 80MB, compared to a leaf cgroup with 100
processes of 1MB rss each. How does this logic prevent repeatedly oom
killing the processes of 1MB rss?

In this case, "the root cgroup is treated as a leaf memory cgroup" isn't
quite fair, it can simply hide large processes from being selected. Users
who configure cgroups in a unified hierarchy for other resource
constraints are penalized for this choice even though the mem cgroup with
100 processes of 1MB rss each may not be limited itself.

I think for this comparison to be fair, it requires accounting for the
root mem cgroup itself or for a different accounting methodology for leaf
memory cgroups.

> +/*
> + * Checks if the given memcg is a valid OOM victim and returns a number,
> + * which means the folowing:
> + * -1: there are inflight OOM victim tasks, belonging to the memcg
> + * 0: memcg is not eligible, e.g. all belonging tasks are protected
> + * by oom_score_adj set to OOM_SCORE_ADJ_MIN
> + * >0: memcg is eligible, and the returned value is an estimation
> + * of the memory footprint
> + */
> +static long oom_evaluate_memcg(struct mem_cgroup *memcg,
> + const nodemask_t *nodemask,
> + unsigned long totalpages)
> +{
> + struct css_task_iter it;
> + struct task_struct *task;
> + int eligible = 0;
> +
> + /*
> + * Memcg is OOM eligible if there are OOM killable tasks inside.
> + *
> + * We treat tasks with oom_score_adj set to OOM_SCORE_ADJ_MIN
> + * as unkillable.
> + *
> + * If there are inflight OOM victim tasks inside the memcg,
> + * we return -1.
> + */
> + css_task_iter_start(&memcg->css, 0, &it);
> + while ((task = css_task_iter_next(&it))) {
> + if (!eligible &&
> + task->signal->oom_score_adj != OOM_SCORE_ADJ_MIN)
> + eligible = 1;
> +
> + if (tsk_is_oom_victim(task) &&
> + !test_bit(MMF_OOM_SKIP, &task->signal->oom_mm->flags)) {
> + eligible = -1;
> + break;
> + }
> + }
> + css_task_iter_end(&it);
> +
> + if (eligible <= 0)
> + return eligible;
> +
> + return memcg_oom_badness(memcg, nodemask, totalpages);
> +}
> +
> +static void select_victim_memcg(struct mem_cgroup *root, struct oom_control *oc)
> +{
> + struct mem_cgroup *iter;
> +
> + oc->chosen_memcg = NULL;
> + oc->chosen_points = 0;
> +
> + /*
> + * The oom_score is calculated for leaf memory cgroups (including
> + * the root memcg).
> + */
> + rcu_read_lock();
> + for_each_mem_cgroup_tree(iter, root) {
> + long score;
> +
> + if (memcg_has_children(iter) && iter != root_mem_cgroup)
> + continue;

I'll reiterate what I did on the last version of the patchset: considering
only leaf memory cgroups easily allows users to defeat this heuristic and
bias against all of their memory usage up to the largest process size
amongst the set of processes attached. If the user creates N child mem
cgroups for their N processes and attaches one process to each child, the
_only_ thing this achieved is to defeat your heuristic and prefer other
leaf cgroups simply because those other leaf cgroups did not do this.

Effectively:

for i in $(cat cgroup.procs); do mkdir $i; echo $i > $i/cgroup.procs; done

will radically shift the heuristic from a score of all anonymous +
unevictable memory for all processes to a score of the largest anonymous +
unevictable memory for a single process. There is no downside or
ramifaction for the end user in doing this. When comparing cgroups based
on usage, it only makes sense to compare the hierarchical usage of that
cgroup so that attaching processes to descendants or splitting the
implementation of a process into several smaller individual processes does
not allow this heuristic to be defeated.

> +
> + score = oom_evaluate_memcg(iter, oc->nodemask, oc->totalpages);
> +
> + /*
> + * Ignore empty and non-eligible memory cgroups.
> + */
> + if (score == 0)
> + continue;
> +
> + /*
> + * If there are inflight OOM victims, we don't need
> + * to look further for new victims.
> + */
> + if (score == -1) {
> + oc->chosen_memcg = INFLIGHT_VICTIM;
> + mem_cgroup_iter_break(root, iter);
> + break;
> + }
> +
> + if (score > oc->chosen_points) {
> + oc->chosen_points = score;
> + oc->chosen_memcg = iter;
> + }

I'll reiterate what I did on previous versions of this patchset: this
effectively removes all control the user has from influencing oom victim
selection. Victim selection is very important, the user must be able to
influence that decision to prevent the loss of important work when the
system is out of memory.

This heuristic only uses user infleunce by considering whether a memory
cgroup is eligible depending on whether all processes have
/proc/pid/oom_score_adj == -1000 or not. It means a user must oom disable
all processes attached to an important memory cgroup that has not reached
its limit to prevent it from being oom killed with this heuristic. It
simply has no other choice. It cannot differentiate between two memory
cgroups where one is expected to have much higher memory usage, and should
be protected because of end user goals.

> + }
> +
> + if (oc->chosen_memcg && oc->chosen_memcg != INFLIGHT_VICTIM)
> + css_get(&oc->chosen_memcg->css);
> +
> + rcu_read_unlock();
> +}
> +
> +bool mem_cgroup_select_oom_victim(struct oom_control *oc)
> +{
> + struct mem_cgroup *root;
> +
> + if (mem_cgroup_disabled())
> + return false;
> +
> + if (!cgroup_subsys_on_dfl(memory_cgrp_subsys))
> + return false;
> +
> + if (oc->memcg)
> + root = oc->memcg;
> + else
> + root = root_mem_cgroup;
> +
> + select_victim_memcg(root, oc);
> +
> + return oc->chosen_memcg;
> +}
> +
> /*
> * Reclaims as many pages from the given memcg as possible.
> *
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index ccdb7d34cd13..20e62ec32ba8 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -987,6 +987,27 @@ static void oom_kill_process(struct oom_control *oc, const char *message)
> __oom_kill_process(victim);
> }
>
> +static bool oom_kill_memcg_victim(struct oom_control *oc)
> +{
> +
> + if (oc->chosen_memcg == NULL || oc->chosen_memcg == INFLIGHT_VICTIM)
> + return oc->chosen_memcg;
> +
> + /* Kill a task in the chosen memcg with the biggest memory footprint */
> + oc->chosen_points = 0;
> + oc->chosen_task = NULL;
> + mem_cgroup_scan_tasks(oc->chosen_memcg, oom_evaluate_task, oc);
> +
> + if (oc->chosen_task == NULL || oc->chosen_task == INFLIGHT_VICTIM)
> + goto out;
> +
> + __oom_kill_process(oc->chosen_task);
> +
> +out:
> + mem_cgroup_put(oc->chosen_memcg);
> + return oc->chosen_task;
> +}
> +
> /*
> * Determines whether the kernel must panic because of the panic_on_oom sysctl.
> */
> @@ -1083,27 +1105,37 @@ bool out_of_memory(struct oom_control *oc)
> current->mm && !oom_unkillable_task(current, NULL, oc->nodemask) &&
> current->signal->oom_score_adj != OOM_SCORE_ADJ_MIN) {
> get_task_struct(current);
> - oc->chosen = current;
> + oc->chosen_task = current;
> oom_kill_process(oc, "Out of memory (oom_kill_allocating_task)");
> return true;
> }
>
> + if (mem_cgroup_select_oom_victim(oc) && oom_kill_memcg_victim(oc)) {
> + delay = true;
> + goto out;
> + }
> +
> select_bad_process(oc);

This is racy because mem_cgroup_select_oom_victim() found an eligible
oc->chosen_memcg that is not INFLIGHT_VICTIM with at least one eligible
process but mem_cgroup_scan_task(oc->chosen_memcg) did not. It means if a
process cannot be killed because of oom_unkillable_task(), the only
eligible processes moved or exited, or the /proc/pid/oom_score_adj of the
eligible processes changed, we end up falling back to the complete
tasklist scan. It would be better for oom_evaluate_memcg() to consider
oom_unkillable_task() and also retry in the case where
oom_kill_memcg_victim() returns NULL.