Re: [v8 0/4] cgroup-aware OOM killer

From: Shakeel Butt
Date: Sun Oct 01 2017 - 19:30:12 EST


>
> Going back to Michal's example, say the user configured the following:
>
> root
> / \
> A D
> / \
> B C
>
> A global OOM event happens and we find this:
> - A > D
> - B, C, D are oomgroups
>
> What the user is telling us is that B, C, and D are compound memory
> consumers. They cannot be divided into their task parts from a memory
> point of view.
>
> However, the user doesn't say the same for A: the A subtree summarizes
> and controls aggregate consumption of B and C, but without groupoom
> set on A, the user says that A is in fact divisible into independent
> memory consumers B and C.
>
> If we don't have to kill all of A, but we'd have to kill all of D,
> does it make sense to compare the two?
>

I think Tim has given very clear explanation why comparing A & D makes
perfect sense. However I think the above example, a single user system
where a user has designed and created the whole hierarchy and then
attaches different jobs/applications to different nodes in this
hierarchy, is also a valid scenario. One solution I can think of, to
cater both scenarios, is to introduce a notion of 'bypass oom' or not
include a memcg for oom comparision and instead include its children
in the comparison.

So, in the same above example:
root
/ \
A(b) D
/ \
B C

A is marked as bypass and thus B and C are to be compared to D. So,
for the single user scenario, all the internal nodes are marked
'bypass oom comparison' and oom_priority of the leaves has to be set
to the same value.

Below is the pseudo code of select_victim_memcg() based on this idea
and David's previous pseudo code. The calculation of size of a memcg
is still not very well baked here yet. I am working on it and I plan
to have a patch based on Roman's v9 "mm, oom: cgroup-aware OOM killer"
patch.


struct mem_cgroup *memcg = root_mem_cgroup;
struct mem_cgroup *selected_memcg = root_mem_cgroup;
struct mem_cgroup *low_memcg;
unsigned long low_priority;
unsigned long prev_badness = memcg_oom_badness(memcg); // Roman's code
LIST_HEAD(queue);

next_level:
low_memcg = NULL;
low_priority = ULONG_MAX;

next:
for_each_child_of_memcg(it, memcg) {
unsigned long prio = it->oom_priority;
unsigned long badness = 0;

if (it->bypass_oom && !it->oom_group &&
memcg_has_children(it)) {
list_add(&it->oom_queue, &queue);
continue;
}

if (prio > low_priority)
continue;

if (prio == low_priority) {
badness = mem_cgroup_usage(it); // for
simplicity, need more thinking
if (badness < prev_badness)
continue;
}

low_memcg = it;
low_priority = prio;
prev_badness = badness ?: mem_cgroup_usage(it); //
for simplicity
}
if (!list_empty(&queue)) {
memcg = list_last_entry(&queue, struct mem_cgroup, oom_queue);
list_del(&memcg->oom_queue);
goto next;
}
if (low_memcg) {
selected_memcg = memcg = low_memcg;
prev_badness = 0;
if (!low_memcg->oom_group)
goto next_level;
}
if (selected_memcg->oom_group)
oom_kill_memcg(selected_memcg);
else
oom_kill_process_from_memcg(selected_memcg);