Re: [v7 2/5] mm, oom: cgroup-aware OOM killer

From: Roman Gushchin
Date: Wed Sep 06 2017 - 08:58:49 EST


On Wed, Sep 06, 2017 at 10:31:58AM +0200, Michal Hocko wrote:
> On Tue 05-09-17 21:23:57, Roman Gushchin wrote:
> > On Tue, Sep 05, 2017 at 04:57:00PM +0200, Michal Hocko wrote:
> [...]
> > > Hmm. The changelog says "By default, it will look for the biggest leaf
> > > cgroup, and kill the largest task inside." But you are accumulating
> > > oom_score up the hierarchy and so parents will have higher score than
> > > the layer of their children and the larger the sub-hierarchy the more
> > > biased it will become. Say you have
> > > root
> > > /\
> > > / \
> > > A D
> > > / \
> > > B C
> > >
> > > B (5), C(15) thus A(20) and D(20). Unless I am missing something we are
> > > going to go down A path and then chose C even though D is the largest
> > > leaf group, right?
> >
> > You're right, changelog is not accurate, I'll fix it.
> > The behavior is correct, IMO.
>
> Please explain why. This is really a non-intuitive semantic. Why should
> larger hierarchies be punished more than shallow ones? I would
> completely agree if the whole hierarchy would be a killable entity (aka
> A would be kill-all).

I think it's a reasonable and clear policy: we're looking for a memcg
with the smallest oom_priority and largest memory footprint recursively.
Then we reclaim some memory from it (by killing the biggest process
or all processes, depending on memcg preferences).

In general, if there are two memcgs of equal importance (which is defined
by setting the oom_priority), we're choosing the largest, because there
are more chances that it contain a leaking process. The same is true
right now for processes.

I agree, that for size-based comparison we could use a different policy:
comparing leaf cgroups despite their level. But I don't see a clever
way to apply oom_priorities in this case. Comparing oom_priority
on each level is a simple and powerful policy, and it works well
for delegation.

>
> [...]
> > > I do not understand why do we have to handle root cgroup specially here.
> > > select_victim_memcg already iterates all memcgs in the oom hierarchy
> > > (including root) so if the root memcg is the largest one then we
> > > should simply consider it no?
> >
> > We don't have necessary stats for the root cgroup, so we can't calculate
> > it's oom_score.
>
> We used to charge pages to the root memcg as well so we might resurrect
> that idea. In any case this is something that could be hidden in
> memcg_oom_badness rather then special cased somewhere else.

In theory I agree, but I do not see a good way to calculate root memcg
oom_score.

>
> > > You are skipping root there because of
> > > memcg_has_children but I suspect this and the whole accumulate up the
> > > hierarchy approach just makes the whole thing more complex than necessary. With
> > > "tasks only in leafs" cgroup policy we should only see any pages on LRUs
> > > on the global root memcg and leaf cgroups. The same applies to memcg
> > > stats. So why cannot we simply do the tree walk, calculate
> > > badness/check the priority and select the largest memcg in one go?
> >
> > We have to traverse from top to bottom to make priority-based decision,
> > but size-based oom_score is calculated as sum of descending leaf cgroup scores.
> >
> > For example:
> > root
> > /\
> > / \
> > A D
> > / \
> > B C
> > A and D have same priorities, B has larger priority than C.
> >
> > In this case we need to calculate size-based score for A, which requires
> > summing oom_score of the sub-tree (B an C), despite we don't need it
> > for choosing between B and C.
> >
> > Maybe I don't see it, but I don't know how to implement it more optimal.
>
> I have to think about the priority based oom killing some more to be
> honest. Do we really want to allow setting a priority to non-leaf
> memcgs? How are you going to manage the whole tree consistency? Say your
> above example have prio(A) < prio(D) && prio(C) > prio(D). Your current
> implementation would kill D, right?

Right.

> Isn't that counter intuitive
> behavior again. If anything we should prio(A) = max(tree_prio(A)). Again
> I could understand comparing priorities only on killable entities.

Answered above.
Also, I don't think any per-memcg knobs should have global meaning,
despite parent memcg settings. It will break delegation model.

Thanks!

Roman