Re: [v8 0/4] cgroup-aware OOM killer

From: Michal Hocko
Date: Fri Sep 15 2017 - 06:59:27 EST


On Thu 14-09-17 09:05:48, Roman Gushchin wrote:
> On Thu, Sep 14, 2017 at 03:40:14PM +0200, Michal Hocko wrote:
> > On Wed 13-09-17 14:56:07, Roman Gushchin wrote:
> > > On Wed, Sep 13, 2017 at 02:29:14PM +0200, Michal Hocko wrote:
> > [...]
> > > > I strongly believe that comparing only leaf memcgs
> > > > is more straightforward and it doesn't lead to unexpected results as
> > > > mentioned before (killing a small memcg which is part of a larger
> > > > sub-hierarchy).
> > >
> > > One of the two main goals of this patchset is to introduce cgroup-level
> > > fairness: bigger cgroups should be affected more than smaller ones,
> > > regardless of the size of the tasks inside. I believe the same principle
> > > should be used for cgroups.
> >
> > Yes, bigger cgroups should be preferred, but I fail to see why bigger
> > hierarchies should be considered as well if they are not kill-all. And
> > whether non-leaf memcgs should allow kill-all is not entirely clear to
> > me. What would be the usecase?
>
> We definitely want to support kill-all for non-leaf cgroups.
> A workload can consist of several cgroups and we want to clean up
> the whole thing on OOM.

Could you be more specific about such a workload? E.g. how can such a
hierarchy be handled consistently when its sub-tree gets killed due to
internal memory pressure? Or do you expect that no part of the subtree
will have a hard limit configured?

> I don't see any reasons to limit this functionality to leaf cgroups
> only.

Well, I wanted to start simple first and extend on top. Memcg v1 is full
of seemingly interesting and very generic concepts which turned out to be
a headache long term.

> Hierarchies are memory consumers, we do account their usage,
> we do apply limits and guarantees to the hierarchies. The same is
> true for OOM victim selection: we are reclaiming memory from the
> biggest consumer. The kill-all knob only defines _how_ we do that:
> by killing one or all processes.

But then you just enforce a structural restriction on your configuration
because
        root
       /    \
      A      D
     / \
    B   C

is a different thing than
        root
      /  |  \
     B   C   D

And consider that the sole purpose of A might be control over a
non-memory resource (e.g. cpu share distribution). Why should we
discriminate against B and C in such a case?
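
Just to make the structural point concrete, here is a toy comparison
(made-up numbers, nothing from the patchset) showing how merely
inserting A changes who gets picked when hierarchies are compared by
cumulative usage:

  #include <stdio.h>

  int main(void)
  {
          unsigned long b = 300, c = 300, d = 500;  /* usage in MB, made up */

          /* Flat layout: root -> {B, C, D}. D is the biggest single
           * entity, so B and C are safe. */
          printf("flat:   victim group = %s\n",
                 (d > b && d > c) ? "D" : "B or C");

          /* Nested layout: root -> {A, D}, A -> {B, C}, where A exists
           * only for cpu share distribution. Comparing by cumulative
           * usage now picks A (600 > 500), and the eventual victim is
           * B or C, each of which is smaller than D. */
          unsigned long a = b + c;
          printf("nested: victim group = %s\n",
                 (a > d) ? "A, i.e. B or C" : "D");

          return 0;
  }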

> Just for example, we might want to take memory.low into account at
> some point: prefer cgroups which are above their guarantees, avoid
> killing those that fit within them. It would be hard if we're comparing
> cgroups from different hierarchies.

This can be reflected in the memcg oom score calculation, I believe. We
already do something similar during reclaim.
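
Just to sketch what I have in mind (my own simplification, not anything
from the patchset): the score could simply discount whatever sits under
the memory.low protection, the same way reclaim does:

  static unsigned long memcg_oom_score(unsigned long usage, unsigned long low)
  {
          /* memcgs within their memory.low guarantee contribute nothing
           * and are therefore avoided; those above it are compared by
           * how far they exceed the protection */
          return usage > low ? usage - low : 0;
  }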

> The same will be true when introducing oom_priorities, which is much
> more needed functionality.

More on that below.

> > Consider that it might not be your choice (as a user) how deep your
> > leaf memcg is. I can already see how people complain that their memcg
> > has been killed just because it was one level deeper in the hierarchy...
>
> The kill-all functionality is enforced by the parent, and it seems to
> follow the overall memcg design. The parent cgroup enforces the memory
> limit, the memory low limit, etc.

And the same is true for the memcg oom killer. It restricts the selection
to the out-of-memory subtree. We are trying to be proportional to the
size of the _reclaimable_ memory in that scope, the same way as with the
LRU reclaim. We do not prefer larger hierarchies over smaller ones. We
just iterate over those that have pages on LRUs (leaf memcgs with v2) and
scan/reclaim proportionally to their size. Why should the oom killer
decision be any different in that regard?
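
Roughly, the selection I am describing would look like this toy sketch
(plain userspace stand-ins, not the real memcg iterator or the
patchset's code):

  #include <stddef.h>

  /* toy stand-in for a memcg; under v2 intermediate nodes hold no pages */
  struct memcg {
          unsigned long lru_pages;  /* reclaimable pages on this memcg's LRUs */
          struct memcg *children;   /* first child */
          struct memcg *sibling;    /* next sibling */
  };

  /* walk the oom subtree and pick the memcg with the most reclaimable
   * memory; intermediate memcgs have no pages of their own and thus
   * never win, without any explicit "skip non-leaf" rule */
  static struct memcg *pick_victim(struct memcg *m)
  {
          struct memcg *best = m->lru_pages ? m : NULL;

          for (struct memcg *c = m->children; c; c = c->sibling) {
                  struct memcg *cand = pick_victim(c);

                  if (cand && (!best || cand->lru_pages > best->lru_pages))
                          best = cand;
          }
          return best;
  }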

> I don't know why OOM control should be different.

I am not arguing that kill-all functionality on non-leaf memcgs is wrong.
I just haven't heard the usecase for it yet. I am also not opposed to
considering the cumulative size of a non-leaf memcg if it is kill-all,
as the cumulative size will be reclaimed then. But I fail to see why we
should prefer larger hierarchies when the resulting memcg victim is much
smaller in the end.
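
The compromise above could be sketched like this (the same toy struct as
in the previous sketch, with a kill_all flag that is merely a stand-in
for the opt-in knob):

  struct memcg {
          unsigned long lru_pages;
          int kill_all;             /* stand-in for the opt-in kill-all knob */
          struct memcg *children, *sibling;
  };

  static unsigned long subtree_pages(const struct memcg *m)
  {
          unsigned long sum = m->lru_pages;

          for (const struct memcg *c = m->children; c; c = c->sibling)
                  sum += subtree_pages(c);
          return sum;
  }

  /* a kill-all memcg is judged by everything that killing it would free;
   * anything else only by its own consumption */
  static unsigned long memcg_candidate_size(const struct memcg *m)
  {
          return m->kill_all ? subtree_pages(m) : m->lru_pages;
  }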

> > I would really start simple and only allow kill-all on leaf memcgs and
> > only compare leaf memcgs & root. If we ever need to kill whole
> > hierarchies then allow kill-all on intermediate memcgs as well and then
> > consider cumulative consumptions only on those that have kill-all
> > enabled.
>
> This sounds hacky to me: the whole thing depends on cgroup v2 and
> is additionally explicitly opt-in.
>
> Why do we need to introduce such incomplete functionality first,
> and then suffer trying to extend it and provide backward compatibility?

Why would backward compatibility be a problem? Kill-all on non-leaf
memcgs should be seamless. We would simply allow setting the knob. Adding
a priority shouldn't be a problem either. A new knob would be added. For
example, any memcg with a non-zero priority would be considered during
selection (cumulative size would serve as a tie-breaker for non-leaf
memcgs, and a victim would be selected from the largest hierarchy/leaf
memcg - but that really needs to be thought through, and we need to hear
about specific usecases).
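
To illustrate the kind of comparison I mean (my own sketch, not a
committed interface; "prio" stands in for whatever the new knob would be
called): a priority defaulting to zero keeps today's behavior, and size
only breaks ties:

  struct memcg_candidate {
          int prio;               /* hypothetical new priority knob, default 0 */
          unsigned long size;     /* (cumulative) usage of the candidate */
  };

  /* returns nonzero if a should be preferred over b as the oom victim:
   * higher priority wins, size is only a tie-breaker, and with all
   * priorities left at 0 this degenerates to the plain size comparison */
  static int memcg_oom_prefer(const struct memcg_candidate *a,
                              const struct memcg_candidate *b)
  {
          if (a->prio != b->prio)
                  return a->prio > b->prio;
          return a->size > b->size;
  }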

> Also, I think we should compare the root cgroup with top-level cgroups,
> rather than leaf cgroups. A process in the root cgroup is definitely a
> system-level entity, and we should compare it with other top-level
> entities (other containerized workloads), rather than some random
> leaf cgroup deep inside the tree. If we decided that we're not comparing
> random tasks from different cgroups, why should we do this for leaf
> cgroups? It sounds like making only one step towards the right direction,
> while we can do more.

The main problem I have with that is mentioned above. A single hierarchy
enforces some structural constraints when multiple controllers are in
place.
--
Michal Hocko
SUSE Labs