Re: [v8 0/4] cgroup-aware OOM killer

From: Michal Hocko
Date: Tue Sep 26 2017 - 07:21:42 EST


On Tue 26-09-17 11:59:25, Roman Gushchin wrote:
> On Mon, Sep 25, 2017 at 10:25:21PM +0200, Michal Hocko wrote:
> > On Mon 25-09-17 19:15:33, Roman Gushchin wrote:
> > [...]
> > > I'm not against this model, as I've said before. It feels logical,
> > > and will work fine in most cases.
> > >
> > > In this case we can drop any mount/boot options, because it preserves
> > > the existing behavior in the default configuration. A big advantage.
> >
> > I am not sure about this. We still need an opt-in, regardless, because
> > selecting the largest process from the largest memcg != selecting the
> > largest task (just consider the example of a memcg with many processes).
>
> As I understand Johannes, he suggested comparing individual processes
> with group_oom mem cgroups. In other words, always select the killable
> entity with the biggest memory footprint.
>
> This is slightly different from my v8 approach, where I treat leaf
> memcgs as indivisible memory consumers independent of the group_oom
> setting, so by default I'm selecting the biggest task in the biggest
> memcg.

My reading is that he is actually proposing the same thing I've been
mentioning. Simply select the biggest killable entity (leaf memcg or
group_oom hierarchy) and either kill the largest task in that entity
(for !group_oom) or the whole memcg/hierarchy otherwise.
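
Roughly, the kill step would then look something like this (pseudo-C;
the types and helpers below are hypothetical, declared only to keep the
sketch self-contained):

struct task;
struct memcg;

int memcg_is_oom_group(struct memcg *memcg);	/* group_oom set? */
void oom_kill_memcg(struct memcg *memcg);	/* kill the whole subtree */
struct task *memcg_largest_task(struct memcg *memcg);
void oom_kill_task(struct task *task);

/* Kill step, once the biggest killable entity has been selected. */
static void oom_kill_victim(struct memcg *victim)
{
	if (memcg_is_oom_group(victim))
		oom_kill_memcg(victim);		/* whole memcg/hierarchy */
	else
		oom_kill_task(memcg_largest_task(victim));
}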

> While the approach suggested by Johannes looks clear and reasonable,
> I'm slightly concerned about possible implementation issues,
> which I've described below:
>
> >
> > > The only thing I'm slightly concerned about is that, due to the way
> > > we calculate the memory footprint of tasks and memory cgroups, we
> > > will have a number of weird edge cases. For instance, putting a
> > > single process into a group_oom memcg will alter its oom_score
> > > significantly and result in significantly different chances of
> > > being killed. An obvious example is a task with oom_score_adj set
> > > to any non-extreme value (other than 0 and -1000), but it can also
> > > happen in the case of a constrained alloc, for instance.
> >
> > I am not sure I understand. Are you talking about comparing the root
> > memcg to other memcgs?
>
> Not only that, but the root memcg in this case will be another
> complication. We can also use the same trick for all memcgs (define a
> memcg's oom_score as the maximum oom_score of its tasks); that would
> turn group_oom into a pure container-cleanup solution, without
> changing the victim selection algorithm.

I fail to see the problem, to be honest. Simply evaluate the memcg_score
you have so far, with one minor change: you only check memcgs which have
tasks attached (rather than checking for leaf nodes) or which have
group_oom set. An intermediate group_oom memcg will get the cumulative
size of its whole subhierarchy, so you know you can skip that subtree,
because no subtree can be larger.
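
In pseudo-C, the walk could look something like the sketch below. The
struct fields and helpers are made up for illustration only; score
stands for the memory footprint used for comparison (the cumulative
size of the subhierarchy for a group_oom memcg):

struct memcg {
	struct memcg *child;	/* first child */
	struct memcg *sibling;	/* next sibling */
	int oom_group;		/* group_oom set on this memcg */
	int has_tasks;		/* tasks attached directly */
	unsigned long score;	/* memory footprint for comparison */
};

static void evaluate(struct memcg *memcg, struct memcg **victim)
{
	struct memcg *c;

	/* Only memcgs with tasks or with group_oom set are killable. */
	if (memcg->oom_group || memcg->has_tasks) {
		if (!*victim || memcg->score > (*victim)->score)
			*victim = memcg;
	}

	/*
	 * A group_oom memcg is scored by the cumulative size of its
	 * whole subhierarchy, so no descendant can beat it: skip the
	 * subtree.
	 */
	if (memcg->oom_group)
		return;

	for (c = memcg->child; c; c = c->sibling)
		evaluate(c, victim);
}

Pruning group_oom subtrees this way also avoids comparing a group_oom
hierarchy against its own children.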

> But, again, I'm not against the approach suggested by Johannes. I think
> that overall it offers the best possible semantics, if we set some
> implementation details aside.

I do not see those implementation detail issues, and let me repeat: do
not develop semantics based on implementation details.
--
Michal Hocko
SUSE Labs