Re: [patch -mm 3/4] mm, memcg: replace memory.oom_group with policy tunable

From: Michal Hocko
Date: Fri Jan 26 2018 - 05:07:38 EST


On Thu 25-01-18 15:27:29, David Rientjes wrote:
> On Thu, 25 Jan 2018, Michal Hocko wrote:
>
> > > As a result, this would remove patch 3/4 from the series. Do you have any
> > > other feedback regarding the remainder of this patch series before I
> > > rebase it?
> >
> > Yes, and I have provided it already. What you are proposing is
> > incomplete at best and needs much better consideration and much more
> > time to settle.
> >
>
> Could you elaborate on why specifying the oom policy for the entire
> hierarchy as part of the root mem cgroup and also for individual subtrees
> is incomplete? It allows admins to specify and delegate policy decisions
> to subtree owners as appropriate. It addresses your concern in the
> /admins and /students example. It addresses my concern about evading the
> selection criteria simply by creating child cgroups. It appears to be a
> win-win. What is incomplete, or what are you concerned about?

I will get back to this later. I am really busy these days. This is not
a trivial thing at all.

> > > I will address the unfair root mem cgroup vs leaf mem cgroup comparison in
> > > a separate patchset to fix an issue where any user of oom_score_adj on a
> > > system that is not fully containerized gets very unusual, unexpected, and
> > > undocumented results.
> >
> > I will not oppose but as it has been mentioned several times, this is by
> > no means a blocker issue. It can be added on top.
> >
>
> The current implementation is only useful for fully containerized systems
> where no processes are attached to the root mem cgroup. Anything in the
> root mem cgroup is judged by different criteria and if they use
> /proc/pid/oom_score_adj the entire heuristic breaks down.

Most usecases I've seen use oom_score_adj only to disable the oom killer
for a particular service. In those cases the current heuristic works
reasonably well.
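
For illustration only, a minimal sketch of that common pattern: a service
writing the minimum value (-1000) to its own oom_score_adj file so that it
is never selected. The helper names are hypothetical and the path parameter
exists only to make the sketch easy to try against a scratch file; lowering
the value on a real system normally requires CAP_SYS_RESOURCE.

```python
# Kernel-defined bounds for /proc/<pid>/oom_score_adj.
OOM_SCORE_ADJ_MIN = -1000  # task is exempt from OOM killer selection
OOM_SCORE_ADJ_MAX = 1000


def set_oom_score_adj(value, path="/proc/self/oom_score_adj"):
    """Write an oom_score_adj value (hypothetical helper, not a kernel API)."""
    if not OOM_SCORE_ADJ_MIN <= value <= OOM_SCORE_ADJ_MAX:
        raise ValueError("oom_score_adj must be in [-1000, 1000]")
    with open(path, "w") as f:
        f.write(str(value))


def disable_oom_kill(path="/proc/self/oom_score_adj"):
    """Opt the calling process out of OOM selection (hypothetical helper)."""
    set_oom_score_adj(OOM_SCORE_ADJ_MIN, path)
```

Nothing here attempts to prioritize one task against another; it only takes
a single service out of consideration.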

I am not _aware_ of any usecase which uses oom_score_adj to actively
control the oom selection decisions and which would _require_ the
memcg aware oom killer. Maybe there are some but then we need to do much
more than to "fix" the root memcg comparison. We would need a complete
memcg aware prioritization as well. It simply doesn't make much sense
to tune oom selection only on a subset of tasks ignoring the rest of the
system workload which is likely to comprise the majority of the resource
consumers.

We have already discussed that something like that will emerge sooner or
later but I am not convinced we need it _now_. It is perfectly natural
to start with a simple model without any priorities at all.

> That's because per-process usage and oom_score_adj are only relevant
> for the root mem cgroup and irrelevant when attached to a leaf.

This is the simplest implementation. You could go and ignore
oom_score_adj on root tasks. Would it be much better? Should you ignore
oom disabled tasks? Should you consider kernel memory footprint of those
tasks? Maybe we will realize that we simply have to account root memcg
like any other memcg. We used to do that but it has been reverted due
to performance footprint. There are more questions to answer I believe
but the most important one is whether actually any _real_ user cares.

I can see your arguments and they are true. You can construct setups
where the current memcg oom heuristic works sub-optimally. The same has
been the case for the OOM killer in general. The OOM killer has always
been just a heuristic and there will always be somebody complaining. This
doesn't mean we should just remove it, because it works reasonably well
for most users.

> Because of that, users are
> affected by the design decision and will organize their hierarchies as
> appropriate to avoid it. Users who only want to use cgroups for a subset
> of processes but still treat those processes as indivisible logical units
> when attached to cgroups find that it is simply not possible.

Nobody enforces the memcg oom selection as presented here for those
users. They have to explicitly _opt-in_. If the new heuristic doesn't
work for them we will hear about that most likely. I am really skeptical
that oom_score_adj can be reused for memcg aware oom selection.

> I'm focused solely on fixing the three main issues that this
> implementation causes. One of them, userspace influence to protect
> important cgroups, can be added on top. The other two, evading the
> selection criteria and unfair comparison of root vs leaf, are shortcomings
> in the design that I believe should be addressed before it's merged to
> avoid changing the API later.

I believe I have explained why the root memcg comparison is an
implementation detail. The subtree delegation is something that we will
have to care about eventually. But I do not see it as an immediate threat.
Likewise, I do not see the current OOM killer as flawed just because there
are ways to evade it. Moreover the delegation is much less of a problem
because creating subgroups is usually a privileged operation and it
requires quite some care already. This is a much higher bar than the
simple fork and hide games in the global case.

> I'm in no rush to ask for the cgroup aware
> oom killer to be merged if it's incomplete and must be changed for
> usecases that are not highly specialized (fully containerized and no use
> of oom_score_adj for any process).

You might not be in a rush but it feels rather strange to block a
feature other people want to use.

> I am actively engaged in fixing it,
> however, so that it becomes a candidate for merge.

I do not think anything you have proposed so far is even close to a
mergeable state. I think you are simply oversimplifying this. We have
spent many months discussing different aspects of the memcg aware OOM
killer. The result is a compromise that should work reasonably well
for the targeted usecases and it doesn't bring unsustainable APIs that
will get carved into stone.
--
Michal Hocko
SUSE Labs