Re: [patch -mm v2 2/3] mm, memcg: replace cgroup aware oom killer mount option with tunable

From: David Rientjes
Date: Fri Jan 26 2018 - 17:20:32 EST


On Thu, 25 Jan 2018, Andrew Morton wrote:

> > Now that each mem cgroup on the system has a memory.oom_policy tunable to
> > specify oom kill selection behavior, remove the needless "groupoom" mount
> > option that requires (1) the entire system to be forced, perhaps
> > unnecessarily, perhaps unexpectedly, into a single oom policy that
> > differs from the traditional per process selection, and (2) a remount to
> > change.
> >
> > Instead of enabling the cgroup aware oom killer with the "groupoom" mount
> > option, set the mem cgroup subtree's memory.oom_policy to "cgroup".
>
> Can we retain the groupoom mount option and use its setting to set the
> initial value of every memory.oom_policy? That way the mount option
> remains somewhat useful and we're back-compatible?
>

-ECONFUSED. We want to retain a mount option whose sole effect is
doing echo cgroup > /mnt/cgroup/memory.oom_policy?

Please note that this patchset is not only about removing a mount option;
it allows oom policies to be configured per subtree such that users to
whom you delegate those subtrees cannot evade the oom policy set at a
higher level. The goal is to prevent the user from needing to organize
their hierarchy in a specific way to work around this constraint, or to
resort to measures like limiting the number of child cgroups the user is
allowed to create solely to enforce the oom policy. With a single cgroup
v2 hierarchy, that severely limits the amount of control the user has
over their processes, because they are locked into a very specific
hierarchy configuration solely to prevent evading oom kill.
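As a sketch of the intended usage (assuming the memory.oom_policy
interface from this patchset; a scratch directory stands in for a real
cgroup v2 mount so the commands are runnable without the patched kernel):

```shell
# Stand-in for the cgroup v2 mount point; on a real system this would be
# /sys/fs/cgroup and would require the patched kernel plus root privileges.
root=$(mktemp -d)

# An admin-controlled parent cgroup and a subtree delegated to a user.
mkdir -p "$root/workload/delegated"
touch "$root/workload/memory.oom_policy"

# The admin opts the workload subtree into cgroup-aware oom selection.
echo cgroup > "$root/workload/memory.oom_policy"

# Under the proposed semantics, the policy set here governs the whole
# subtree, so the delegated user cannot relax it from within "delegated"
# to evade a cgroup-level kill.
cat "$root/workload/memory.oom_policy"    # prints "cgroup"
```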

This, and fixes to fairly compare the root mem cgroup with leaf mem
cgroups, are essential before the feature is merged; otherwise it yields
wildly unpredictable (and unexpected, since its interaction with
oom_score_adj isn't documented) results, as I already demonstrated, where
cgroups with 1GB of usage are killed instead of 6GB workers outside of
that subtree.