Re: [patch -mm v3 1/3] mm, memcg: introduce per-memcg oom policy tunable
From: David Rientjes
Date: Wed Mar 14 2018 - 16:59:10 EST
On Wed, 14 Mar 2018, Roman Gushchin wrote:
> > The cgroup aware oom killer is needlessly enforced for the entire system
> > by a mount option. It's unnecessary to force the system into a single
> > oom policy: either cgroup aware, or the traditional process aware.
>
> Can you, please, provide a real-life example, when using per-process
> and cgroup-aware OOM killer depending on OOM scope is beneficial?
>
Hi Roman,
This question is only about per-process vs cgroup-aware, not about the
need for individual cgroup vs hierarchical subtree, so I'll only focus on
that.
Three reasons:
- Does not lock the entire system into a single methodology. Users
working in a subtree can default to what they are used to: per-process
oom selection even though their subtree might be targeted by a system
policy level decision at the root. This allow them flexibility to
organize their subtree intuitively for use with other controllers in a
single hierarchy.
The real-world example is a user who currently organizes their subtree
for this purpose and has defined oom_score_adj appropriately and now
regresses if the admin mounts with the needless "groupoom" option.
- Allows changing the oom policy at runtime without remounting the entire
cgroup fs. Depending on how cgroups are going to be used, per-process
vs cgroup-aware may be mandated separately. This is a trait only of
the mem cgroup controller, the root level oom policy is no different
from the subtree and depends directly on how the subtree is organized.
If other controllers are already being used, requiring a remount to
change the system-wide oom policy is an unnecessary burden.
The real-world example is systems software that either supports user
subtrees or strictly subtrees that it maintains itself. While other
controllers are used, the mem cgroup oom policy can be changed at
runtime rather than requiring a remount and reorganizing other
controllers exactly as before.
- Can be extended to cgroup v1 if necessary. There is no need for a
new cgroup v1 mount option and mem cgroup oom selection is not
dependant on any functionality provided by cgroup v2. The policies
introduced here work exactly the same if used with cgroup v1.
The real-world example is a cgroup configuration that hasn't had
the ability to move to cgroup v2 yet and still would like to use
cgroup-aware oom selection with a very trivial change to add the
memory.oom_policy file to the cgroup v1 filesystem.
> It might be quite confusing, depending on configuration.
> From inside a container you can have different types of OOMs,
> depending on parent's cgroup configuration, which is not even
> accessible for reading from inside.
>
Right, and the oom is the result of the parent's limit that is outside the
control of the user. That limit, and now oom policy, is defined by the
user controlling the ancestor of the subtree. The user need not be
concerned that it was singled out for oom kill: that policy decision is
outside hiso or her control. memory.oom_group can certainly be delegated
to the user, but the targeting cannot be changed or evaded.
However, this patchset also provides them with the ability to define their
own oom policy for subcontainers that they create themselves.
> Also, it's probably good to have an interface to show which policies
> are available.
>
This comes back to the user interface question. I'm very happy to address
any way that the interface can be made better, even though I think what is
currently proposed is satisfactory. I think your comment eludes to thp
like enabling where we have "[always] madvise never"? I'm speculating
that you may be happier with memory.oom_policy becoming
"[none] cgroup tree" and extended for additional policies later?
Otherwise the user would need to try the write and test the return value,
which purely depends on whether the policy is available or not. I'm
rather indifferent to either interface, but if you would prefer the
"[none] cgroup tree" appearance, I'll change to that.