Re: [patch -mm v2 2/3] mm, memcg: replace cgroup aware oom killer mount option with tunable

From: David Rientjes
Date: Fri Jan 26 2018 - 17:53:07 EST


On Fri, 26 Jan 2018, Andrew Morton wrote:

> > -ECONFUSED. We want to have a mount option that has the sole purpose of
> > doing echo cgroup > /mnt/cgroup/memory.oom_policy?
>
> Approximately. Let me put it another way: can we modify your patchset
> so that the mount option remains, and continues to have a sufficiently
> same effect? For backward compatibility.
>

The mount option would exist solely to set the oom policy of the root mem
cgroup, it would lose its effect of mandating that policy for any subtree
since it would become configurable by the user if delegated.

Let me put it another way: if the cgroup aware oom killer is merged for
4.16 without this patchset, userspace can reasonably infer the oom policy
from checking how cgroups were mounted. If my followup patchset were
merged for 4.17, that's invalid and it becomes dependent on kernel
version: it could have the "groupoom" mount option but configured through
the root mem cgroup's memory.oom_policy to not be cgroup aware at all.

That inconsistency, to me, is unfair to burden userspace with.

> > This, and fixes to fairly compare the root mem cgroup with leaf mem
> > cgroups, are essential before the feature is merged otherwise it yields
> > wildly unpredictable (and unexpected, since its interaction with
> > oom_score_adj isn't documented) results as I already demonstrated where
> > cgroups with 1GB of usage are killed instead of 6GB workers outside of
> > that subtree.
>
> OK, so Roman's new feature is incomplete: it satisfies some use cases
> but not others. And we kinda have a plan to address the other use
> cases in the future.
>

Those use cases are also undocumented such that the user doesn't know the
behavior they are opting into. Nowhere in the patchset does it mention
anything about oom_score_adj other than being oom disabled. It doesn't
mention that a per-process tunable now depends strictly on whether it is
attached to root or not. It specifies a fair comparison between the root
mem cgroup and leaf mem cgroups, which is obviously incorrect by the
implementation itself. So I'm not sure the user would know which use
cases it is valid for, which is why I've been trying to make it generally
purposeful and documented.

> There's nothing wrong with that! As long as we don't break existing
> setups while evolving the feature. How do we do that?
>

We'd break the setups that actually configure their cgroups and processes
to abide by the current implementation since we'd need to discount
oom_score_adj from the the root mem cgroup usage to fix it.

There hasn't been any feedback on v2 of my patchset that would suggest
changes are needed. I think we all recognize the shortcoming that it is
addressing. The only feedback on v1, the need for memory.oom_group as a
separate tunable, has been addressed in v2. What are we waiting for?