Re: [patch -mm 3/4] mm, memcg: replace memory.oom_group with policy tunable
From: David Rientjes
Date: Tue Jan 23 2018 - 17:22:24 EST
On Tue, 23 Jan 2018, Michal Hocko wrote:
> > It can't, because the current patchset locks the system into a single
> > selection criterion that is unnecessary, and the mount option would
> > become a no-op once the policy per subtree becomes configurable by the
> > user as part of the hierarchy itself.
>
> This is simply not true! OOM victim selection has changed in the
> past and will always be subject to change in the future. The current
> implementation doesn't provide any externally controllable selection
> policy and therefore the default can be assumed, whatever that default
> means now or in the future. The only contract added here is to kill the
> full memcg if selected, and that can be implemented on _any_ selection
> policy.
>
The current implementation of memory.oom_group is built on top of a
selection implementation that is broken in three ways that I have been
listing for months:
- allows users to intentionally or unintentionally evade the oom killer;
  preventing this requires subtree control and requires not locking the
  selection implementation for the entire system, which makes a mount
  option obsolete and breaks any existing users who would use the
  implementation based on 4.16 if this were merged,
- unfairly compares the root mem cgroup against leaf mem cgroups, such
  that users must structure their hierarchy, only for 4.16, so that
  _all_ processes are under hierarchical control and have no power to
  create sub cgroups because of the point above; it also breaks any
  user of oom_score_adj in a completely undocumented and unspecified
  way, such that fixing that breakage would also break any existing
  users who would use the implementation based on 4.16 if this were
  merged, and
- does not allow userspace to protect important cgroups, although that
  can be built on top.
I'm focused on fixing the breakage in the first two points since it
affects the API and we don't want to switch that out from under the user.
I have brought these points up repeatedly and everybody else has actively
disengaged from development, so I'm proposing incremental changes that
give the cgroup aware oom killer a sustainable API and make it useful
beyond a highly specialized usecase where everything is containerized,
nobody can create subcgroups, and nobody uses oom_score_adj to break the
root mem cgroup accounting.
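
To make the first point concrete, here is a toy model in Python. It is
not kernel code. It assumes, purely for illustration, that the victim is
the single leaf cgroup with the largest usage; the cgroup names, sizes,
and both helper functions are hypothetical.

```python
# Toy model only, not the kernel's selection logic.
# Assumption: the victim is the leaf cgroup with the largest usage (MiB).

def select_victim(leaf_usage):
    """Pick the leaf cgroup with the largest usage."""
    return max(leaf_usage, key=leaf_usage.get)


def select_victim_hierarchical(leaf_usage):
    """Sum usage per top-level cgroup first, then pick the largest."""
    totals = {}
    for path, usage in leaf_usage.items():
        top = path.split("/")[0]
        totals[top] = totals.get(top, 0) + usage
    return max(totals, key=totals.get)


# User A keeps 8 GiB in one cgroup; user B uses 3 GiB.
flat = {"A": 8192, "B": 3072}

# User A spreads the same 8 GiB over four child cgroups.
split = {"A/1": 2048, "A/2": 2048, "A/3": 2048, "A/4": 2048, "B": 3072}

print(select_victim(flat))                # A
print(select_victim(split))               # B
print(select_victim_hierarchical(split))  # A
```

Under that assumption, A evades selection simply by creating sub
cgroups, and comparing usage per subtree selects A again.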