Re: [PATCH v13 0/7] cgroup-aware OOM killer

From: David Rientjes
Date: Tue Jan 16 2018 - 16:36:31 EST


On Mon, 15 Jan 2018, Michal Hocko wrote:

> > No, this isn't how kernel features get introduced. We don't design a new
> > kernel feature with its own API for a highly specialized usecase and then
> > claim we'll fix the problems later. Users will work around the
> > constraints of the new feature, if possible, and then we can end up
> > breaking them later. Or, we can pollute the mem cgroup v2 filesystem with
> > even more tunables to cover up for mistakes in earlier designs.
>
> This is a blatant misinterpretation of the proposed changes. I haven't
> heard _any_ single argument against the proposed user interface except
> for complaints for missing tunables. This is not how the kernel
> development works and should work. The usecase was clearly described and
> far from limited to a single workload or company.
>

The complaint about the user interface is that it is not extensible, as my
next line states. This doesn't need to be opted into with a mount option
locking the entire system into a single oom policy. That, itself, is the
result of a poor design. What is needed is a way for users to define an
oom policy that is generally useful, not something that is locked in for
the whole system. We don't need several different cgroup mount options
only for mem cgroup oom policies. We also don't need random
memory.groupoom files being added to the mem cgroup v2 filesystem only for
one or two particular policies and being no-ops otherwise. It can easily
be specified as part of the policy itself. My suggestion adds two new
files to the mem cgroup v2 filesystem and no mount option, and allows any
policy to be added later that only uses these two files. I see you've
ignored all of that in this email, so perhaps reading it would be
worthwhile so that you can provide constructive feedback.

> > The key point to all three of my objections: extensibility.
>
> And it has been argued that further _features_ can be added on top. I am
> absolutely fed up discussing those things again and again without any
> progress. You simply keep _ignoring_ counter arguments and that is a
> major PITA to be honest with you. You are basically blocking a useful
> feature because it doesn't solve your particular workload. This is
> simply not acceptable in the kernel development.
>

As the thread says, this has nothing to do with my own particular
workload, it has to do with three obvious shortcomings in the design that
the user has no control over. We can't add features on top if the
heuristic itself changes as a result of the proposal, it needs to be
introduced in an extensible way so that additional changes can be made
later, if necessary, while still working around the very obvious problems
with this current implementation. My suggestion is that we introduce a
way to define the oom policy once so that we don't have to change it later
and are left with needless mount options or mem cgroup v2 files that
become no-ops with the suggested design. I hope that you will read the
proposal for that extensible interface and comment on it about any
concerns that you have, because that feedback would generally be useful.

> > Both you and Michal have acknowledged blantently obvious shortcomings in
> > the design.
>
> What you call blatant shortcomings we do not see affecting any
> _existing_ workloads. If they turn out to be real issues then we can fix
> them without deprecating any user APIs added by this patchset.
>

There are existing workloads that use mem cgroup subcontainers purely for
tracking charging and vmscan stats, which results in this logic being
evaded. It's a real issue, and a perfectly acceptable usecase for mem
cgroup. It's a result of the entire oom policy either being opted into or
opted out of for the entire system and impossible for the user to
configure or avoid. That can be done better by enabling the oom policy
only for a subtree, as I've suggested, but you've ignored. It would also
render both the mount option and the additional file in the mem cgroup v2
filesystem added by this patchset to be no-ops.