Re: [PATCH v13 0/7] cgroup-aware OOM killer

From: David Rientjes
Date: Tue Jan 16 2018 - 16:22:02 EST


On Mon, 15 Jan 2018, Johannes Weiner wrote:

> > It's quite trivial to allow the root mem cgroup to be compared exactly the
> > same as another cgroup. Please see
> > https://marc.info/?l=linux-kernel&m=151579459920305.
>
> This only says "that will be fixed" and doesn't address why I care.
>

Because the design of the cgroup aware oom killer requires the entire
system to be fully containerized to be useful and select the
correct/anticipated victim. If anything is left behind in the root mem
cgroup, or on systems that use mem cgroups in ways that you do not, it
returns wildly unpredictable and downright wrong results; it factors
oom_score_adj into the selection criteria only for processes attached to
the root mem cgroup and ignores it otherwise. If we treated root and leaf
cgroups as comparable, this problem wouldn't arise.

> > > This assumes you even need one. Right now, the OOM killer picks the
> > > biggest MM, so you can evade selection by forking your MM. This patch
> > > allows picking the biggest cgroup, so you can evade by forking groups.
> >
> > It's quite trivial to prevent any cgroup from evading the oom killer by
> > either forking their mm or attaching all their processes to subcontainers.
> > Please see https://marc.info/?l=linux-kernel&m=151579459920305.
>
> This doesn't address anything I wrote.
>

It prevents both problems if you are attached to a mem cgroup subtree.
You can't fork the mm and you can't fork groups to evade the selection
logic. If the root mem cgroup and leaf cgroups were truly comparable, it
also prevents both problems regardless of which cgroup the processes
attached to.

> > > It's not a new vector, and clearly nobody cares. This has never been
> > > brought up against the current design that I know of.
> >
> > As cgroup v2 becomes more popular, people will organize their cgroup
> > hierarchies for all controllers they need to use. We do this today, for
> > example, by attaching some individual consumers to child mem cgroups
> > purely for the rich statistics and vmscan stats that mem cgroup provides
> > without any limitation on those cgroups.
>
> There is no connecting tissue between what I wrote and what you wrote.
>

You're completely ignoring other usecases other than your own highly
specialized usecase. You may attach every user process on the system to
leaf cgroups and you may not have any users who have control over a
subtree. Other people do. And this patchset, as implemented, returns
very inaccurate results or allows users to intentionally or
unintentionally evade the oom killer just because they want to use
subcontainers.

Creating and attaching a subset of processes to either top-level
containers or subcontainers for either limitation by mem cgroup or for
statistics is a valid usecase, and one that is used in practice. Your
suggestion of avoiding that problem to work around the shortcomings of
this patchset by limiting how many subcontainers can be created by the
user is utterly ridiculous.

> We have a patch set that was developed, iterated and improved over 10+
> revisions, based on evaluating and weighing trade-offs. It's reached a
> state where the memcg maintainers are happy with it and don't have any
> concern about future extendabilty to cover more specialized setups.
>

It's also obviously untested in those 10+ revisions since it uses
oom_badness() for the root mem cgroup and not leaf mem cgroups which is
why it breaks any system where user processes are attached to the root mem
cgroup. See my example where a 1GB worker attached to the root mem cgroup
is preferred over a cgroup with 6GB workers. It may have been tested with
your own highly specialized usecase, but not with anything else, and the
author obviously admits its shortcomings.

> You've had nine months to contribute and shape this patch series
> productively, and instead resorted to a cavalcade of polemics,
> evasion, and repetition of truisms and refuted points. A ten paragraph
> proposal of vague ideas at this point is simply not a valid argument.
>

The patch series has gone through massive changes over the past nine
months and I've brought up three very specific concerns with its current
form that makes it non-extensible. I know the patchset has very
significant changes or rewrites in its history, but I'm only concerned
with its present form because it's not something that can easily be
extended later. We don't need useless files polutting the cgroup v2
filesystem for things that don't matter with other oom policies and we
don't need the mount option, actually.

> You can send patches to replace or improve on Roman's code and make
> the case for them.
>

I volunteered in the email thread where I proposed an alternative to
replace it, I'm actively seeking any feedback on that proposal to address
anybody else's concerns early on. Your participation in that would be
very useful.

Thanks.