Re: [v8 0/4] cgroup-aware OOM killer

From: David Rientjes
Date: Fri Sep 22 2017 - 16:53:56 EST


On Thu, 21 Sep 2017, Johannes Weiner wrote:

> > The issue is that if you opt-in to the new feature, then you are forced to
> > change /proc/pid/oom_score_adj of all processes attached to a cgroup that
> > you do not want oom killed based on size to be oom disabled.
>
> You're assuming that most people would want to influence the oom
> behavior in the first place. I think the opposite is the case: most
> people don't care as long as the OOM killer takes the intent the user
> has expressed wrt runtime containerization/grouping into account.
>

If you do not want to influence the oom behavior, do not change
memory.oom_priority from its default. It's that simple.

> > The kernel provides no other remedy without oom priorities since the
> > new feature would otherwise disregard oom_score_adj.
>
> As of v8, it respects this setting and doesn't kill min score tasks.
>

That's the issue. To protect a memory cgroup from being oom killed in a
system oom condition, you need to change oom_score_adj of *all* processes
attached to be oom disabled. Then, you have a huge problem in memory
cgroup oom conditions because nothing can be killed in that hierarchy
itself.

> > The patchset compares memory cgroup size relative to sibling cgroups only,
> > the same comparison for memory.oom_priority. There is a guarantee
> > provided on how cgroup size is compared in select_victim_memcg(), it
> > hierarchically accumulates the "size" from leaf nodes up to the root memcg
> > and then iterates the tree comparing sizes between sibling cgroups to
> > choose a victim memcg. That algorithm could be more elaborately described
> > in the documentation, but we simply cannot change the implementation of
> > select_victim_memcg() later even without oom priorities since users cannot
> > get inconsistent results after opting into a feature between kernel
> > versions. I believe the selection criteria should be implemented to be
> > deterministic, as select_victim_memcg() does, and the documentation should
> > fully describe what the selection criteria is, and then allow the user to
> > decide.
>
> I wholeheartedly disagree. We have changed the behavior multiple times
> in the past. In fact, you have arguably done the most drastic changes
> to the algorithm since the OOM killer was first introduced. E.g.
>
> a63d83f427fb oom: badness heuristic rewrite
>
> And that's completely fine. Because this thing is not a resource
> management tool for userspace, it's the kernel saving itself. At best
> in a manner that's not too surprising to userspace.
>

When I did that, I had to add /proc/pid/oom_score_adj to allow userspace
to influence selection. We came up with /proc/pid/oom_score_adj when
working with kde, openssh, chromium, and udev because they cared about the
ability to influence the decisionmaking. I'm perfectly happy with the new
heuristic presented in this patchset, I simply want userspace to be able
to influence it, if it desires. Requiring userspace to set all processes
to be oom disabled to protect a hierarchy is totally and completely
broken. It livelocks the memory cgroup if it is oom itself.

> To me, your argument behind the NAK still boils down to "this doesn't
> support my highly specialized usecase." But since it doesn't prohibit
> your usecase - which isn't even supported upstream, btw - this really
> doesn't carry much weight.
>
> I'd say if you want configurability on top of Roman's code, please
> submit patches and push the case for these in a separate effort.
>

Roman implemented memory.oom_priority himself, it has my Tested-by, and it
allows users who want to protect high priority memory cgroups from using
the size based comparison for all other cgroups that we very much desire.