Re: [v8 0/4] cgroup-aware OOM killer

From: David Rientjes
Date: Tue Sep 19 2017 - 16:51:36 EST


On Mon, 18 Sep 2017, Michal Hocko wrote:

> > > > But then you just enforce a structural restriction on your configuration
> > > > because
> > > >       root
> > > >       /  \
> > > >      A    D
> > > >     / \
> > > >    B   C
> > > >
> > > > is a different thing than
> > > >      root
> > > >     / | \
> > > >    B  C  D
> > > >
> > >
> > > I actually don't have a strong argument against an approach to select
> > > largest leaf or kill-all-set memcg. I think, in practice there will be
> > > not much difference.
> > >
> > > The only real concern I have is that then we have to do the same with
> > > oom_priorities (select largest priority tree-wide), and this will limit
> > > an ability to enforce the priority by parent cgroup.
> > >
> >
> > Yes, oom_priority cannot select the largest priority tree-wide for exactly
> > that reason. We need the ability to control from which subtree the kill
> > occurs in ancestor cgroups. If multiple jobs are allocated their own
> > cgroups and they can own memory.oom_priority for their own subcontainers,
> > this becomes quite powerful so they can define their own oom priorities.
> > Otherwise, they can easily override the oom priorities of other cgroups.
>
> Could you be more specific about your usecase? What would be the
> problem if we only allowed the priority to increase in children (like
> other hierarchical controls)?
>

For memcg constrained oom conditions, there is only a theoretical issue if
the subtree is not under the control of a single user and various users
can alter their priorities without knowledge of the priorities of other
children in the same subtree that is oom, or those values change without
knowledge of a child. I don't know of anybody that configures memory
cgroup hierarchies that way, though.

The problem is more obvious in system oom conditions. If we have two
top-level memory cgroups with the same "job" priority, they get the same
oom priority. The user who configures subcontainers is now always
targeted for oom kill in an "increase priority in children" policy.

The hierarchy becomes this:

          root
         /    \
        A      D
       / \   / | \
      B   C E  F  G

where A/memory.oom_priority == D/memory.oom_priority.

D wants to kill in order of E -> F -> G, but can't configure that if
B = A - 1 and C = B - 1. It also shouldn't need to adjust its own oom
priorities based on a hierarchy outside its control and which can change
at any time at the discretion of the user (with namespaces you may not
even be able to access it).
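
To make the comparison concrete, here is a toy model of the hierarchy above.
It is only a sketch and not kernel code. It assumes that a larger
memory.oom_priority value means "killed first", that children may only raise
the value they inherit, and that B and C are left at A's value. All of the
numbers and function names are made up for illustration, and ties simply fall
to the first sibling listed.

```python
# Toy model only, not kernel code. Assumption: a larger value is killed first.
# The priority numbers below are hypothetical.
TREE = {
    "root": ["A", "D"],
    "A": ["B", "C"],
    "D": ["E", "F", "G"],
}

# A and D have the same "job" priority. D's user orders E -> F -> G by
# raising the children's values; A's user leaves B and C untouched.
PRIO = {"A": 10, "D": 10, "B": 10, "C": 10, "E": 13, "F": 12, "G": 11}


def leaves(node):
    """Return the leaf cgroups below node, left to right."""
    kids = TREE.get(node, [])
    if not kids:
        return [node]
    return [leaf for kid in kids for leaf in leaves(kid)]


def tree_wide_order(prio, root="root"):
    """Kill order when the largest priority is selected tree-wide."""
    return sorted(leaves(root), key=lambda n: -prio[n])


def per_level_victim(prio, root="root"):
    """Victim when only siblings are compared at each level of the walk."""
    node = root
    while TREE.get(node):
        # max() keeps the first sibling on a tie; a real policy would need
        # some other tie-breaker.
        node = max(TREE[node], key=lambda n: prio[n])
    return node


print(tree_wide_order(PRIO))
print(per_level_victim(PRIO))
```

In this model the tree-wide order is E, F, G, B, C, so everything under D is
killed before anything under A, only because D's user configured an order.
With the per-level comparison, the values under D affect the outcome only
after D itself has been chosen over A.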

But also if A/memory.oom_priority = D/memory.oom_priority - 100, A is
preferred unless its subcontainers configure themselves in a way where
they have higher oom priority values than E, F, and G. That may yield
very different results when additional jobs get scheduled on the system
(say, an H tree) where the user has full control over their own oom
priorities, even when the value must only increase.
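
The offset case can be sketched the same way. Again this is a toy model under
stated assumptions and not kernel code: a larger value is killed first, a
child may not go below its parent's value, the victim is the leaf with the
largest value tree-wide, and every number and name here is hypothetical.

```python
# Toy model only, not kernel code. Assumption: a larger value is killed first.
# Hypothetical values with A/memory.oom_priority = D/memory.oom_priority - 100.
TOP = {"A": 0, "D": 100}
PARENT = {"B": "A", "C": "A", "E": "D", "F": "D", "G": "D"}


def increase_only(leaf_prio):
    """True if no leaf has a value below its parent's value."""
    return all(p >= TOP[PARENT[leaf]] for leaf, p in leaf_prio.items())


def victim_subtree(leaf_prio):
    """Top-level cgroup that owns the leaf with the largest value tree-wide."""
    if not increase_only(leaf_prio):
        raise ValueError("a child lowered its priority below its parent's")
    leaf = max(leaf_prio, key=leaf_prio.get)
    return PARENT[leaf]


# B and C keep A's value, so the victim comes from D's subtree.
inherited = {"B": 0, "C": 0, "E": 102, "F": 101, "G": 100}

# A's user raises B above E, F and G; this is still "increase only".
raised = dict(inherited, B=150)

print(victim_subtree(inherited))
print(victim_subtree(raised))
```

The first call selects D's subtree and the second selects A's, even though
neither top-level value changed. The 100-point gap set above A and D is
overridden by a choice made entirely inside A's subtree.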