Re: [patch -mm v2 1/3] mm, memcg: introduce per-memcg oom policy tunable

From: David Rientjes
Date: Thu Feb 01 2018 - 05:11:17 EST


On Wed, 31 Jan 2018, Michal Hocko wrote:

> > > > >         root
> > > > >        /  |  \
> > > > >       A   B   C
> > > > >      / \     / \
> > > > >     D   E   F   G
> > > > >
> > > > > Assume A: cgroup, B: oom_group=1, C: tree, G: oom_group=1
> > > > >
> > > >
> > > > At each level of the hierarchy, memory.oom_policy compares immediate
> > > > children; it's the only way that an admin can lock in a specific oom
> > > > policy like "tree" and then delegate the subtree to the user. If you've
> > > > configured it as above, comparing A and C should be the same based on the
> > > > cumulative usage of their child mem cgroups.
> > >
> It seems I am still not being clear with my question. What kind of
> difference does policy=cgroup vs. none make on A? Also, what kind of
> difference does it make when a leaf node has the cgroup policy?
>

If A has an oom policy of "cgroup", it compares the local usage of D vs
E; "tree" would behave the same here since neither descendant has child
cgroups of its own. If A has an oom policy of "none", it compares the
processes attached to D and E and respects /proc/pid/oom_score_adj.
This allows opting in to and out of cgroup aware selection not only for
the whole system but also per subtree.
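
A minimal sketch of how that could be configured with the proposed
tunable (the cgroup v2 paths are illustrative):

    # Compare D and E by their local usage when A is oom:
    echo cgroup > /sys/fs/cgroup/A/memory.oom_policy

    # Or fall back to per-process selection under A, honoring
    # /proc/pid/oom_score_adj:
    echo none > /sys/fs/cgroup/A/memory.oom_policy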

> > Hmm, I'm not sure why we would limit memory.oom_group to any policy. Even
> > if we are selecting a process, even without selecting cgroups as victims,
> > killing a process may still render an entire cgroup useless and it makes
> > sense to kill all processes in that cgroup. If an unlucky process is
> > selected with today's heuristic of oom_badness() or with a "none" policy
> > with my patchset, I don't see why we can't enable the user to kill all
> > other processes in the cgroup. It may not make sense for some trees,
> > but I think it could be useful for others.
>
> My intuition screams here. I will think about this some more but I would
> be really curious about any sensible usecase where you want to sacrifice
> the whole gang just because one process, compared to other processes or
> cgroups, is too large. Do you see how you are mixing entities here?
>

It's a property of the workload that has nothing to do with selection.
Regardless of how a victim is selected, we need a victim. The workload
may be able to tolerate the loss of that process, which need not even be
its largest memory hogging process given /proc/pid/oom_score_adj
(periodic cleanups, logging, and stat collection are the cases I'm most
familiar with). Or the process may be vital to the workload, and it's
better to kill the entire job; it's highly dependent on what the job is.
There's a general usecase for memory.oom_group behavior without any
selection changes: we've had a killall tunable for years and it is used
by many customers for the same reason. There's no reason for it to be
coupled to selection; it can exist independent of any cgroup aware
policy.
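
As a sketch of the proposed semantics (paths are illustrative), this
keeps today's per-process selection but tears the whole job down once a
victim is chosen inside it, with B taken from the hierarchy above:

    # System-wide per-process victim selection, as today:
    echo none > /sys/fs/cgroup/memory.oom_policy
    # But if a process attached to B is chosen, kill all of B:
    echo 1 > /sys/fs/cgroup/B/memory.oom_group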

> I do not understand. Get back to our example. Are you saying that G
> with none will enforce the none policy to C and root? If yes then this
> doesn't make any sense because you are not really able to delegate the
> oom policy down the tree at all. It would effectively make tree policy
> pointless.
>

The oom policy of G is pointless: it has no child cgroups, so it is not
the root of any subtree, and it can be "none", "cgroup", or "tree"
without effect. (The oom policy of the root mem cgroup is likewise
irrelevant if there are no other cgroups.) If G is oom, it kills its
largest process, or everything attached to it since memory.oom_group is
set in your example.

> I am skipping the rest of the following text because it is picking
> on details and the whole design is not clear to me. So could you start
> over documenting semantic and requirements. Ideally by describing:
> - how does the policy on the root of the OOM hierarchy control the
> selection policy

If "none", there's no difference than Linus's tree right now. If
"cgroup", it enables cgroup aware selection: it compares all cgroups on
the system wrt local usage unless that cgroup has "tree" set in which case
its usage is hierarchical.
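
Using the hierarchy from your example, that could look like the
following (again, paths are illustrative):

    # Cgroup aware selection at the top level:
    echo cgroup > /sys/fs/cgroup/memory.oom_policy
    # C competes on the hierarchical usage of C, F, and G rather
    # than on its local usage alone:
    echo tree > /sys/fs/cgroup/C/memory.oom_policy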

> - how does the per-memcg policy act during the tree walk - for both
> intermediate nodes and leaves

The oom policy is determined by the mem cgroup that is under oom, i.e.
the root of the oom subtree; its policy dictates how a victim mem
cgroup is selected.

> - how does the oom killer act based on the selected memcg

That's the point of memory.oom_group: once a victim mem cgroup has been
selected (if cgroup aware behavior is enabled for the oom subtree, which
could be the root), the largest memory hogging process attached to that
subtree is killed, or everything attached to it is killed if
memory.oom_group is enabled.

> - how do you compare tasks with memcgs
>

You don't. I think the misunderstanding is about what happens if the
root of a subtree is "cgroup", for example, and a descendant has "none"
enabled. The root is under oom, so it is comparing cgroups :) "None" is
only effective when that subtree root is itself oom, in which case
process usage is considered.
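
Concretely, as an illustrative sketch:

    # An oom at A's limit compares D vs E as cgroups:
    echo cgroup > /sys/fs/cgroup/A/memory.oom_policy
    # An oom confined to D's own limit falls back to per-process
    # selection among D's tasks:
    echo none > /sys/fs/cgroup/A/D/memory.oom_policy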

The point is that all the functionality available in -mm is still
available: just dictate "cgroup" everywhere, and it becomes a decision
that can change per subtree, if necessary, without any mount option that
would become obsoleted. Then, make memory.oom_group possible without
any specific selection policy, since it's useful on its own.
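
As a rough sketch, an admin could opt the whole system in that way,
rather than doing so via a mount option (assuming a cgroup v2 mount at
/sys/fs/cgroup):

    # Enable cgroup aware selection for every memcg:
    for f in $(find /sys/fs/cgroup -name memory.oom_policy); do
            echo cgroup > "$f"
    done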

Let me give you a concrete example based on your earlier /admins,
/teachers, /students example. We oversubscribe the /students subtree for
the case where /admins and /teachers aren't using their memory. We say
100 students can use 1GB each, but the limit of /students is actually
200GB. 100 students using 1GB each won't cause a system oom; we control
that with the limits of /admins and /teachers. But we allow students to
use memory that isn't in use by /admins and /teachers when it's there,
opening overconsumers up to the possibility of oom kill. (This is a
real world example with batch job scheduling, it's anything but
hypothetical.)
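
That setup might look like the following sketch, with names and sizes
taken from the example:

    # Nominal entitlement is 1GB per student, but let the subtree
    # consume idle memory up to 200GB:
    echo 200G > /sys/fs/cgroup/students/memory.max
    # No hard 1GB cap per student; overconsumers simply risk oom
    # kill when /admins or /teachers want their memory back.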

/students/michal logs in, and he has complete control over his subtree.
He's going to start several jobs, all in their own cgroups, with usage
well over 1GB, but if he's oom killed he wants the right thing oom killed.

Obviously this completely breaks down if the -mm functionality is used:
if he has 10 jobs using 512MB each, another student using more than 1GB
who isn't using child cgroups is going to be oom killed instead, even
though michal is using 5GB. We've discussed that ad nauseam, and it's
why I introduced "tree".

But now look at the API. /students/michal is using child cgroups, but
which selection policy is in effect? Will it kill the most memory
hogging process in his subtree, or the most memory hogging process from
the most memory hogging cgroup? It's an important distinction because
the outcome depends directly on how he constructs his hierarchy: if he
is locked into one selection logic, the least important job *must* be in
the highest consuming cgroup; otherwise, his /proc/pid/oom_score_adj is
respected. He *must* query the mount option to know which.

But now let's say that memory.oom_policy is merged, giving him the
control to do per-process, per-cgroup, or per-subtree oom killing based
on how he defines it. The mount option doesn't mean anything anymore;
in fact, it can mean the complete opposite of what actually happens.
That's the direct objection to the mount option. Since I have systems
with thousands of cgroups in hundreds of subtrees, and over 100
workgroups that define, sometimes very creatively, how to select oom
victims, I'm an advocate for an extensible interface that is useful for
general purposes, doesn't remove any functionality, and doesn't have
contradicting specifications.