Re: [patch -mm 3/4] mm, memcg: replace memory.oom_group with policy tunable

From: David Rientjes
Date: Wed Jan 17 2018 - 17:14:58 EST


On Wed, 17 Jan 2018, Tejun Heo wrote:

> Hello, David.
>

Hi Tejun!

> > The behavior of killing an entire indivisible memory consumer, enabled
> > by memory.oom_group, is an oom policy itself. It specifies that all
>
> I thought we discussed this before but maybe I'm misremembering.
> There are two parts to the OOM policy. One is victim selection, the
> other is the action to take thereafter.
>
> The two are different and conflating the two don't work too well. For
> example, please consider what should be given to the delegatee when
> delegating a subtree, which often is a good excercise when designing
> these APIs.
>
> When a given workload is selected for OOM kill (IOW, selected to free
> some memory), whether the workload can handle individual process kills
> or not is the property of the workload itself. Some applications can
> safely handle some of its processes picked off and killed. Most
> others can't and want to be handled as a single unit, which makes it a
> property of the workload.
>

Yes, this is a valid point. The policy of "tree" and "all" are identical
policies and then the mechanism differs wrt to whether one process is
killed or all eligible processes are killed, respectively. My motivation
for this was to avoid having two different tunables, especially because
later we'll need a way for userspace to influence the decisionmaking to
protect (bias against) important subtrees. What would really be nice is
cgroup.subtree_control-type behavior where we could effect a policy and a
mechanism at the same time. It's not clear how that would be handled to
allow only one policy and one mechanism, however, in a clean way. The
simplest for the user would be a new file, to specify the mechanism and
leave memory.oom_policy alone. Would another file really be warranted?
Not sure.

> That makes sense in the hierarchy too because whether one process or
> the whole workload is killed doesn't infringe upon the parent's
> authority over resources which in turn implies that there's nothing to
> worry about how the parent's groupoom setting should constrain the
> descendants.
>
> OOM victim selection policy is a different beast. As you've mentioned
> multiple times, especially if you're worrying about people abusing OOM
> policies by creating sub-cgroups and so on, the policy, first of all,
> shouldn't be delegatable and secondly should have meaningful
> hierarchical restrictions so that a policy that an ancestor chose
> can't be nullified by a descendant.
>

The goal here was to require a policy of either "tree" or "all" that the
user can't change. They are allowed to have their own oom policies
internal to their subtree, however, for oom conditions in that subtree
alone. However, if the common ancestor hits its limit, it is forced to
either be "tree" or "all" and require hierarchical usage to be considered
instead of localized usage. Either "tree" or "all" is appropriate, and
this may be why you brought up the point about separating them out, i.e.
the policy can be demanded by the common ancestor but the actual mechanism
that the oom killer uses, kill either a single process or the full cgroup,
is left to the user depending on their workload. That sounds reasonable
and I can easily separate the two by introducing a new file, similar to
memory.oom_group but in a more extensible way so that it is not simply a
selection of either full cgroup kill or single process.

> I'm not necessarily against adding hierarchical victim selection
> policy tunables; however, I am skeptical whether static tunables on
> cgroup hierarchy (including selectable policies) can be made clean and
> versatile enough, especially because the resource hierarchy doesn't
> necessarily, or rather in most cases, match the OOM victim selection
> decision tree, but I'd be happy to be proven wrong.
>

Right, and I think that giving users control over their subtrees is a
powerful tool and one that can lead to very effective use of the cgroup v2
hierarchy. Being able to circumvent the oom selection by creating child
cgroups is certainly something that can trivially be prevented. The
argument that users can currently divide their entire processes into
several different smaller processes to circumvent today's heuristic
doesn't mean we can't have "tree"-like comparisons between cgroups to
address that issue itself since all processes charge to the tree itself.

I became convinced of this when I saw the real-world usecases that would
use such a feature on cgroup v2: we want to have hierarchical usage for
comparison when full subtrees are dedicated to individual consumers, for
example, and local mem cgroup usage for comparison when using hierarchies
for top-level /admins and /students cgroups for which Michal provided an
example. These can coexist on systems and it's clear that there needs to
be a system-wide policy decision for the cgroup aware oom killer (the idea
behind the current mount option, which isn't needed anymore). So defining
the actual policy, and mechanism as you pointed out, for subtrees is a
very powerful tool, it's extensible, doesn't require a system to either
fully enable or disable, and doesn't require a remount of cgroup v2 to
change.

> Without explicit configurations, the only thing the OOM killer needs
> to guarantee is that the system can make forward progress. We've
> always been tweaking victim selection with or without cgroup and
> absolutely shouldn't be locked into a specific heuristics. The
> heuristics is an implementaiton detail subject to improvements.
>
> To me, your patchset actually seems to demonstrate that these are
> separate issues. The goal of groupoom is just to kill logical units
> as cgroup hierarchy can inform the kernel of how workloads are
> composed in the userspace. If you want to improve victim selection,
> sure, please go ahead, but your argument that groupoom can't be merged
> because of victim selection policy doesn't make sense to me.
>

memory.oom_group, the mechanism behind what the oom killer chooses to do
after victim selection, is not implemented without the selection heuristic
comparing cgroups as indivisible memory consumers. It could be done first
prior to introducing the new selection criteria. We don't have patches
for that right now, because Roman's work introduces it in the opposite
order. If it is acceptable to add a separate file solely for this
purpose, it's rather trivial to do. My other thought was some kind of
echo "hierarchy killall" > memory.oom_policy where the policy and an
(optional) mechanism could be specified. Your input on the actual
tunables would be very valuable.