Re: [v9 3/5] mm, oom: cgroup-aware OOM killer

From: Roman Gushchin
Date: Tue Oct 03 2017 - 10:09:37 EST


On Tue, Oct 03, 2017 at 03:36:23PM +0200, Michal Hocko wrote:
> On Tue 03-10-17 13:37:21, Roman Gushchin wrote:
> > On Tue, Oct 03, 2017 at 01:48:48PM +0200, Michal Hocko wrote:
> [...]
> > > Wrt. to the implicit inheritance you brought up in a separate email
> > > thread [1]. Let me quote
> > > : after some additional thinking I don't think anymore that implicit
> > > : propagation of oom_group is a good idea. Let me explain: assume we
> > > : have memcg A with memory.max and memory.oom_group set, and nested
> > > : memcg A/B with memory.max set. Let's imagine we have an OOM event if
> > > : A/B. What is an expected system behavior?
> > > : We have OOM scoped to A/B, and any action should be also scoped to A/B.
> > > : We really shouldn't touch processes which are not belonging to A/B.
> > > : That means we should either kill the biggest process in A/B, either all
> > > : processes in A/B. It's natural to make A/B/memory.oom_group responsible
> > > : for this decision. It's strange to make the depend on A/memory.oom_group, IMO.
> > > : It really makes no sense, and makes oom_group knob really hard to describe.
> > > :
> > > : Also, after some off-list discussion, we've realized that memory.oom_knob
> > > : should be delegatable. The workload should have control over it to express
> > > : dependency between processes.
> > >
> > > OK, I have asked about this already but I am not sure the answer was
> > > very explicit. So let me ask again. When exactly a subtree would
> > > disagree with the parent on oom_group? In other words when do we want a
> > > different cleanup based on the OOM root? I am not saying this is wrong
> > > I am just curious about a practical example.
> >
> > Well, I do not have a practical example right now, but it's against the logic.
> > Any OOM event has a scope, and group_oom knob is applied for OOM events
> > scoped to the cgroup or any ancestors (including system as a whole).
> > So, applying it implicitly to OOM scoped to descendant cgroups makes no sense.
> > It's a strange configuration limitation, and I do not see any benefits:
> > it doesn't provide any new functionality or guarantees.
>
> Well, I guess I agree. I was merely interested about consequences when
> the oom behavior is different depending on which layer it happens. Does
> it make sense to cleanup the whole hierarchy while any subtree would
> kill a single task if the oom happened there?

By setting or not setting the oom_group knob a user is expressing
the readiness to handle the OOM by itself, e.g. looking at cgroup events,
restarting killed tasks, etc.

If workload is complex, and has some sub-parts with their own memory
constraints, it's quite possible, that it's ready to restart these parts,
but not a random process killed by the global OOM. This is actually a proper
replacement for setting oom_score_adj: let say there is memcg A,
which contains some control stuff in A/C, and several sub-workloads
A/W1, A/W2, etc. In case of global OOM, caused by system miss-configuration,
or, say, a memory leak in the control stuff, it makes perfect sense to
kill A as a whole, so we can set A/memory.oom_groups to 1.

But if there is a memory shortage in one of the workers (A/W1, for instance),
it's quite possible that killing everything is excessive. So, a user has
the freedom to decide what's the proper way to handle OOM.

>
> > Even if we don't have practical examples, we should build something less
> > surprising for a user, and I don't understand why oom_group should be inherited.
>
> I guess we want to inherit the value on the memcg creation but I agree
> that enforcing parent setting is weird. I will think about it some more
> but I agree that it is saner to only enforce per memcg value.

I'm not against, but we should come up with a good explanation, why we're
inheriting it; or not inherit.

>
> > > > Tasks with oom_score_adj set to -1000 are considered as unkillable.
> > > >
> > > > The root cgroup is treated as a leaf memory cgroup, so it's score
> > > > is compared with other leaf and oom_group memory cgroups.
> > > > The oom_group option is not supported for the root cgroup.
> > > > Due to memcg statistics implementation a special algorithm
> > > > is used for estimating root cgroup oom_score: we define it
> > > > as maximum oom_score of the belonging tasks.
> > >
> > > [1] http://lkml.kernel.org/r/20171002124712.GA17638@xxxxxxxxxxxxxxxxxxxxxxxxxxx
> > >
> > > [...]
> > > > +static long memcg_oom_badness(struct mem_cgroup *memcg,
> > > > + const nodemask_t *nodemask,
> > > > + unsigned long totalpages)
> > > > +{
> > > > + long points = 0;
> > > > + int nid;
> > > > + pg_data_t *pgdat;
> > > > +
> > > > + /*
> > > > + * We don't have necessary stats for the root memcg,
> > > > + * so we define it's oom_score as the maximum oom_score
> > > > + * of the belonging tasks.
> > > > + */
> > >
> > > Why not a sum of all tasks which would more resemble what we do for
> > > other memcgs? Sure this would require ignoring oom_score_adj so
> > > oom_badness would have to be tweaked a bit (basically split it into
> > > __oom_badness which calculates the value without the bias and
> > > oom_badness on top adding the bias on top of the scaled value).
> >
> > We've discussed it already: calculating the sum is tricky, as tasks
> > are sharing memory (and the mm struct(. As I remember, you suggested
> > using maximum to solve exactly this problem, and I think it's a good
> > approximation. Assuming that tasks in the root cgroup likely have
> > nothing in common, and we don't support oom_group for it, looking
> > at the biggest task makes perfect sense: we're exactly comparing
> > killable entities.
>
> Please add a comment explaining that. I hope we can make root memcg less
> special eventually. It shouldn't be all that hard. We already have per
> LRU numbers and we only use few counters which could be accounted to the
> root memcg as well. Counters should be quite cheap.

Sure, this is my hope too.

>
> [...]
>
> > > > @@ -962,6 +968,48 @@ static void oom_kill_process(struct oom_control *oc, const char *message)
> > > > __oom_kill_process(victim);
> > > > }
> > > >
> > > > +static int oom_kill_memcg_member(struct task_struct *task, void *unused)
> > > > +{
> > > > + if (!tsk_is_oom_victim(task)) {
> > >
> > > How can this happen?
> >
> > We do start with killing the largest process, and then iterate over all tasks
> > in the cgroup. So, this check is required to avoid killing tasks which are
> > already in the termination process.
>
> Do you mean we have tsk_is_oom_victim && MMF_OOM_SKIP == T?

No, just tsk_is_oom_victim. We're are killing the biggest task, and then _all_
tasks. This is a way to skip the biggest task, and do not kill it again.

>
> > >
> > > > + get_task_struct(task);
> > > > + __oom_kill_process(task);
> > > > + }
> > > > + return 0;
> > > > +}
> > > > +
> > > > +static bool oom_kill_memcg_victim(struct oom_control *oc)
> > > > +{
> > > > + static DEFINE_RATELIMIT_STATE(oom_rs, DEFAULT_RATELIMIT_INTERVAL,
> > > > + DEFAULT_RATELIMIT_BURST);
> > > > +
> > > > + if (oc->chosen_memcg == NULL || oc->chosen_memcg == INFLIGHT_VICTIM)
> > > > + return oc->chosen_memcg;
> > > > +
> > > > + /* Always begin with the task with the biggest memory footprint */
> > > > + oc->chosen_points = 0;
> > > > + oc->chosen_task = NULL;
> > > > + mem_cgroup_scan_tasks(oc->chosen_memcg, oom_evaluate_task, oc);
> > > > +
> > > > + if (oc->chosen_task == NULL || oc->chosen_task == INFLIGHT_VICTIM)
> > > > + goto out;
> > > > +
> > > > + if (__ratelimit(&oom_rs))
> > > > + dump_header(oc, oc->chosen_task);
> > >
> > > Hmm, does the full dump_header really apply for the new heuristic? E.g.
> > > does it make sense to dump_tasks()? Would it make sense to print stats
> > > of all eligible memcgs instead?
> >
> > Hm, this is a tricky part: the dmesg output is at some point a part of ABI,
>
> People are parsing oom reports but I disagree this is an ABI of any
> sort. The report is closely tight to the particular implementation and
> as such it has changed several times over the time.
>
> > but is also closely connected with the implementation. So I would suggest
> > to postpone this until we'll get more usage examples and will better
> > understand what information we need.
>
> I would drop tasks list at least because that is clearly misleading in
> this context because we are not selecting from all tasks. We are
> selecting between memcgs. The memcg information can be added in a
> separate patch of course.

Let's postpone it until we'll land the rest of the patchset.

Thank you!