Re: [v5 2/4] mm, oom: cgroup-aware OOM killer
From: Roman Gushchin
Date: Tue Aug 15 2017 - 08:16:34 EST
On Mon, Aug 14, 2017 at 03:42:54PM -0700, David Rientjes wrote:
> On Mon, 14 Aug 2017, Roman Gushchin wrote:
> > +
> > +static long oom_evaluate_memcg(struct mem_cgroup *memcg,
> > + const nodemask_t *nodemask)
> > +{
> > + struct css_task_iter it;
> > + struct task_struct *task;
> > + int elegible = 0;
> > +
> > + css_task_iter_start(&memcg->css, 0, &it);
> > + while ((task = css_task_iter_next(&it))) {
> > + /*
> > + * If there are no tasks, or all tasks have oom_score_adj set
> > + * to OOM_SCORE_ADJ_MIN and oom_kill_all_tasks is not set,
> > + * don't select this memory cgroup.
> > + */
> > + if (!elegible &&
> > + (memcg->oom_kill_all_tasks ||
> > + task->signal->oom_score_adj != OOM_SCORE_ADJ_MIN))
> > + elegible = 1;
>
> I'm curious about the decision made in this conditional and how
> oom_kill_memcg_member() ignores task->signal->oom_score_adj. It means
> that memory.oom_kill_all_tasks overrides /proc/pid/oom_score_adj if it
> should otherwise be disabled.
>
> It's undocumented in the changelog, but I'm questioning whether it's the
> right decision. Doesn't it make sense to kill all tasks that are not oom
> disabled, and allow the user to still protect certain processes by their
> /proc/pid/oom_score_adj setting? Otherwise, there's no way to do that
> protection without a sibling memcg and its own reservation of memory. I'm
> thinking about a process that governs jobs inside the memcg and if there
> is an oom kill, it wants to do logging and any cleanup necessary before
> exiting itself. It seems like a powerful combination if coupled with oom
> notification.
Good question!
I think the ability to override any oom_score_adj value and get all tasks
killed is more important than the ability to kill all processes with some
exceptions.
In your example, someone still needs to look after the remaining process
and kill it after some timeout if it doesn't quit by itself, right?
The special treatment of the -1000 value (without oom_kill_all_tasks)
is required only to avoid breaking existing setups.
Generally, oom_score_adj should have meaning only at the cgroup level,
so extending it to the system level doesn't sound like a good idea.
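To make the eligibility rule from the quoted hunk concrete, here is a minimal
userspace sketch. It is not the kernel code: struct memcg_model and
memcg_model_eligible() are hypothetical names used only for illustration, and
the task list is modelled as a plain array of oom_score_adj values. Only
OOM_SCORE_ADJ_MIN (-1000) is taken from the kernel.

```c
#include <stdbool.h>
#include <stddef.h>

#define OOM_SCORE_ADJ_MIN (-1000)  /* same value as the kernel constant */

/* Hypothetical userspace stand-in for a memory cgroup; not the kernel struct. */
struct memcg_model {
	bool oom_kill_all_tasks;     /* models memory.oom_kill_all_tasks */
	const short *oom_score_adj;  /* one entry per task in the cgroup */
	size_t nr_tasks;
};

/*
 * A cgroup is selectable if it has at least one task and either
 * oom_kill_all_tasks is set or some task is not OOM-disabled.
 */
static bool memcg_model_eligible(const struct memcg_model *memcg)
{
	for (size_t i = 0; i < memcg->nr_tasks; i++) {
		if (memcg->oom_kill_all_tasks ||
		    memcg->oom_score_adj[i] != OOM_SCORE_ADJ_MIN)
			return true;
	}
	return false;  /* no tasks, or every task is protected */
}
```

In this model, a cgroup whose tasks all have -1000 is skipped unless
oom_kill_all_tasks is set, and an empty cgroup is never selected, because
the check only runs while iterating over tasks.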
>
> Also, s/elegible/eligible/
Shame on me :)
Will fix, thanks!
>
> Otherwise, looks good!
Great!
Thank you for reviewing and testing.
Roman