Re: oom && coredump

From: Oleg Nesterov
Date: Thu Nov 27 2014 - 12:48:08 EST


On 11/27, Michal Hocko wrote:
>
> On Mon 20-10-14 21:56:18, Oleg Nesterov wrote:
> > speaking of "partial" fixes for oom problems...
> >
> > Perhaps the patch below makes sense? Sure, it is racy, but probably
> > better than nothing. And in any case (imo) this SIGNAL_GROUP_COREDUMP
> > check doesn't look bad, the coredumping task can consume more memory,
> > and we can't assume it is going to actually exit soon.
>
> I am not familiar with this area much (it is too scary...).
> I guess the issue is that OOM killer might try to kill a task which is
> currently in the middle of cordumping which is not killable, right? And
> if it is blocked on memory allocation then we are effectively dead
> locked. Right?
>
> Wouldn't it be better to make coredumping killable? Is this even
> possible?

It is already killable.

The problem is that oom-killer assumes that the PF_EXITING task should
exit and release its memory "soon", so oom_scan_process_thread() returns
OOM_SCAN_ABORT.

This is obviously wrong if this PF_EXITING task participates in coredump
and sleeps in exit_mm(). (iirc there are other issues with mt tasks, but
lets not discuss this now). This task won't exit until the coredumping
completes, and this can take o lot of time, more memory, etc.


> > And at least we can kill that ugly and wrong ptrace check.
>
> Why is the ptrace check wrong?

It was added in reply to exploit I sent. But:

- It doesn't (and can't) really work, it can only detect this particular
case and the same exploit still blocks oom-killer with the minimal
modifications.

- Once again, this has nothing to do with ptrace. That exploit used
ptrace only to control (freeze) the coredumping process, the coredumping
can "hang" because of other reasons.

- It is no longer needed after this patch, the coredumping process will
be killed.

So I think the patch below makes sense anyway. Although I should probably
split it and remove PT_TRACE_EXIT in 2/2.

> PF_EXITING should be set after
> ptrace_event(PTRACE_EVENT_EXIT, code). But then I can see
> unlikely(tsk->flags & PF_EXITING) check right after PTRACE_EVENT_EXIT
> notification.

Note that it checks task->group_leader, not task. But see above. This makes
no sense.

> > --- x/mm/oom_kill.c
> > +++ x/mm/oom_kill.c
> > @@ -254,6 +254,12 @@ static enum oom_constraint constrained_alloc(struct zonelist *zonelist,
> > }
> > #endif
> >
> > +static inline bool task_will_free_mem(struct task_struct *task)
> > +{
> > + return (task->flags & PF_EXITING) &&
> > + !(task->signal->flags & SIGNAL_GROUP_COREDUMP);
> > +}
> > +
> > enum oom_scan_t oom_scan_process_thread(struct task_struct *task,
> > unsigned long totalpages, const nodemask_t *nodemask,
> > bool force_kill)
> > @@ -281,14 +287,9 @@ enum oom_scan_t oom_scan_process_thread(struct task_struct *task,
> > if (oom_task_origin(task))
> > return OOM_SCAN_SELECT;
> >
> > - if (task->flags & PF_EXITING && !force_kill) {
> > - /*
> > - * If this task is not being ptraced on exit, then wait for it
> > - * to finish before killing some other task unnecessarily.
> > - */
> > - if (!(task->group_leader->ptrace & PT_TRACE_EXIT))
> > - return OOM_SCAN_ABORT;
> > - }
> > + if (task_will_free_mem(task) && !force_kill)
> > + return OOM_SCAN_ABORT;
> > +
> > return OOM_SCAN_OK;
> > }
> >
> > @@ -426,7 +427,7 @@ void oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
> > * If the task is already exiting, don't alarm the sysadmin or kill
> > * its children or threads, just set TIF_MEMDIE so it can die quickly
> > */
> > - if (p->flags & PF_EXITING) {
> > + if (task_will_free_mem(p)) {
> > set_tsk_thread_flag(p, TIF_MEMDIE);
> > put_task_struct(p);
> > return;
> > @@ -632,7 +633,7 @@ void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
> > * select it. The goal is to allow it to allocate so that it may
> > * quickly exit and free its memory.
> > */
> > - if (fatal_signal_pending(current) || current->flags & PF_EXITING) {
> > + if (fatal_signal_pending(current) || task_will_free_mem(current)) {
> > set_thread_flag(TIF_MEMDIE);
> > return;
> > }
> >
>
> --
> Michal Hocko
> SUSE Labs

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/