Re: [PATCH -mm v2 1/3] mm/oom_kill: remove the wrong fatal_signal_pending() check in oom_kill_process()

From: Michal Hocko
Date: Fri Oct 02 2015 - 08:11:49 EST


On Fri 02-10-15 20:32:41, Tetsuo Handa wrote:
> Oleg Nesterov wrote:
> > On 10/01, Michal Hocko wrote:
> > >
> > > zap_process will add SIGKILL to all threads but the
> > > current which will go on without being killed and if this is not a
> > > thread group leader then we would miss it.
> >
> > Yes. And note that de_thread() does the same. Speaking of oom-killer
> > this is mostly fine, the execing thread is going to release its old
> > ->mm and it has already passed the copy_strings() stage which can use
> > a lot more memory.
>
> So, we have the same wrong fatal_signal_pending() check in out_of_memory()
>
> 	/*
> 	 * If current has a pending SIGKILL or is exiting, then automatically
> 	 * select it. The goal is to allow it to allocate so that it may
> 	 * quickly exit and free its memory.
> 	 *
> 	 * But don't select if current has already released its mm and cleared
> 	 * TIF_MEMDIE flag at exit_mm(), otherwise an OOM livelock may occur.
> 	 */
> 	if (current->mm &&
> 	    (fatal_signal_pending(current) || task_will_free_mem(current))) {
> 		mark_oom_victim(current);
> 		return true;
> 	}
>
> because it is possible that T starts the coredump, T sends SIGKILL to P,
> P calls out_of_memory() on GFP_FS allocation, P misses to set SIGKILL on T?

So what? P will get access to memory reserves to move on with the
allocation. This has nothing to do with the other thread. If the current
thread (P) doesn't release any memory we would get to the regular OOM
killer path and eventually send the signal.

This is a simple heuristic to prevent unnecessary killing. If it
doesn't help we should still be able to kill something. If you really
want to guard against the unlikely case where current is the only task
in the OOM path and TIF_MEMDIE didn't help, then disable the heuristic
when current already has TIF_MEMDIE set.
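
IOW something along these lines (completely untested, just to illustrate
the idea):

	/*
	 * Untested sketch: skip the shortcut when current already has
	 * TIF_MEMDIE so that a repeated OOM invocation falls through to
	 * the regular victim selection instead of relying on the memory
	 * reserves again.
	 */
	if (current->mm && !test_thread_flag(TIF_MEMDIE) &&
	    (fatal_signal_pending(current) || task_will_free_mem(current))) {
		mark_oom_victim(current);
		return true;
	}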

> Since T sends SIGKILL to all clone(CLONE_VM) tasks upon coredump, P needs
> to do

It does that only for threads in the _same_ thread group AFAIU.
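
To make sure we are talking about the same thing, this is roughly the
walk I have in mind (simplified from memory, not the exact upstream
code, and locking is omitted):

	static void zap_process_sketch(struct task_struct *start)
	{
		struct task_struct *t;

		/* only threads in start's thread group are visited */
		for_each_thread(start, t) {
			if (t == current || !t->mm)
				continue;
			sigaddset(&t->pending.signal, SIGKILL);
			signal_wake_up(t, 1);
		}
	}

so tasks which merely share the mm via clone(CLONE_VM) are not touched
at this point.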

>
> 	rcu_read_lock();
> 	for_each_process(p) {
> 		if (!process_shares_mm(p, current->mm))
> 			continue;
> 		if (unlikely(p->flags & PF_KTHREAD))
> 			continue;
> 		if (p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN)
> 			continue;
>
> 		do_send_sig_info(SIGKILL, SEND_SIG_FORCED, p, true);
> 	}
> 	rcu_read_unlock();
>
> after mark_oom_victim(current) in case T is not in the same thread group?

What does this have to do with coredumping? Have you fallen into the
same confusion trap I did yesterday?

> If yes, what happens if some task failed to receive SIGKILL due to
> p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN condition?
> Will we hit mm->mmap_sem livelock?

I am not sure which livelock you mean here (exit_mm not being able to
proceed because some other task sharing the mm is holding mmap_sem for
write and looping in the allocator)?
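
That is, something along these lines (purely illustrative, the usual
two tasks sharing the same mm):

	task A (not killed, e.g. OOM_SCORE_ADJ_MIN):
		down_write(&mm->mmap_sem);
		alloc_pages(GFP_KERNEL);	/* loops waiting for memory */

	task B (TIF_MEMDIE victim):
		do_exit();
		  exit_mm();
		    down_read(&mm->mmap_sem);	/* blocks behind A, so the
						   memory is never freed */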

Anyway, tasks sharing an mm while one of them has OOM_SCORE_ADJ_MIN is
basically a misconfiguration IMHO. I wouldn't lose sleep over it to be
honest. And yes, if one such process is pinning the address space then
our chances of ever releasing that memory are slim.
--
Michal Hocko
SUSE Labs