Re: can't oom-kill zap the victim's memory?

From: Tetsuo Handa
Date: Tue Sep 22 2015 - 10:30:24 EST


Oleg Nesterov wrote:
> On 09/22, Tetsuo Handa wrote:
> >
> > I imagined a dedicated kernel thread doing something like shown below.
> > (I don't know about mm->mmap management.)
> > mm->mmap_zapped corresponds to MMF_MEMDIE.
>
> No, it doesn't, please see below.
>
> > bool has_sigkill_task;
> > wait_queue_head_t kick_mm_zapper;
>
> OK, if this kthread is kicked by oom this makes more sense, but still
> doesn't look right at least initially.

Yes, I meant this kthread is kicked upon sending SIGKILL. But I forgot that

>
> Let me repeat, I do think we need MMF_MEMDIE or something like it before
> we do something more clever. And in fact I think this flag makes sense
> regardless.
>
> > static void mm_zapper(void *unused)
> > {
> > struct task_struct *g, *p;
> > struct mm_struct *mm;
> >
> > sleep:
> > wait_event(kick_remover, has_sigkill_task);
> > has_sigkill_task = false;
> > restart:
> > rcu_read_lock();
> > for_each_process_thread(g, p) {
> > if (likely(!fatal_signal_pending(p)))
> > continue;
> > task_lock(p);
> > mm = p->mm;
> > if (mm && mm->mmap && !mm->mmap_zapped && down_read_trylock(&mm->mmap_sem)) {
> ^^^^^^^^^^^^^^^
>
> We do not want mm->mmap_zapped, it can't work. We need mm->needs_zap
> set by oom_kill_process() and cleared after zap_page_range().
>
> Because otherwise we can not handle CLONE_VM correctly. Suppose that
> an innocent process P does vfork() and the child is killed but not
> exited yet. mm_zapper() can find the child, do zap_page_range(), and
> surprise its alive parent P which uses the same ->mm.

kill(P's-child, SIGKILL) does not kill P sharing the same ->mm.
Thus, mm_zapper() can be used for only OOM-kill case and
test_tsk_thread_flag(p, TIF_MEMDIE) should be used than
fatal_signal_pending(p).

>
> And if we rely on MMF_MEMDIE or mm->needs_zap or whaveter then
> for_each_process_thread() doesn't really make sense. And if we have
> a single MMF_MEMDIE process (likely case) then the unconditional
> _trylock is suboptimal.

I guess the more likely case is that the OOM victim successfully exits
before mm_zapper() finds it.

I thought that a dedicated kernel thread which scans the task list can do
deferred zapping by automatically retrying (in a few seconds interval ?)
when down_read_trylock() failed.

>
> Tetsuo, can't we do something simple which "obviously can't hurt at
> least" and then discuss the potential improvements?

No problem. I can wait for your version.

>
> And yes, yes, the "Kill all user processes sharing victim->mm" logic
> in oom_kill_process() doesn't 100% look right, at least wrt the change
> we discuss.

If we use test_tsk_thread_flag(p, TIF_MEMDIE), we will need to set
TIF_MEMDIE to the victim after sending SIGKILL to all processes sharing
the victim's mm. Well, the likely case that the OOM victim exits before
mm_zapper() finds it becomes not-so-likely case? Then, MMF_MEMDIE is
better than test_tsk_thread_flag(p, TIF_MEMDIE)...

>
> Oleg.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/