Re: [patch] mm, oom: prevent additional oom kills before memory is freed
From: Michal Hocko
Date: Fri Jun 16 2017 - 10:13:08 EST
On Fri 16-06-17 21:22:20, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > OK, could you play with the patch/idea suggested in
> > http://lkml.kernel.org/r/20170615122031.GL1486@xxxxxxxxxxxxxx?
>
> I think we don't need to worry about mmap_sem dependency inside __mmput().
> Since the OOM killer checks for !MMF_OOM_SKIP mm rather than TIF_MEMDIE thread,
> we can keep the OOM killer disabled until we set MMF_OOM_SKIP to the victim's mm.
> That is, elevating mm_users throughout the reaping procedure does not cause
> premature victim selection, even after TIF_MEMDIE is cleared from the victim's
> thread. Then, we don't need to use down_write()/up_write() for non OOM victim's mm
> (nearly 100% of exit_mmap() calls), and can force partial reaping of OOM victim's mm
> (nearly 0% of exit_mmap() calls) before __mmput() starts doing exit_aio() etc.
> Patch is shown below. Only compile tested.
Yes, that would be another approach.
> include/linux/sched/coredump.h | 1 +
> mm/oom_kill.c | 80 ++++++++++++++++++++----------------------
> 2 files changed, 40 insertions(+), 41 deletions(-)
>
> diff --git a/include/linux/sched/coredump.h b/include/linux/sched/coredump.h
> index 98ae0d0..6b6237b 100644
> --- a/include/linux/sched/coredump.h
> +++ b/include/linux/sched/coredump.h
> @@ -62,6 +62,7 @@ static inline int get_dumpable(struct mm_struct *mm)
> * on NFS restore
> */
> //#define MMF_EXE_FILE_CHANGED 18 /* see prctl_set_mm_exe_file() */
> +#define MMF_OOM_REAPING 18 /* mm is supposed to be reaped */
A new flag is not really needed. We can increase it for _each_ reapable
oom victim.
> @@ -658,6 +643,13 @@ static void mark_oom_victim(struct task_struct *tsk)
> if (!cmpxchg(&tsk->signal->oom_mm, NULL, mm))
> mmgrab(tsk->signal->oom_mm);
>
> +#ifdef CONFIG_MMU
> + if (!test_bit(MMF_OOM_REAPING, &mm->flags)) {
> + set_bit(MMF_OOM_REAPING, &mm->flags);
> + mmget(mm);
> + }
> +#endif
This would really need a big fat warning explaining why we do not need
mmget_not_zero. We rely on exit_mm doing both mmput and tsk->mm = NULL
under the task_lock and mark_oom_victim is called under this lock as
well and task_will_free_mem resp. find_lock_task_mm makes sure we do not
even consider tasks wihout mm.
I agree that a solution which is fully contained inside the oom proper
would be preferable to touching __mmput path.
--
Michal Hocko
SUSE Labs