Re: Re: [patch] mm, oom: prevent additional oom kills before memory is freed

From: Tetsuo Handa
Date: Fri Jun 16 2017 - 06:27:34 EST


Michal Hocko wrote:
> On Fri 16-06-17 09:54:34, Tetsuo Handa wrote:
> [...]
> > And the patch you proposed is broken.
>
> Thanks for your testing!
>
> > ----------
> > [ 161.846202] Out of memory: Kill process 6331 (a.out) score 999 or sacrifice child
> > [ 161.850327] Killed process 6331 (a.out) total-vm:4172kB, anon-rss:84kB, file-rss:0kB, shmem-rss:0kB
> > [ 161.858503] ------------[ cut here ]------------
> > [ 161.861512] kernel BUG at mm/memory.c:1381!
>
> BUG_ON(addr >= end) suggests our vma has trimmed. I guess I see what is
> going on here.
> __oom_reap_task_mm exit_mmap
> free_pgtables
> up_write(mm->mmap_sem)
> down_read_trylock(&mm->mmap_sem)
> remove_vma
> unmap_page_range
>
> So we need to extend the mmap_sem coverage. See the updated diff (not
> the full proper patch yet).

That diff is still wrong. We need to prevent __oom_reap_task_mm() from calling
unmap_page_range() when __mmput() already called exit_mm(), by setting/checking
MMF_OOM_SKIP like shown below.

diff --git a/kernel/fork.c b/kernel/fork.c
index e53770d..5ef715c 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -902,6 +902,11 @@ static inline void __mmput(struct mm_struct *mm)
exit_aio(mm);
ksm_exit(mm);
khugepaged_exit(mm); /* must run before exit_mmap */
+ /*
+ * oom reaper might race with exit_mmap so make sure we won't free
+ * page tables under its feet
+ */
+ down_write(&mm->mmap_sem);
exit_mmap(mm);
mm_put_huge_zero_page(mm);
set_mm_exe_file(mm, NULL);
@@ -913,6 +918,7 @@ static inline void __mmput(struct mm_struct *mm)
if (mm->binfmt)
module_put(mm->binfmt->module);
set_bit(MMF_OOM_SKIP, &mm->flags);
+ up_write(&mm->mmap_sem);
mmdrop(mm);
}

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 04c9143..98cca19 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -493,12 +493,7 @@ static bool __oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
goto unlock_oom;
}

- /*
- * increase mm_users only after we know we will reap something so
- * that the mmput_async is called only when we have reaped something
- * and delayed __mmput doesn't matter that much
- */
- if (!mmget_not_zero(mm)) {
+ if (test_bit(MMF_OOM_SKIP, &mm->flags)) {
up_read(&mm->mmap_sem);
goto unlock_oom;
}
@@ -537,13 +532,6 @@ static bool __oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
K(get_mm_counter(mm, MM_FILEPAGES)),
K(get_mm_counter(mm, MM_SHMEMPAGES)));
up_read(&mm->mmap_sem);
-
- /*
- * Drop our reference but make sure the mmput slow path is called from a
- * different context because we shouldn't risk we get stuck there and
- * put the oom_reaper out of the way.
- */
- mmput_async(mm);
unlock_oom:
mutex_unlock(&oom_lock);
return ret;



>
> > Please carefully consider the reason why there is VM_BUG_ON() in __mmput(),
> > and clarify in your patch that what are possible side effects of racing
> > uprobe_clear_state()/exit_aio()/ksm_exit()/exit_mmap() etc. with
> > __oom_reap_task_mm()
>
> Yes that definitely needs to be checked. We basically rely on the racing
> part of the __mmput to not modify the address space. oom_reaper doesn't
> touch any vma state except it unmaps pages which can run in parallel.
> exit_aio->kill_ioctx seemingly does vm_munmap but it a) uses the
> mmap_sem for write and b) it doesn't actually unmap because exit_aio
> does ctx->mmap_size = 0. {ksm,khugepaged}_exit just do some houskeeping
> which is not modifying the address space. I hope I will find some more
> time to work on this next week. Additional test would be highly
> appreciated of course.

Since the OOM reaper does not reap hugepages, khugepaged_exit() part could be
safe. But ksm_exit() part might interfere. If it is guaranteed to be safe,
what will go wrong if we move uprobe_clear_state()/exit_aio()/ksm_exit() etc.
to just before mmdrop() (i.e. after setting MMF_OOM_SKIP) ?