Re: [PATCH 08/10] mm, oom: task_will_free_mem should skip oom_reaped tasks

From: Michal Hocko
Date: Fri Jun 17 2016 - 08:57:01 EST


On Fri 17-06-16 20:35:38, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > From: Michal Hocko <mhocko@xxxxxxxx>
> >
> > 0-day robot has encountered the following:
> > [ 82.694232] Out of memory: Kill process 3914 (trinity-c0) score 167 or sacrifice child
> > [ 82.695110] Killed process 3914 (trinity-c0) total-vm:55864kB, anon-rss:1512kB, file-rss:1088kB, shmem-rss:25616kB
> > [ 82.706724] oom_reaper: reaped process 3914 (trinity-c0), now anon-rss:0kB, file-rss:0kB, shmem-rss:26488kB
> > [ 82.715540] oom_reaper: reaped process 3914 (trinity-c0), now anon-rss:0kB, file-rss:0kB, shmem-rss:26900kB
> > [ 82.717662] oom_reaper: reaped process 3914 (trinity-c0), now anon-rss:0kB, file-rss:0kB, shmem-rss:26900kB
> > [ 82.725804] oom_reaper: reaped process 3914 (trinity-c0), now anon-rss:0kB, file-rss:0kB, shmem-rss:27296kB
> > [ 82.739091] oom_reaper: reaped process 3914 (trinity-c0), now anon-rss:0kB, file-rss:0kB, shmem-rss:28148kB
> >
> > oom_reaper is trying to reap the same task again and again. This
> > is possible only when the oom killer is bypassed because of
> > task_will_free_mem because we skip over tasks with MMF_OOM_REAPED
> > already set during select_bad_process. Teach task_will_free_mem to skip
> > over MMF_OOM_REAPED tasks as well because they will be unlikely to free
> > anything more.
>
> I agree that we need to prevent same mm from being selected forever. But I
> feel worried about this patch. We are reaching a stage what purpose we set
> TIF_MEMDIE for. mark_oom_victim() sets TIF_MEMDIE on a thread with oom_lock
> held. Thus, if a mm which the TIF_MEMDIE thread is using is reapable (likely
> yes), __oom_reap_task() will likely be the next thread which will get that lock
> because __oom_reap_task() uses mutex_lock(&oom_lock) whereas other threads
> using that mm use mutex_trylock(&oom_lock). As a result, regarding CONFIG_MMU=y
> kernels, I guess that
>
> if (task_will_free_mem(current)) {
>
> shortcut in out_of_memory() likely becomes an useless condition. Since the OOM
> reaper will quickly reap mm and set MMF_OOM_REAPED on that mm and clear
> TIF_MEMDIE, other threads using that mm will fail to get TIF_MEMDIE (because
> task_will_free_mem() will start returning false due to this patch) and proceed
> to next OOM victim selection.

I suspect you are overthinking this. Just try to imagine what would have
to happen in order to get another victim:

CPU1 CPU2
__alloc_pages_slowpath
__alloc_pages_may_oom
mutex_lock(oom_lock)
out_of_memory
task_will_free_mem
mark_oom_victim
wake_oom_reaper
__oom_reap_task
mutex_unlock(oom_lock)
mutex_lock(oom_lock)
unmap_page_range # For all VMAs
tlb_finish_mmu
set_bit(MMF_OOM_REAPED)
mutex_unlock(oom_lock)

<back in allocator with access to memory reserves>

__alloc_pages_may_oom
mutex_lock()
out_of_memory
exit_oom_victim
task_will_free_mem # False

There will a large window when the current will have TIF_MEMDIE and
there will be memory freed by the oom reaper to get us out of the
mess. Even if that wasn't the case and the address space is not really
reapable then the victim had quite some time to use memory reserves and
move on. And if even that didn't help then it is really hard to judge
whether the victim would benefit from more time.

That being said even if the TIF_MEMDIE wouldn't be used (which is
unlikely because tearing down the address space is likely to take some
time) then the reaper will be freeing memory in the background to help
get away from OOM.

Or did you have any other scenario in mind?

> The comment

> * That thread will now get access to memory reserves since it has a
> * pending fatal signal.
>
> in oom_kill_process() became almost dead. Since we need a short delay in order
> to allow get_page_from_freelist() to allocate from memory reclaimed by
> __oom_reap_task(), this patch might increase possibility of excessively
> preventing OOM-killed threads from using ALLOC_NO_WATERMARKS via TIF_MEMDIE
> and increase possibility of needlessly selecting next OOM victim.

It seems that you are assuming that the oom reaper will not reclaim much
memory. Even if that was the case (e.g. large amount of memory which is
not not directly bound to mm like socket and other kernel buffers but
even then this would be hardly a new problem introduced by this patch
because many of those resources are deallocated past exit_mm).

> So, maybe we shouldn't let this shortcut to return false as soon as
> MMF_OOM_REAPED is set.

What would be an alternative?

--
Michal Hocko
SUSE Labs