Re: [PATCH] mm,oom_kill: Close race window of needlessly selecting new victims.

From: Michal Hocko
Date: Wed Jun 21 2017 - 09:19:16 EST

On Tue 20-06-17 15:12:55, David Rientjes wrote:
> This doesn't prevent serial oom killing for either the system oom killer
> or for the memcg oom killer.
> The oom killer cannot detect tsk_is_oom_victim() if the task has either
> been removed from the tasklist or has already done cgroup_exit(). For
> memcg oom killings in particular, cgroup_exit() is usually called very
> shortly after the oom killer has sent the SIGKILL. If the oom reaper does
> not fail (for example by failing to grab mm->mmap_sem) before another
> memcg charge after cgroup_exit(victim), additional processes are killed
> because the iteration does not view the victim.
> This easily kills all processes attached to the memcg with no memory
> freeing from any victim.

It took me some time to decipher the above, but you are right. Pinning
mm_users prevents the exit path from reaching exit_mmap, and that can
indeed cause another premature oom kill, because the task might be
unhashed or removed from the memcg before the oom reaper has a chance to
reap it. Thanks for pointing this out. This means that we either have to
reimplement the unhashing/cgroup_exit for oom victims or go back to
allowing the oom reaper to race with exit_mmap. The latter sounds much
easier to me.
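
To make the failure mode concrete, here is a rough userspace toy model of
the scenario, not kernel code. Every name in it (toy_task, toy_memcg_oom,
toy_count_kills) is made up for illustration, and it assumes a victim's
memory is never freed during the run because its mm stays pinned. The only
thing that varies is whether the victim detaches from the memcg right
after being killed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model only: these types and helpers are hypothetical, not kernel APIs. */
struct toy_task {
    bool in_memcg;    /* false once the task has done its cgroup_exit() step */
    bool oom_victim;  /* set when the toy oom killer selects this task */
    bool mm_freed;    /* true once the address space has been torn down */
};

/*
 * One memcg oom invocation. Walk the tasks still attached to the memcg.
 * If a victim whose memory is not freed yet is visible, back off.
 * Otherwise select the first eligible task.
 * Returns the index of the newly selected victim, or -1 if nothing was killed.
 */
int toy_memcg_oom(struct toy_task *tasks, size_t n)
{
    int candidate = -1;

    assert(tasks != NULL);
    for (size_t i = 0; i < n; i++) {
        if (!tasks[i].in_memcg)
            continue;   /* detached tasks are invisible to the iteration */
        if (tasks[i].oom_victim && !tasks[i].mm_freed)
            return -1;  /* existing victim seen: do not select another */
        if (candidate < 0 && !tasks[i].oom_victim)
            candidate = (int)i;
    }
    if (candidate >= 0)
        tasks[candidate].oom_victim = true;
    return candidate;
}

/*
 * Run `charges` consecutive oom invocations over `ntasks` tasks (at most 16).
 * With detach_early set, each victim leaves the memcg right after being
 * killed while its memory stays unfreed (mm pinned in this model).
 * Returns the number of tasks killed.
 */
int toy_count_kills(size_t ntasks, int charges, bool detach_early)
{
    struct toy_task tasks[16];
    int kills = 0;

    assert(ntasks <= 16);
    for (size_t i = 0; i < ntasks; i++)
        tasks[i] = (struct toy_task){ .in_memcg = true };

    for (int c = 0; c < charges; c++) {
        int victim = toy_memcg_oom(tasks, ntasks);

        if (victim < 0)
            continue;   /* backed off, or nobody left to kill */
        kills++;
        if (detach_early)
            tasks[victim].in_memcg = false;
    }
    return kills;
}
```

In this model, toy_count_kills(4, 4, false) kills one task, because every
later invocation still sees the first victim and backs off. With
toy_count_kills(4, 4, true) all four tasks are killed and no memory is
freed by any of them, which is the serial killing described above.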

I was offline for the last two days, but I will revisit my original idea ASAP.

Michal Hocko