Re: [PATCH 5/5] mm, oom_reaper: implement OOM victims queuing

From: Michal Hocko
Date: Mon Feb 15 2016 - 15:15:45 EST


On Sun 07-02-16 00:33:38, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > On Sat 06-02-16 14:54:24, Tetsuo Handa wrote:
> > > Michal Hocko wrote:
> > > > > But if we consider non system-wide OOM events, it is not very unlikely to hit
> > > > > this race. This queue is useful for situations where memcg1 and memcg2 hit
> > > > > memcg OOM at the same time and victim1 in memcg1 cannot terminate immediately.
> > > >
> > > > This can happen of course but the likelihood is _much_ smaller without
> > > > the global OOM because the memcg OOM killer is invoked from a lockless
> > > > context so the oom context cannot block the victim to proceed.
> > >
> > > Suppose mem_cgroup_out_of_memory() is called from a lockless context via
> > > mem_cgroup_oom_synchronize() called from pagefault_out_of_memory(), that
> > > "lockless" is talking about only current thread, doesn't it?
> >
> > Yes and you need the OOM context to sit on the same lock as the victim
> > to form a deadlock. So while the victim might be blocked somewhere it is
> > much less likely it would be deadlocked.
> >
> > > Since oom_kill_process() sets TIF_MEMDIE on first mm!=NULL thread of a
> > > victim process, it is possible that non-first mm!=NULL thread triggers
> > > pagefault_out_of_memory() and first mm!=NULL thread gets TIF_MEMDIE,
> > > isn't it?
> >
> > I got lost here completely. Maybe it is your usage of thread terminology
> > again.
>
> I'm using "process" == "thread group" which contains at least one "thread",
> and "thread" == "struct task_struct".
> My assumption is
>
> (1) app1 process has two threads named app1t1 and app1t2
> (2) app2 process has two threads named app2t1 and app2t2
> (3) app1t1->mm == app1t2->mm != NULL and app2t1->mm == app2t2->mm != NULL
> (4) app1 is in memcg1 and app2 is in memcg2
>
> and sequence is
>
> (1) app1t2 triggers pagefault_out_of_memory()
> (2) app1t2 calls mem_cgroup_out_of_memory() via mem_cgroup_oom_synchronize()
> (3) oom_scan_process_thread() selects app1 as an OOM victim process
> (4) find_lock_task_mm() selects app1t1 as an OOM victim thread
> (5) app1t1 gets TIF_MEMDIE

OK, so we have a victim in memcg1, and app1t2 will get to do_exit right
away because we are in the page fault path...

> (6) app2t2 triggers pagefault_out_of_memory()
> (7) app2t2 calls mem_cgroup_out_of_memory() via mem_cgroup_oom_synchronize()
> (8) oom_scan_process_thread() selects app2 as an OOM victim process
> (9) find_lock_task_mm() selects app2t1 as an OOM victim thread
> (10) app2t1 gets TIF_MEMDIE
>
> .
>
> I'm talking about situation where app1t1 is blocked at down_write(&app1t1->mm->mmap_sem)
> because somebody else is already waiting at down_read(&app1t1->mm->mmap_sem) or is
> doing memory allocation between down_read(&app1t1->mm->mmap_sem) and
> up_read(&app1t1->mm->mmap_sem).

Unless we are under global OOM, this doesn't matter much, because the
allocation request should succeed at some point in time and memcg
charges are bypassed for tasks with pending fatal signals. So we can
make forward progress.
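
To illustrate the bypass, here is a minimal userspace sketch, not the
kernel code. The names model_try_charge(), struct task_model and
struct memcg_model are made up for this example, and accounting the
forced charge to the group is an assumption of the model; the only point
it shows is that a task with a pending fatal signal is not held back by
the memcg limit.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace model; none of these types are the kernel's. */
struct task_model {
	bool fatal_signal_pending;	/* models a pending SIGKILL */
	bool tif_memdie;		/* models the TIF_MEMDIE flag */
};

struct memcg_model {
	size_t usage;	/* pages currently charged */
	size_t limit;	/* hard limit in pages */
};

/*
 * Returns 0 if the charge goes through, -1 if the caller would have
 * to reclaim or enter the memcg OOM path.
 */
static int model_try_charge(struct memcg_model *cg,
			    const struct task_model *t, size_t nr_pages)
{
	/* A dying task skips the limit check so it can exit quickly. */
	if (t->fatal_signal_pending || t->tif_memdie) {
		cg->usage += nr_pages;	/* assumption: forced charge is still accounted */
		return 0;
	}

	/* Everybody else is subject to the hard limit. */
	if (cg->usage + nr_pages > cg->limit)
		return -1;

	cg->usage += nr_pages;
	return 0;
}
```

In this model a killed app1t1 that faults while memcg1 sits at its limit
still gets its charge, so it is not the memcg limit that keeps it from
reaching do_exit.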

> In this case, this [PATCH 5/5] helps the OOM reaper to reap app2t1->mm
> after giving up waiting for down_read(&app1t1->mm->mmap_sem) to
> succeed.

Why does that matter at all?

--
Michal Hocko
SUSE Labs