Re: [v3 1/6] mm, oom: use oom_victims counter to synchronize oom victim selection

From: Roman Gushchin
Date: Thu Jun 22 2017 - 13:00:12 EST


On Thu, Jun 22, 2017 at 09:40:28AM +0900, Tetsuo Handa wrote:
> Roman Gushchin wrote:
> > --- a/mm/oom_kill.c
> > +++ b/mm/oom_kill.c
> > @@ -992,6 +992,13 @@ bool out_of_memory(struct oom_control *oc)
> > if (oom_killer_disabled)
> > return false;
> >
> > + /*
> > + * If there are oom victims in flight, we don't need to select
> > + * a new victim.
> > + */
> > + if (atomic_read(&oom_victims) > 0)
> > + return true;
> > +
> > if (!is_memcg_oom(oc)) {
> > blocking_notifier_call_chain(&oom_notify_list, 0, &freed);
> > if (freed > 0)
>
> Above in this patch and below in patch 5 are wrong.
>
> > --- a/mm/oom_kill.c
> > +++ b/mm/oom_kill.c
> > @@ -665,7 +672,13 @@ static void mark_oom_victim(struct task_struct *tsk)
> > * that TIF_MEMDIE tasks should be ignored.
> > */
> > __thaw_task(tsk);
> > - atomic_inc(&oom_victims);
> > +
> > + /*
> > + * If there are no oom victims in flight,
> > + * give the task an access to the memory reserves.
> > + */
> > + if (atomic_inc_return(&oom_victims) == 1)
> > + set_tsk_thread_flag(tsk, TIF_MEMDIE);
> > }
> >
> > /**
>
> The OOM reaper is not available for CONFIG_MMU=n kernels, and timeout based
> giveup is not permitted, but a multithreaded process might be selected as
> an OOM victim. Not setting TIF_MEMDIE on all threads sharing an OOM victim's
> mm increases the possibility of preventing some OOM victim thread from
> terminating (e.g. one of them cannot leave __alloc_pages_slowpath() while
> holding mmap_sem for write, because it is waiting for the TIF_MEMDIE thread
> to call exit_oom_victim() while the TIF_MEMDIE thread is in turn waiting for
> the thread holding mmap_sem for write).
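For illustration, the behaviour Tetsuo objects to can be modelled in userspace
(this is a hypothetical sketch, not kernel code: C11 <stdatomic.h> stands in
for the kernel's atomic_t, and a plain bool stands in for TIF_MEMDIE):

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Userspace stand-in for a thread of a multithreaded OOM victim. */
struct task {
	bool memdie;	/* models TIF_MEMDIE */
};

static atomic_int oom_victims;

/* Patch 5's logic: only the first victim in flight gets the flag. */
static void mark_oom_victim(struct task *tsk)
{
	if (atomic_fetch_add(&oom_victims, 1) == 0)
		tsk->memdie = true;
}
```

Running this over three threads sharing one mm, only the first caller ends up
with the flag set; the siblings never get access to memory reserves, which is
the CONFIG_MMU=n hang Tetsuo describes.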

I agree that CONFIG_MMU=n is a special case and the proposed approach can't
be used there directly. But can you please explain why you find the first chunk
wrong? Basically, it's exactly what we do now: we increment oom_victims for
every oom victim, and every victim decrements the counter on its exit path.
If there is at least one existing victim, we will select it again, so this is
just an optimization. Am I missing something? Why should we start a new victim
selection if there are processes that will likely quit and release memory soon?
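The optimization in the first chunk can be sketched in userspace as follows
(again a hypothetical model, not kernel code: victim selection and killing are
elided, and <stdatomic.h> replaces the kernel's atomic_t):

```c
#include <stdatomic.h>
#include <stdbool.h>

static atomic_int oom_victims;

/* Patch 1's logic: skip victim selection while a victim is in flight. */
static bool out_of_memory(void)
{
	if (atomic_load(&oom_victims) > 0)
		return true;

	/* ... select and kill a new victim (elided) ... */
	atomic_fetch_add(&oom_victims, 1);
	return true;
}

/* Called from the victim's exit path. */
static void exit_oom_victim(void)
{
	atomic_fetch_sub(&oom_victims, 1);
}
```

With this model, a second out_of_memory() call while a victim is still exiting
returns early without raising the counter; only after exit_oom_victim() drops
it to zero does the next call pick a fresh victim.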

Thanks!