Re: [PATCH 4/3] mm, oom: drop the last allocation attempt before out_of_memory

From: Tetsuo Handa
Date: Fri Jan 29 2016 - 05:39:38 EST


Johannes Weiner wrote:
> On Thu, Jan 28, 2016 at 03:19:08PM -0800, David Rientjes wrote:
> > On Thu, 28 Jan 2016, Johannes Weiner wrote:
> >
> > > The check has to happen while holding the OOM lock, otherwise we'll
> > > end up killing much more than necessary when there are many racing
> > > allocations.
> > >
> >
> > Right, we need to try with ALLOC_WMARK_HIGH after oom_lock has been
> > acquired.
> >
> > The situation is still somewhat fragile, however, but I think it's
> > tangential to this patch series. If the ALLOC_WMARK_HIGH allocation fails
> > because an oom victim hasn't freed its memory yet, and then the TIF_MEMDIE
> > thread isn't visible during the oom killer's tasklist scan because it has
> > exited, we still end up killing more than we should. The likelihood of
> > this happening grows with the length of the tasklist.
> >
> > Perhaps we should try testing watermarks after a victim has been selected
> > and immediately before killing? (Aside: we actually carry an internal
> > patch to test mem_cgroup_margin() in the memcg oom path after selecting a
> > victim because we have been hit with this before in the memcg path.)

Yes. Moving the final test to after an OOM victim has been selected can reduce
the possibility of killing more OOM victims than we need. But unfortunately, it
is likely that memory becomes available (i.e. get_page_from_freelist() starts to
succeed) while dump_header() is printing the OOM messages, because printk() is
a slow operation compared to selecting a victim. This happens long after the
moment the victim cleared TIF_MEMDIE.

We can avoid killing more OOM victims than we need if we move the final test to
after the OOM messages are printed, but then we can't avoid printing the OOM
messages even when we end up not killing a victim. Maybe this is not a problem
if we do

pr_err("But did not kill any process ...")

instead of

do_send_sig_info(SIGKILL);
mark_oom_victim();
pr_err("Killed process %d (%s) ...")

when the final test succeeds.
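
As a rough illustration (just a sketch against no particular tree;
oom_memory_became_available() is a hypothetical helper that would redo the
failed allocation with ALLOC_WMARK_HIGH while oom_lock is still held), the tail
of oom_kill_process() would become something like:

	/* dump_header() has already printed the OOM report at this point. */
	if (oom_memory_became_available(oc)) {
		/* A victim was selected, but we do not kill it. */
		pr_err("Out of memory, but did not kill any process\n");
		return;
	}

	mark_oom_victim(victim);
	do_send_sig_info(SIGKILL, SEND_SIG_FORCED, victim, true);
	pr_err("Killed process %d (%s) ...\n",
	       task_pid_nr(victim), victim->comm);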

> >
> > I would think that retrying with ALLOC_WMARK_HIGH would be enough memory
> > to deem that we aren't going to immediately reenter an oom condition so
> > the deferred killing is a waste of time.
> >
> > The downside is how sloppy this would be because it's blurring the line
> > between oom killer and page allocator. We'd need the oom killer to return
> > the selected victim to the page allocator, try the allocation, and then
> > call oom_kill_process() if necessary.

I assumed that Michal wants to preserve the boundary between the OOM killer
and the page allocator. Therefore, I proposed a patch
( http://lkml.kernel.org/r/201512291559.HGA46749.VFOFSOHLMtFJQO@xxxxxxxxxxxxxxxxxxx )
which tries to handle this without returning a victim to the page allocator and
without depending on TIF_MEMDIE or oom_victims.

>
> https://lkml.org/lkml/2015/3/25/40
>
> We could have out_of_memory() wait until the number of outstanding OOM
> victims drops to 0. Then __alloc_pages_may_oom() doesn't relinquish
> the lock until its kill has been finalized:
>
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 914451a..4dc5b9d 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -892,7 +892,9 @@ bool out_of_memory(struct oom_control *oc)
> * Give the killed process a good chance to exit before trying
> * to allocate memory again.
> */
> - schedule_timeout_killable(1);
> + if (!test_thread_flag(TIF_MEMDIE))
> + wait_event_timeout(oom_victims_wait,
> + !atomic_read(&oom_victims), HZ);
> }
> return true;
> }
>

oom_victims dropping to 0 does not mean that memory became available (i.e.
that get_page_from_freelist() will succeed). I think this patch needs some
more effort to reduce the possibility of killing more OOM victims than we
need.
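
For example (only a sketch; oom_memory_became_available() is a hypothetical
helper that would redo the failed allocation with ALLOC_WMARK_HIGH while
oom_lock is still held), the wait would better be terminated by the condition
we actually care about:

	if (!test_thread_flag(TIF_MEMDIE)) {
		unsigned long end = jiffies + HZ;

		/*
		 * oom_victims dropping to 0 is only a hint; what we really
		 * want to know is whether the allocation can succeed now.
		 */
		while (!oom_memory_became_available(oc) &&
		       time_before(jiffies, end))
			schedule_timeout_killable(1);
	}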