Re: [PATCH] mm,oom: Use timeout based back off.

From: David Rientjes
Date: Mon Oct 22 2018 - 17:11:15 EST


On Sat, 20 Oct 2018, Tetsuo Handa wrote:

> This patch changes the OOM killer to wait for either
>
> (A) __mmput() of the OOM victim's mm completes
>
> or
>
> (B) the OOM reaper gives up waiting for (A) because memory pages
> used by the OOM victim's mm did not decrease for one second
>
> in order to mitigate at least three problems
>
> (1) an OOM victim needlessly selects the next OOM victim if the OOM-killed
> processes are using clone(CLONE_VM) without CLONE_THREAD because
> task_will_free_mem(current) in out_of_memory() returns false when
> MMF_OOM_SKIP was set before remaining OOM-killed processes reach
> out_of_memory().
>
> (2) a memcg OOM event needlessly selects the next OOM victim because we
> are assuming that the OOM reaper can reclaim the majority of the OOM
> victim's mm, but sometimes we need to wait for completion of
> free_pgtables() in exit_mmap() in order to reclaim enough memory.
>
> (3) a memcg OOM event from a multithreaded process by an unprivileged
> user can needlessly trigger flooding of "Out of memory and no
> killable processes..." and dump_header() messages because
> task_will_free_mem(current) in out_of_memory() returns false when
> MMF_OOM_SKIP was set before remaining OOM-killed threads reach
> out_of_memory().
>
> all caused by setting MMF_OOM_SKIP too early.
>
> Michal has proposed handing over the setting of MMF_OOM_SKIP to the OOM
> victim's exit path [1] in order to handle (2), but there was no feedback
> (except from me), and nobody knows whether it is really safe and worth
> constraining future changes. Not only can that attempt mitigate only a
> portion of exit_mmap() (rather than lasting until the OOM victim thread
> becomes invisible to the OOM killer), it also does not help at all for
> (1) and (3) because __mmput() cannot be called.
>
> I have proposed many patches which mitigate (1) and (3) without using a
> timeout based approach, but Michal is rejecting them and wants to address
> the root cause that MMF_OOM_SKIP is set too early. And nobody (including
> Michal) has time to make the OOM reaper reclaim more memory (including
> mlock()ed and shared memory, and mmap_sem contention) before setting
> MMF_OOM_SKIP. We are deadlocked there.
>
> Michal has been refusing a timeout based approach, but I don't think we
> have to fray our nerves over the possibility of overlooking races/bugs
> just because Michal does not want to use a timeout.
> I believe that timeout based back off is the only approach we can use
> for now.
>
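
For readers following the thread, the wait condition in the quoted
description, (A) or (B), can be sketched in user-space C. This is only an
illustration under stated assumptions, not code from the patch: the
function oom_wait_outcome(), the enum names, the "one sample per tick"
model, and the use of a zero page count to stand in for "__mmput()
completed" are all hypothetical. With one tick per 100ms, a stall_limit
of 10 would correspond to the one-second give-up in (B).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical outcome names for this sketch; not kernel identifiers. */
enum wait_result {
	WAIT_MMPUT_DONE,	/* (A): the victim's __mmput() completed */
	WAIT_GAVE_UP,		/* (B): no decrease for stall_limit ticks */
	WAIT_STILL_PENDING	/* samples ran out before (A) or (B) */
};

/*
 * pages[i] is the victim mm's page count sampled at tick i.
 * Assumption: a sample of 0 models completion of __mmput().
 * The stall counter resets whenever a sample is lower than the
 * previous one, so only consecutive ticks without a decrease count.
 */
enum wait_result oom_wait_outcome(const unsigned long *pages, size_t n,
				  unsigned int stall_limit)
{
	unsigned long last;
	unsigned int stalled = 0;
	size_t i;

	assert(stall_limit > 0);
	if (n == 0)
		return WAIT_STILL_PENDING;
	if (pages[0] == 0)
		return WAIT_MMPUT_DONE;

	last = pages[0];
	for (i = 1; i < n; i++) {
		if (pages[i] == 0)
			return WAIT_MMPUT_DONE;		/* condition (A) */
		if (pages[i] < last)
			stalled = 0;			/* progress was made */
		else if (++stalled >= stall_limit)
			return WAIT_GAVE_UP;		/* condition (B) */
		last = pages[i];
	}
	return WAIT_STILL_PENDING;
}
```

In this model the caller would hold off selecting another victim while
the result is WAIT_STILL_PENDING, and proceed only on one of the other
two outcomes.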

I've proposed patches that have been running for months in a production
environment that make the oom killer useful without serially killing many
processes unnecessarily. At this point, it is *much* easier to just fork
the oom killer logic rather than continue to invest time into fixing it in
Linux. That's unfortunate because I'm sure you realize how problematic
the current implementation is, how abusive it is, and have seen its
effects yourself. I admire your persistence in trying to fix the issues
surrounding the oom killer, but have come to the conclusion that forking
it is a much better use of time.