Re: [rfc patch] mm, oom: fix unnecessary killing of additional processes
From: David Rientjes
Date: Wed May 30 2018 - 17:06:59 EST
On Mon, 28 May 2018, Michal Hocko wrote:
> > That's not sufficient since the oom reaper is also not able to oom reap if
> > the mm has blockable mmu notifiers or all memory is shared filebacked
> > memory, so it immediately sets MMF_OOM_SKIP and additional processes are
> > oom killed.
>
> Could you be more specific with a real world example where that is the
> case? I mean the full address space of non-reclaimable file backed
> memory where waiting some more would help? Blockable mmu notifiers are
> a PITA for sure. I wish we could have a better way to deal with them.
> Maybe we can tell them we are in the non-blockable context and have them
> release as much as possible. Still something that a random timeout
> wouldn't help I am afraid.
>
It's not a random timeout, it's sufficiently long such that we don't oom
kill several processes needlessly in the very rare case where oom livelock
would actually prevent the original victim from exiting. The oom reaper
processing an mm, finding everything to be mlocked, and immediately
setting MMF_OOM_SKIP is inappropriate. This is rather trivial to reproduce
for a large memory hogging process that mlocks all of its memory; we
consistently see spurious and unnecessary oom kills simply because the oom
reaper has set MMF_OOM_SKIP very early.

This patch introduces a "give up" period such that the oom reaper is still
allowed to do its good work, but it only gives up on the victim making
forward progress after a substantial period of time has passed. I
would understand the objection if oom livelock where the victim cannot
make forward progress were commonplace, but in the interest of not killing
several processes needlessly every time a large mlocked process is
targeted, I think it compels a waiting period.
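
As a rough illustration of the "give up" period described above, here is a
small userspace model. It is not code from the patch: struct victim,
reaper_visit() and GIVE_UP_TICKS are hypothetical names, and the 10-tick
value only stands in for the waiting period. The model assumes the skip flag
is set either because the victim finished exiting or because the waiting
period ran out, never simply because nothing could be reaped.

```c
#include <stdbool.h>

/* Assumed length of the waiting period, in abstract ticks. */
#define GIVE_UP_TICKS 10UL

/* Hypothetical model of an oom victim; not a kernel structure. */
struct victim {
	bool exited;             /* victim has torn down its address space */
	unsigned long queued_at; /* tick at which it was handed to the reaper */
	bool oom_skip;           /* models MMF_OOM_SKIP */
};

/*
 * Model of one reaper visit at tick 'now'. The flag is set only once the
 * victim has exited or the give-up period has elapsed, so an unreapable
 * (for example fully mlocked) victim is not skipped right away.
 * Returns the current value of the flag.
 */
static bool reaper_visit(struct victim *v, unsigned long now)
{
	if (v->exited || now - v->queued_at >= GIVE_UP_TICKS)
		v->oom_skip = true;
	return v->oom_skip;
}
```

In this model a victim queued at tick 0 that never exits keeps blocking
further oom kills until tick 10, while one that exits at tick 3 releases the
oom killer at the next visit.
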
> Trying to reap a different oom victim when the current one is not making
> progress during the lock contention is certainly something that makes
> sense. It has been proposed in the past and we just gave it up because
> it was more complex. Do you have any specific example when this would
> help to justify the additional complexity?
>
I'm not sure how you're defining complexity; the patch adds ~30 lines of
code. It prevents processes from needlessly being oom killed when oom
reaping is largely unsuccessful and the victim has not yet finished
free_pgtables(), and it also allows the oom reaper to operate on multiple
mm's instead of processing one at a time. Obviously if there is a delay
before MMF_OOM_SKIP is set it requires that the oom reaper be able to
process other mm's, otherwise we stall needlessly for 10s. Operating on
multiple mm's in a linked list while waiting for victims to exit during a
timeout period is thus very much needed; the delay wouldn't make sense
without it.
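
The point about walking several mm's while a timeout is pending can be
sketched in userspace as follows. This is only a model under stated
assumptions, not the patch: struct queued_mm, reaper_pass() and
REAP_TIMEOUT_TICKS are hypothetical names, and the list is a plain singly
linked list rather than a kernel list_head.

```c
#include <stdbool.h>
#include <stddef.h>

/* Assumed timeout in abstract ticks; stands in for the 10s mentioned above. */
#define REAP_TIMEOUT_TICKS 10UL

/* Hypothetical queue entry, one per oom victim mm. */
struct queued_mm {
	int id;
	bool victim_exited;      /* victim already freed its memory itself */
	unsigned long queued_at; /* tick at which the mm was queued */
	bool oom_skip;           /* models MMF_OOM_SKIP */
	struct queued_mm *next;
};

/*
 * One pass over every queued mm at tick 'now'. Entries whose victim has
 * exited or whose timeout has expired are flagged and unlinked; all other
 * entries stay queued for a later pass, so one slow victim never stalls
 * the rest. Returns the number of entries retired in this pass.
 */
static int reaper_pass(struct queued_mm **head, unsigned long now)
{
	struct queued_mm **link = head;
	int retired = 0;

	while (*link) {
		struct queued_mm *mm = *link;

		if (mm->victim_exited ||
		    now - mm->queued_at >= REAP_TIMEOUT_TICKS) {
			mm->oom_skip = true;
			*link = mm->next; /* unlink without advancing */
			mm->next = NULL;
			retired++;
		} else {
			link = &mm->next;
		}
	}
	return retired;
}
```

With two entries queued, a pass retires whichever victim has exited and
leaves the other in place until its own timeout expires.
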
> > But also note that even if oom reaping is possible, in the presence of an
> > antagonist that continues to allocate memory, that it is possible to oom
> > kill additional victims unnecessarily if we aren't able to complete
> > free_pgtables() in exit_mmap() of the original victim.
>
> If there is an unbounded source of allocations then we are screwed no matter
> what. We just hope that the allocator will get noticed by the oom killer
> and it will be stopped.
>
It's not unbounded, it's just an allocator that acts as an antagonist. At
the risk of being overly verbose, for system or memcg oom conditions: a
large mlocked process is oom killed, other processes continue to
allocate/charge, the oom reaper almost immediately sets MMF_OOM_SKIP
without being able to free any memory, and the other important processes
are needlessly oom killed before the original victim can reach
exit_mmap(). This happens a _lot_.
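
The sequence above can be put into a toy model to show why the timing of
the skip flag matters. Everything here is an assumption made for
illustration: count_kills() is a hypothetical helper, time is in abstract
ticks, the antagonist is assumed to trigger an oom condition on every tick
until the original victim exits, and every later victim is assumed to be as
unreapable as the first (a worst case).

```c
/*
 * Count oom kills in a toy timeline.
 *
 * exit_tick:   tick at which the original victim frees its memory
 * grace_ticks: delay before a victim gets the skip flag (0 = immediately)
 * ticks:       length of the simulated timeline
 *
 * The oom killer may only pick a new victim once the latest victim has
 * been flagged; the oom condition ends when the original victim exits.
 */
static int count_kills(int exit_tick, int grace_ticks, int ticks)
{
	int kills = 1;             /* the original victim, killed at tick 0 */
	int skip_at = grace_ticks; /* when the latest victim gets flagged */

	for (int t = 1; t <= ticks; t++) {
		if (t >= exit_tick)
			break;     /* memory freed, no further kills needed */
		if (t >= skip_at) {
			kills++;   /* another process is oom killed */
			skip_at = t + grace_ticks;
		}
	}
	return kills;
}
```

Under these assumptions, a victim that needs 3 ticks to exit costs two
extra kills when the flag is set immediately and none when a 10-tick
waiting period is applied.
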
I'm open to hearing any other suggestions that you have other than waiting
some time period before MMF_OOM_SKIP gets set to solve this problem.