Re: System freezes after OOM
From: David Rientjes
Date: Fri Jul 15 2016 - 17:47:47 EST

On Fri, 15 Jul 2016, Michal Hocko wrote:
> > If PF_MEMALLOC context is allocating too much memory reserves, then I'd
> > argue that is a problem independent of using mempool_alloc() since
> > mempool_alloc() can evolve directly into a call to the page allocator.
> > How does such a process guarantee that it cannot deplete memory reserves
> > with a simple call to the page allocator? Since nothing in the page
> > allocator is preventing complete depletion of reserves (it simply uses
> > ALLOC_NO_WATERMARKS), the caller in a PF_MEMALLOC context must be
> > responsible.
>
> Well, the reclaim throttles the allocation request if there are too many
> pages under writeback and that should slow down the allocation rate and
> give the writeback some time to complete. But yes you are right there is
> nothing to prevent memory depletion and it is really hard to come
> up with something with no-fail semantics.
>
If the reclaimer is allocating memory, it can fully deplete memory
reserves with ALLOC_NO_WATERMARKS without any direct reclaim itself and
we're relying on kswapd entirely if nothing else is reclaiming in parallel
(and depleting memory reserves itself in parallel). It's a difficult
problem because memory reserves can be very small and concurrent
PF_MEMALLOC allocation contexts can lead to quick depletion. I don't
think it's a throttling problem per se; it's more of a scalability problem.
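
To make the depletion argument concrete, here is a toy userspace model, not kernel code. It assumes a single zone where ordinary allocations must leave min_wmark pages free, while a caller that ignores watermarks (standing in for ALLOC_NO_WATERMARKS) may take every page. The names Zone, try_alloc and run and all the numbers are hypothetical; the real watermark check is considerably more involved.

```python
from dataclasses import dataclass


@dataclass
class Zone:
    free_pages: int  # pages currently free in the zone
    min_wmark: int   # pages at or below this mark are treated as reserves


def try_alloc(zone: Zone, no_watermarks: bool) -> bool:
    """Take one page. Ordinary callers must leave min_wmark pages free."""
    floor = 0 if no_watermarks else zone.min_wmark
    if zone.free_pages > floor:
        zone.free_pages -= 1
        return True
    return False


def run(contexts: int, pages_each: int, no_watermarks: bool) -> Zone:
    """Let `contexts` allocators each request `pages_each` pages."""
    zone = Zone(free_pages=64, min_wmark=16)  # arbitrary toy sizes
    # Round-robin stands in for allocators running concurrently.
    for _ in range(pages_each):
        for _ in range(contexts):
            try_alloc(zone, no_watermarks)
    return zone
```

In this model, run(4, 32, False) stops at the watermark and leaves 16 pages free, while run(4, 32, True) drains the zone to 0. A single context with the same per-context demand, run(1, 32, True), still leaves 32 pages, which is the sense in which depletion is driven by the number of concurrent contexts rather than by any one caller.
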

> I would like to separate TIF_MEMDIE as an access to memory reserves from
> the oom selection semantic. And let me repeat your proposed patch
> has undesirable side effects so we should think about a way to deal
> with those cases. It might work for your setups but it shouldn't break
> others at the same time. OOM situation is quite unlikely compared to
> simple memory depletion by writing to a swap...
>
I haven't proposed any patch, so I'm not sure what this refers to. There are
two fundamental ways to go about it: (1) ensure mempool_alloc() can make
forward progress (whether that's by way of gfp flags or access to memory
reserves, which may depend on the process context such as PF_MEMALLOC) or
(2) rely on an implementation detail of mempools to never access memory
reserves, although such access is shown to not livelock systems on 4.7 and earlier
kernels, and instead rely on users of the same mempool to return elements
to the freelist in all contexts, including oom contexts. The mempool
implementation itself shouldn't need any oom awareness, that should be a
page allocator issue.

If the mempool user can guarantee that elements will be returned to the
freelist in all contexts, we could relax the restriction that mempool
users cannot use __GFP_NOMEMALLOC and leave it up to them to prevent
access to memory reserves but only in situations where forward progress
can be guaranteed. That's a simple change and doesn't change mempool or
page allocator behavior for everyone, but rather only for those that
opt-in. I think this is the way the dm folks should proceed, but let's
not encode any special restriction on access to memory reserves as an
implementation detail to mempools, specifically for processes that have
PF_MEMALLOC set.
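
A rough sketch of the opt-in idea, again as a toy userspace model rather than the real mempool code. It assumes the pool first tries the underlying allocator and falls back to its preallocated elements, and that only a caller passing nomemalloc=True is kept away from reserves. ToyAllocator, ToyMempool and the boolean parameters are hypothetical stand-ins for the page allocator, a mempool, __GFP_NOMEMALLOC and PF_MEMALLOC.

```python
from typing import Optional


class ToyAllocator:
    """Stand-in for the page allocator: ordinary pages plus a small reserve."""

    def __init__(self, free_pages: int, reserve_pages: int) -> None:
        self.free_pages = free_pages
        self.reserve_pages = reserve_pages

    def alloc(self, nomemalloc: bool, memalloc_context: bool) -> Optional[str]:
        if self.free_pages > 0:
            self.free_pages -= 1
            return "page"
        # Reserves are only reachable from a PF_MEMALLOC-like context,
        # and only if the caller did not opt out with nomemalloc.
        if memalloc_context and not nomemalloc and self.reserve_pages > 0:
            self.reserve_pages -= 1
            return "reserve"
        return None


class ToyMempool:
    """Stand-in for a mempool holding min_nr preallocated elements."""

    def __init__(self, allocator: ToyAllocator, min_nr: int) -> None:
        self.allocator = allocator
        self.elements = min_nr

    def alloc(self, nomemalloc: bool = False,
              memalloc_context: bool = False) -> Optional[str]:
        got = self.allocator.alloc(nomemalloc, memalloc_context)
        if got is not None:
            return got
        if self.elements > 0:
            self.elements -= 1
            return "element"
        # A real mempool would sleep here until an element is freed;
        # returning None marks the point where forward progress depends
        # on other users returning elements to the freelist.
        return None
```

With no ordinary pages left, one reserve page and one pool element, a caller that opts in (nomemalloc=True) gets the pool element and a second such call returns None, so it makes progress only if some other user frees an element. A caller that does not opt in gets the reserve page instead.
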

> Just to make sure I understand properly:
> Task A                          Task B              Task C
> current->flags = PF_MEMALLOC
> mutex_lock(&foo)                mutex_lock(&foo)    out_of_memory
> mempool_alloc()                                       select_bad_process = Task B
>   alloc_pages(__GFP_NOMEMALLOC)
>
Not sure who is grabbing foo first with this, I assume Task A does and Task B
is contending. If that's the case, then yes, this is the dm_request() oom
livelock that went unresolved for two hours on our machines and timed
them all out. This is a swapless environment that heavily oversubscribes
the machine, so not everybody's use case, but it needs to be resolved.

> That would be really unfortunate but it doesn't really differ much from
> other oom deadlocks when the victim is stuck behind an allocating task.

I'm well aware of many of the system oom and memcg oom livelocks from
experience, unfortunately :)