Re: System freezes after OOM

From: Mikulas Patocka
Date: Fri Jul 15 2016 - 08:01:06 EST




On Fri, 15 Jul 2016, Michal Hocko wrote:

> On Thu 14-07-16 13:38:42, David Rientjes wrote:
> > On Thu, 14 Jul 2016, Michal Hocko wrote:
> >
> > > > It prevents the whole system from livelocking due to an oom killed process
> > > > stalling forever waiting for mempool_alloc() to return. No other threads
> > > > may be oom killed while waiting for it to exit.
> > >
> > > But it is true that the patch has unintended side effect for any mempool
> > > allocation from the reclaim path (aka PF_MEMALLOC context).
> >
> > If PF_MEMALLOC context is allocating too much memory reserves, then I'd
> > argue that is a problem independent of using mempool_alloc() since
> > mempool_alloc() can evolve directly into a call to the page allocator.
> > How does such a process guarantee that it cannot deplete memory reserves
> > with a simple call to the page allocator? Since nothing in the page
> > allocator is preventing complete depletion of reserves (it simply uses
> > ALLOC_NO_WATERMARKS), the caller in a PF_MEMALLOC context must be
> > responsible.

Well-written drivers should use mempools and not allocate pages directly
from the allocator. These drivers can proceed even if there is no memory
available.

Badly written drivers (loop block device; swapping to NFS) are prone to
deadlock anyway - there is no easy way to fix it.

> Well, the reclaim throttles the allocation request if there are too many
> pages under writeback and that should slow down the allocation rate and
> give the writeback some time to complete. But yes you are right there is
> nothing to prevent from memory depletion and it is really hard to come
> up with something with no fail semantic.
>
> Or do you have an idea how to throttle withou knowing how much memory
> will be actually consumed on the writeout path?
>
> > > So do you
> > > think we should rework your additional patch to be explicit about
> > > TIF_MEMDIE?
> >
> > Not sure which additional patch you're referring to, the only patch that I
> > proposed was commit f9054c70d28b which solved hundreds of machines from
> > timing out.
>
> I would like separate TIF_MEMDIE as an access to memory reserves from
> oom selection selection semantic. And let me repeat your proposed patch
> has a undesirable side effects so we should think about a way to deal
> with those cases. It might work for your setups but it shouldn't break
> others at the same time. OOM situation is quite unlikely compared to
> simple memory depletion by writing to a swap...
>
> > > Something like the following (not even compile tested for
> > > illustration). Tetsuo has properly pointed out that this doesn't work
> > > for multithreaded processes reliable but put that aside for now as that
> > > needs a fix on a different layer. I believe we can fix that quite
> > > easily after recent/planned changes.
> > > ---
> > > diff --git a/mm/mempool.c b/mm/mempool.c
> > > index 8f65464da5de..ea26d75c8adf 100644
> > > --- a/mm/mempool.c
> > > +++ b/mm/mempool.c
> > > @@ -322,20 +322,20 @@ void *mempool_alloc(mempool_t *pool, gfp_t gfp_mask)
> > >
> > > might_sleep_if(gfp_mask & __GFP_DIRECT_RECLAIM);
> > >
> > > + gfp_mask |= __GFP_NOMEMALLOC; /* don't allocate emergency reserves */
> > > gfp_mask |= __GFP_NORETRY; /* don't loop in __alloc_pages */
> > > gfp_mask |= __GFP_NOWARN; /* failures are OK */
> > >
> > > gfp_temp = gfp_mask & ~(__GFP_DIRECT_RECLAIM|__GFP_IO);
> > >
> > > repeat_alloc:
> > > - if (likely(pool->curr_nr)) {
> > > - /*
> > > - * Don't allocate from emergency reserves if there are
> > > - * elements available. This check is racy, but it will
> > > - * be rechecked each loop.
> > > - */
> > > - gfp_temp |= __GFP_NOMEMALLOC;
> > > - }
> > > + /*
> > > + * Make sure that the OOM victim will get access to memory reserves
> > > + * properly if there are no objects in the pool to prevent from
> > > + * livelocks.
> > > + */
> > > + if (!likely(pool->curr_nr) && test_thread_flag(TIF_MEMDIE))
> > > + gfp_temp &= ~__GFP_NOMEMALLOC;
> > >
> > > element = pool->alloc(gfp_temp, pool->pool_data);
> > > if (likely(element != NULL))
> > > @@ -359,7 +359,7 @@ void *mempool_alloc(mempool_t *pool, gfp_t gfp_mask)
> > > * We use gfp mask w/o direct reclaim or IO for the first round. If
> > > * alloc failed with that and @pool was empty, retry immediately.
> > > */
> > > - if ((gfp_temp & ~__GFP_NOMEMALLOC) != gfp_mask) {
> > > + if ((gfp_temp & __GFP_DIRECT_RECLAIM) != (gfp_mask & __GFP_DIRECT_RECLAIM)) {
> > > spin_unlock_irqrestore(&pool->lock, flags);
> > > gfp_temp = gfp_mask;
> > > goto repeat_alloc;
> >
> > This is bogus and quite obviously leads to oom livelock: if a process is
> > holding a mutex and does mempool_alloc(), since __GFP_WAIT is allowed in
> > process context for mempool allocation, it can stall here in an oom
> > condition if there are no elements available on the mempool freelist. If
> > the oom victim contends the same mutex, the system livelocks and the same
> > bug arises because the holder of the mutex loops forever. This is the
> > exact behavior that f9054c70d28b also fixes.
>
> Just to make sure I understand properly:
> Task A Task B Task C
> current->flags = PF_MEMALLOC
> mutex_lock(&foo) mutex_lock(&foo) out_of_memory
> mempool_alloc() select_bad__process = Task B
> alloc_pages(__GFP_NOMEMALLOC)
>
>
> That would be really unfortunate but it doesn't really differ much from
> other oom deadlocks when the victim is stuck behind an allocating task.
> This is a generic problem and our answer for that is the oom reaper
> which will tear down the address space of the victim asynchronously.
> Sure there is no guarantee it will free enough to get us unstuck because
> we are freeing only private unlocked memory but we rather fallback to
> another oom victim if the situation prevails even after the unmapping
> pass. So we shouldn't be stuck for ever.
>
> That being said should we rely for the mempool allocations the same as
> any other oom deadlock due to locks?

I think that mempool deadlock is really non-existent issue.

mempool for device mapper and most block device drivers guarantees forward
progress even if there is no memory available. It doesn't deadlock as long
as the underlying block device is processing requests. There is no reason
why should it deadlock when OOM killer is activated.

> > These aren't hypothetical situations, the patch fixed hundreds of machines
> > from regularly timing out. The fundamental reason is that mempool_alloc()
> > must not loop forever in process context: that is needed when the
> > allocator is either an oom victim itself or the oom victim is blocked by
> > an allocator. mempool_alloc() must guarantee forward progress in such a
> > context.
> >
> > The end result is that when in PF_MEMALLOC context, allocators must be
> > responsible and not deplete all memory reserves.
>
> How do you propose to guarantee that? You might have really complex IO
> setup and mempools have been the answer for guaranteeing forward progress
> for ages.
>
> --
> Michal Hocko
> SUSE Labs

Mikulas