Re: System freezes after OOM
From: David Rientjes
Date: Thu Jul 14 2016 - 16:38:57 EST
On Thu, 14 Jul 2016, Michal Hocko wrote:
> > It prevents the whole system from livelocking due to an oom killed process
> > stalling forever waiting for mempool_alloc() to return. No other threads
> > may be oom killed while waiting for it to exit.
>
> But it is true that the patch has unintended side effect for any mempool
> allocation from the reclaim path (aka PF_MEMALLOC context).
If a PF_MEMALLOC context is consuming too much of the memory reserves, then
I'd argue that is a problem independent of using mempool_alloc(), since
mempool_alloc() can fall straight through to a call to the page allocator.
How does such a process guarantee that it cannot deplete memory reserves
with a simple call to the page allocator? Since nothing in the page
allocator is preventing complete depletion of reserves (it simply uses
ALLOC_NO_WATERMARKS), the caller in a PF_MEMALLOC context must be
responsible.
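To make the point concrete, here is a userspace toy model, not kernel code.
struct toy_zone, toy_alloc_page() and toy_drain() are hypothetical names made
up for illustration, and the watermark check is far simpler than the real one.
It only shows that once the check is skipped, nothing on the allocator side
stops a caller from taking the last page:

```c
#include <stdbool.h>

/* Hypothetical, simplified stand-in for a zone; not the kernel's struct. */
struct toy_zone {
	unsigned long free_pages;
	unsigned long min_watermark;	/* pages held back as reserves */
};

/*
 * Take one page from the toy zone.  With no_watermarks set (the analogue
 * of ALLOC_NO_WATERMARKS) the reserve check is skipped entirely.
 */
static bool toy_alloc_page(struct toy_zone *z, bool no_watermarks)
{
	if (z->free_pages == 0)
		return false;
	if (!no_watermarks && z->free_pages <= z->min_watermark)
		return false;
	z->free_pages--;
	return true;
}

/* Allocate until the first failure and return how many pages were taken. */
static unsigned long toy_drain(struct toy_zone *z, bool no_watermarks)
{
	unsigned long taken = 0;

	while (toy_alloc_page(z, no_watermarks))
		taken++;
	return taken;
}
```

In this model a normal caller stops at min_watermark, while a caller that
bypasses the watermarks can drain the zone to zero, so any limit has to come
from the caller itself.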
> So do you
> think we should rework your additional patch to be explicit about
> TIF_MEMDIE?
I'm not sure which additional patch you're referring to. The only patch I
proposed was commit f9054c70d28b, which stopped hundreds of machines from
timing out.
> Something like the following (not even compile tested for
> illustration). Tetsuo has properly pointed out that this doesn't work
> reliably for multithreaded processes, but put that aside for now as that
> needs a fix on a different layer. I believe we can fix that quite
> easily after recent/planned changes.
> ---
> diff --git a/mm/mempool.c b/mm/mempool.c
> index 8f65464da5de..ea26d75c8adf 100644
> --- a/mm/mempool.c
> +++ b/mm/mempool.c
> @@ -322,20 +322,20 @@ void *mempool_alloc(mempool_t *pool, gfp_t gfp_mask)
>
>  	might_sleep_if(gfp_mask & __GFP_DIRECT_RECLAIM);
>
> +	gfp_mask |= __GFP_NOMEMALLOC;	/* don't allocate emergency reserves */
>  	gfp_mask |= __GFP_NORETRY;	/* don't loop in __alloc_pages */
>  	gfp_mask |= __GFP_NOWARN;	/* failures are OK */
>
>  	gfp_temp = gfp_mask & ~(__GFP_DIRECT_RECLAIM|__GFP_IO);
>
>  repeat_alloc:
> -	if (likely(pool->curr_nr)) {
> -		/*
> -		 * Don't allocate from emergency reserves if there are
> -		 * elements available. This check is racy, but it will
> -		 * be rechecked each loop.
> -		 */
> -		gfp_temp |= __GFP_NOMEMALLOC;
> -	}
> +	/*
> +	 * Make sure that the OOM victim will get access to memory reserves
> +	 * properly if there are no objects in the pool to prevent from
> +	 * livelocks.
> +	 */
> +	if (!likely(pool->curr_nr) && test_thread_flag(TIF_MEMDIE))
> +		gfp_temp &= ~__GFP_NOMEMALLOC;
>
>  	element = pool->alloc(gfp_temp, pool->pool_data);
>  	if (likely(element != NULL))
> @@ -359,7 +359,7 @@ void *mempool_alloc(mempool_t *pool, gfp_t gfp_mask)
>  	 * We use gfp mask w/o direct reclaim or IO for the first round. If
>  	 * alloc failed with that and @pool was empty, retry immediately.
>  	 */
> -	if ((gfp_temp & ~__GFP_NOMEMALLOC) != gfp_mask) {
> +	if ((gfp_temp & __GFP_DIRECT_RECLAIM) != (gfp_mask & __GFP_DIRECT_RECLAIM)) {
>  		spin_unlock_irqrestore(&pool->lock, flags);
>  		gfp_temp = gfp_mask;
>  		goto repeat_alloc;
This is bogus and quite obviously leads to oom livelock. If a process holds
a mutex and calls mempool_alloc(), it can stall here in an oom condition
when there are no elements available on the mempool freelist, since
__GFP_WAIT is allowed in process context for mempool allocation. If the oom
victim contends on the same mutex, the system livelocks and the same bug
arises because the holder of the mutex loops forever. This is the exact
behavior that f9054c70d28b also fixes.
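For comparison, here is a userspace sketch of just the flag selection from
the hunk above, before and after the proposed change. TOY_GFP_NOMEMALLOC and
the toy_* helpers are hypothetical names with a made-up bit value, not the
kernel definitions. Whether the page allocator then actually dips into the
reserves also depends on the caller's context, which this sketch does not
model:

```c
#include <stdbool.h>

/* Made-up bit value for illustration; not the kernel's __GFP_NOMEMALLOC. */
#define TOY_GFP_NOMEMALLOC 0x1u

/* Flag selection in the lines the diff removes. */
static unsigned int toy_flags_before(unsigned int gfp, unsigned int curr_nr)
{
	if (curr_nr)
		gfp |= TOY_GFP_NOMEMALLOC;
	return gfp;
}

/* Flag selection in the lines the diff adds. */
static unsigned int toy_flags_proposed(unsigned int gfp, unsigned int curr_nr,
				       bool tif_memdie)
{
	gfp |= TOY_GFP_NOMEMALLOC;
	if (!curr_nr && tif_memdie)
		gfp &= ~TOY_GFP_NOMEMALLOC;
	return gfp;
}

/* True if the resulting mask forbids touching the emergency reserves. */
static bool toy_nomemalloc_set(unsigned int gfp)
{
	return (gfp & TOY_GFP_NOMEMALLOC) != 0;
}
```

The two versions differ only for an empty pool (curr_nr == 0) in a thread
without TIF_MEMDIE: the removed code leaves the flag clear, the proposed code
leaves it set. That is the case of the mutex holder described above.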
These aren't hypothetical situations: the patch stopped hundreds of machines
from regularly timing out. The fundamental reason is that mempool_alloc()
must not loop forever in process context. That matters both when the
allocator is itself an oom victim and when the oom victim is blocked by
an allocator. mempool_alloc() must guarantee forward progress in such a
context.
The end result is that when in PF_MEMALLOC context, allocators must be
responsible and not deplete all memory reserves.