Re: [patch 00/12] mm: page_alloc: improve OOM mechanism and policy

From: Dave Chinner
Date: Thu Mar 26 2015 - 15:58:43 EST


On Wed, Mar 25, 2015 at 02:17:04AM -0400, Johannes Weiner wrote:
> Hi everybody,
>
> in the recent past we've had several reports and discussions on how to
> deal with allocations hanging in the allocator upon OOM.
>
> The idea of this series is mainly to make the mechanism of detecting
> OOM situations reliable enough that we can be confident about failing
> allocations, and then leave the fallback strategy to the caller rather
> than looping forever in the allocator.
>
> The other part is trying to reduce the __GFP_NOFAIL deadlock rate, at
> least for the short term while we don't have a reservation system yet.

A valid goal, but I think this series goes about it the wrong way.
i.e. it forces us to use __GFP_NOFAIL rather than providing us a
valid fallback mechanism to access reserves.

....

> mm: page_alloc: emergency reserve access for __GFP_NOFAIL allocations
>
> An exacerbation of the victim-stuck-behind-allocation scenario are
> __GFP_NOFAIL allocations, because they will actually deadlock. To
> avoid this, or try to, give __GFP_NOFAIL allocations access to not
> just the OOM reserves but also the system's emergency reserves.
>
> This is basically a poor man's reservation system, which could or
> should be replaced later on with an explicit reservation system that
> e.g. filesystems have control over for use by transactions.
>
> It's obviously not bulletproof and might still lock up, but it should
> greatly reduce the likelihood. AFAIK Andrea, whose idea this was, has
> been using this successfully for some time.

So, if we want GFP_NOFS allocations to be able to dip into a
small extra reservation to make progress at ENOMEM, we have to
use __GFP_NOFAIL because looping ourselves won't allow use of these
extra reserves?

> mm: page_alloc: do not lock up GFP_NOFS allocations upon OOM
>
> Another hang that was reported was from NOFS allocations. The trouble
> with these is that they can't issue or wait for writeback during page
> reclaim, and so we don't want to OOM kill on their behalf. However,
> with such restrictions on making progress, they are prone to hangs.

And because this effectively means GFP_NOFS allocations are
going to fail much more often, we're either going to have to loop
ourselves or use __GFP_NOFAIL...

> This patch makes NOFS allocations fail if reclaim can't free anything.
>
> It would be good if the filesystem people could weigh in on whether
> they can deal with failing GFP_NOFS allocations, or annotate the
> exceptions with __GFP_NOFAIL etc. It could well be that a middle
> ground is required that allows using the OOM killer before giving up.

... which looks to me like a catch-22 situation for us: We
have reserves, but callers need to use __GFP_NOFAIL to access them.
GFP_NOFS is going to fail more often, so callers need to handle that
in some way, either by looping or erroring out.

But if we loop manually because we try to handle ENOMEM situations
gracefully (e.g. try a number of times before erroring out) we can't
dip into the reserves because the only semantics being provided are
"try-once-without-reserves" or "try-forever-with-reserves". i.e.
what we actually need here is "try-once-with-reserves" semantics so
that we can make progress after a failing GFP_NOFS
"try-once-without-reserves" allocation.

IOWs, __GFP_NOFAIL is not the answer here - it's GFP_NOFS |
__GFP_USE_RESERVE that we need on the failure fallback path. Which,
incidentally, is trivial to add to the XFS allocation code. Indeed,
I'll request that you test series like this on metadata intensive
filesystem workloads on XFS under memory stress and quantify how
many new "XFS: possible deadlock in memory allocation" warnings are
emitted. If the patch set floods the system with such warnings, then
it means the proposed fallback for "caller handles allocation
failure" is not making progress.
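
To make the "try-once-with-reserves" fallback concrete, here is a minimal
userspace sketch of the semantics argued for above. It is a toy model, not
kernel code: struct model_pool, the MODEL_* flags, model_alloc_once() and
fs_alloc_with_fallback() are all hypothetical names invented for
illustration, and __GFP_USE_RESERVE itself is only a proposal in this mail.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical flags for this toy model; these are NOT the kernel's gfp bits. */
#define MODEL_NOFS        0x1u  /* carried for illustration only, no effect here */
#define MODEL_USE_RESERVE 0x2u  /* this single attempt may dip into the reserve */

/* Toy stand-in for the page allocator's state. */
struct model_pool {
    int free_pages;     /* pages available to ordinary allocations */
    int reserve_pages;  /* emergency reserve, off limits by default */
};

/* One allocation attempt; never loops. Returns true on success. */
bool model_alloc_once(struct model_pool *p, unsigned int flags)
{
    assert(p != NULL);

    if (p->free_pages > 0) {
        p->free_pages--;
        return true;
    }
    /* Only callers that explicitly ask for it may touch the reserve. */
    if ((flags & MODEL_USE_RESERVE) && p->reserve_pages > 0) {
        p->reserve_pages--;
        return true;
    }
    return false;
}

/*
 * Caller-side fallback: a bounded number of "try-once-without-reserves"
 * attempts, then exactly one "try-once-with-reserves" attempt, then give up.
 * Returns 0 on success and -1 (standing in for -ENOMEM) on failure.
 */
int fs_alloc_with_fallback(struct model_pool *p, int max_tries)
{
    for (int i = 0; i < max_tries; i++) {
        if (model_alloc_once(p, MODEL_NOFS))
            return 0;
    }
    if (model_alloc_once(p, MODEL_NOFS | MODEL_USE_RESERVE))
        return 0;
    return -1;
}
```

With a pool of { .free_pages = 0, .reserve_pages = 1 }, the first call to
fs_alloc_with_fallback() succeeds by consuming the reserve page and the
second one fails cleanly, so the caller gets to error out instead of
looping forever in the allocator.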

Cheers,

Dave.
--
Dave Chinner
david@xxxxxxxxxxxxx