Re: [RFC PATCH 2/2] xfs: map KM_MAYFAIL to __GFP_RETRY_HARD

From: Michal Hocko
Date: Tue Jun 21 2016 - 13:01:13 EST


On Tue 21-06-16 00:22:49, Johannes Weiner wrote:
[...]
> As for changing the default - remember that we currently warn about
> allocation failures as if they were bugs, unless they are explicitely
> allocated with the __GFP_NOWARN flag. We can assume that the current
> __GFP_NOWARN sites are 1) commonly failing but 2) prefer to fall back
> rather than incurring latency (otherwise they would have added the
> __GFP_REPEAT flag). These sites would be a good list of candidates to
> annotate with __GFP_NORETRY.

This sounds like a good idea at first sight but a brief git grep shows
that many of them are just trying to silence the warning from non
sleeping allocations which are quite likely to fail. This wouldn't
be hard to filter out so we can ignore them.

Then there are things like 8be04b9374e5 ("treewide: Add __GFP_NOWARN
to k.alloc calls with v.alloc fallbacks") where the flag is added to
many places with vmalloc fallbacks. Do we want to weaken them in
favor of the vmalloc in general (note that it is not clear from the size
whether they are costly or !costly)?

I have looked at some random others and they are adding the flag without
any explanation so it is not really clear what was the motivation.

To me it seems like the flag is used quite randomly. I suspect there are
many places which do not have that flag just because nobody bothered to
report the allocation failure which is hard to reproduce.

> If we made __GFP_REPEAT then the default,
> the sites that would then try harder are the same sites that would now
> emit page allocation failure warnings. These are rare, and the only
> times I have seen them is under enough load that latency is shot to
> hell anyway. So I'm not really convinced by the regression argument.

You do not need to be under a heavy load to fail those allocations. It
is sufficient to have the memory fragmented which might be just a matter
of time. I am worried that we have hard to examine number of allocation
requests that might change the overall system behavior because they
might trigger more reclaim/swap and the source of the behavior change
wouldn't be quite obvious. On the other hand we already have some places
already annotated to require a more effort which is the reason I would
find it better to follow up with that.

High order allocations can be really expensive and the current behavior
with the allocation warning has an advantage that we can see the failure
mode and get a bug report with the exact trace (hopefully) without too
much of a background interference. Then the subsystem familiar person
can judge whether that particular allocation is worth more effort or
different fallback.

I am not saying that changing the default behavior for costly
allocations is a no go. I just feel it is too risky and it would be
better to use "override the default because I know what I am doing"
flag. The __GFP_RETRY_HARD might be a terrible name and a better name
would cause less confusion (__GFP_RETRY_MAYFAIL?).

> But that would *actually* clean up the flags, not make them even more
> confusing:
>
> Allocations that can't ever handle failure would use __GFP_NOFAIL.
>
> Callers like XFS would use __GFP_MAYFAIL specifically to disable the
> implicit __GFP_NOFAIL of !costly allocations.
>
> Callers that would prefer falling back over killing and looping would
> use __GFP_NORETRY.
>
> Wouldn't that cover all usecases and be much more intuitive, both in
> the default behavior as well as in the names of the flags?

How do we describe kcompacd vs. page fault THP allocations? We do not
want to cause a lot of reclaim for those but we can wait for compaction
for the first while we would prefer not to for the later.

Thanks!
--
Michal Hocko
SUSE Labs