Re: [PATCH 1/2 v2] mm: Allow small allocations to fail

From: Michal Hocko
Date: Wed Mar 18 2015 - 08:05:10 EST


On Wed 18-03-15 10:10:39, Vlastimil Babka wrote:
> On 03/17/2015 08:41 PM, Michal Hocko wrote:
> >On Tue 17-03-15 13:26:28, Johannes Weiner wrote:
> >>On Tue, Mar 17, 2015 at 03:17:29PM +0100, Michal Hocko wrote:
> >>>On Tue 17-03-15 09:29:26, Johannes Weiner wrote:
> >>>>On Tue, Mar 17, 2015 at 11:25:08AM +0100, Michal Hocko wrote:
> >>>>>On Mon 16-03-15 17:11:46, Johannes Weiner wrote:
> >>>>>>A sysctl certainly doesn't sound appropriate to me because this is not
> >>>>>>a tunable that we expect people to set according to their usecase. We
> >>>>>>expect our model to work for *everybody*. A boot flag would be
> >>>>>>marginally better but it still reeks too much of tunable.
> >>>>>
> >>>>>I am OK with a boot option as well if the sysctl is considered
> >>>>>inappropriate. It is less flexible though. Consider a regression testing
> >>>>>where the same load is run 2 times once with failing allocations and
> >>>>>once without it. Why should we force the tester to do a reboot cycle?
> >>>>
> >>>>Because we can get rid of the Kconfig more easily once we transitioned.
> >>>
> >>>How? We might be forced to keep the original behavior _for ever_. I do
> >>>not see any difference between runtime, boottime or compiletime option.
> >>>Except for the flexibility which is different for each one of course. We
> >>>can argue about which one is the most appropriate of course but I feel
> >>>strongly we cannot go and change the semantic right away.
> >>
> >>Sure, why not add another slab allocator while you're at it. How many
> >>times do we have to repeat the same mistakes? If the old model sucks,
> >>then it needs to be fixed or replaced. Don't just offer another one
> >>that sucks in different ways and ask the user to pick their poison,
> >>with a promise that we might improve the newer model until it's
> >>suitable to ditch the old one.
> >>
> >>This is nothing more than us failing and giving up trying to actually
> >>solve our problems.
> >
> >I probably fail to communicate the primary intention here. The point
> >of the knob is _not_ to move the responsibility to userspace. Although
> >I would agree that the knob as proposed might look like that and that is
> >my fault.
> >
> >The primary motivation is to actually help _solving_ our long standing
> >problem. Default non-failing allocations policy is simply wrong and we
> >should move away from it. We have a way to _explicitly_ request such a
> >behavior. Are we in agreement on this part?
> >
> >The problem, as I see it, is that such a change cannot be pushed to
> >Linus tree without extensive testing because there are thousands of code
> >paths which never got exercised. We have basically two options here.
> >Either have a non-upstream patch (e.g. sitting in mmotm and linux-next)
> >and have developers to do their testing. This will surely help to
> >catch a lot of fallouts and fix them right away. But we will miss those
> >who are using Linus based trees and would be willing to help to test
> >in their loads which we never dreamed of.
> >The other option would be pushing an experimental code to the Linus
> >tree (and distribution kernels) and allow people to turn it on to help
> >testing.
> >
> >I am not ignoring the rest of the email, I just want to make sure we are
> >on the same page before we go into a potentially lengthy discussion just
> >to find out we are talking past each other.
> >
> >[...]
>
> After reading this discussion, my impression is: as I understand your
> motivation, the knob is supposed to expose code that has broken handling of
> small allocation failures, because the handling was never exercised (and
> thus found to be broken) before. The steps you are proposing are to allow
> this to be tested by those who understand that it might break their
> machines, until those broken allocation sites are either fixed or converted
> to __GFP_NOFAIL. We want the change of implicit nofail behavior to happen,
> as then we limit the potential deadlocks to explicitly annotated allocation
> sites, which simplifies efforts to prevent the deadlocks (e.g. with
> reserves).

Exactly.

> AFAIU, Johannes is worried that the knob adds some possibility that
> allocations will fail prematurely, even though further trying would allow it
> to succeed and would not introduce a deadlock.

OK, that definitely wasn't an intention and I have realized that the
knob as proposed is a bad way to go forward. I should have gone with
on/off knob which would only tell whether legacy mode (don't fail small
allocations by default) is enabled or disabled.

But this would still have the configuration space for testing issues
mentioned by Johannes. So I would like to hear what people think about
the following points:
1) Should we move away from non-failing small allocation default at all?
2) If yes do we want a transition plan or simply step in that direction
and wait for what pops out?

> The probability of this is
> hard to predict even inside MM, yet we assume that userspace will set the
> value. This might discourage some of the volunteers that would be willing to
> test the new behavior, since they could get extra spurious failures. He
> would like to see this to be as reliable as possible, failing allocation
> only when it's absolutely certain that nothing else can be done, and not
> depend on a magically set value from userspace. He also believes that we can
> still improve on the what "can be done" part.

And I agree with that. I am sorry if that wasn't clear from my previous
emails. I just believe that what ever improvements we come up with we
still should make the (non)failing semantic clear.

> I'll add that I think if we do improve the reclaim etc, and make allocations
> failures rarer, then the whole testing effort will have much lower chance of
> finding the places where allocation failures are not handled properly. Also
> Michal says that catching those depend on running all "their loads which we
> never dreamed of". In that case, if our goal is to fix all broken allocation
> sites with some quantifiable probability, I'm afraid we might be really
> better off with some form of fault injection, which will trigger the
> failures with the probability we set, and not depend on corner case
> low memory conditions manifesting just at the time the workload is at
> one of the broken allocation sites.

Fault injection is certainly another and orthogonal way to go IMO.

--
Michal Hocko
SUSE Labs
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/