Re: [PATCH v2 2/4] mm/vmalloc: add support for __GFP_NOFAIL

From: Andrew Morton
Date: Tue Nov 23 2021 - 22:48:41 EST

On Wed, 24 Nov 2021 14:16:56 +1100 "NeilBrown" <neilb@xxxxxxx> wrote:

> On Wed, 24 Nov 2021, Andrew Morton wrote:
> >
> > I added GFP_NOFAIL back in the mesozoic era because quite a lot of
> > sites were doing open-coded try-forever loops. I thought "hey, they
> > shouldn't be doing that in the first place, but let's at least
> > centralize the concept to reduce code size, code duplication and so
> > it's something we can now grep for". But longer term, all GFP_NOFAIL
> > sites should be reworked to no longer need to do the retry-forever
> > thing. In retrospect, this bright idea of mine seems to have added
> > license for more sites to use retry-forever. Sigh.
>
> One of the costs of not having GFP_NOFAIL (or similar) is lots of
> untested failure-path code.

Well that's bad of the relevant developers and testers! It isn't that
hard to fake up allocation failures. Either with the formal fault
injection framework or with ad-hackery.
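
As a rough userspace illustration of the ad-hoc route (not kernel code, and
not the fault injection framework): wrap the allocator so that the Nth call
fails, then walk N upwards until the code under test succeeds. The names
test_alloc(), fail_nth, struct pair and exercise_failure_paths() are made up
for this sketch.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/* Hypothetical test hook: fail the Nth allocation from now (0 = never). */
static unsigned long fail_nth;

static void *test_alloc(size_t size)
{
	if (fail_nth && --fail_nth == 0)
		return NULL;		/* injected failure */
	return malloc(size);
}

struct pair {
	char *a;
	char *b;
};

/* Two-step setup that unwinds and returns -ENOMEM on failure. */
static int pair_init(struct pair *p)
{
	p->a = NULL;
	p->b = NULL;

	p->a = test_alloc(64);
	if (!p->a)
		goto err;
	p->b = test_alloc(64);
	if (!p->b)
		goto err_free_a;
	return 0;

err_free_a:
	free(p->a);
	p->a = NULL;
err:
	return -ENOMEM;
}

static void pair_destroy(struct pair *p)
{
	free(p->a);
	free(p->b);
	p->a = NULL;
	p->b = NULL;
}

/*
 * Fail each allocation in turn until pair_init() finally succeeds.
 * Returns the number of injected failures that were handled.
 */
static int exercise_failure_paths(void)
{
	struct pair p;
	int failures = 0;
	unsigned long n;

	for (n = 1; ; n++) {
		fail_nth = n;
		if (pair_init(&p) == 0) {
			fail_nth = 0;	/* disarm the hook */
			pair_destroy(&p);
			break;
		}
		/* The error path must leave nothing allocated behind. */
		assert(p.a == NULL && p.b == NULL);
		failures++;
	}
	return failures;
}
```

Calling exercise_failure_paths() drives every error branch of pair_init()
once. Running the same loop under a leak checker is one way to confirm the
unwinding is complete.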

> When does an allocation that is allowed to retry and reclaim ever fail
> anyway? I think the answer is "only when it has been killed by the oom
> killer". That of course cannot happen to kernel threads, so maybe
> kernel threads should never need GFP_NOFAIL??
>
> I'm not sure the above is 100%, but I do think that is the sort of
> semantic that we want. We want to know what kmalloc failure *means*.
> We also need well defined and documented strategies to handle it.
> mempools are one such strategy, but not always suitable.

Well, mempools aren't "handling" it. They're just another layer to
make memory allocation attempts appear to be magical. The preferred
answer is "just handle the damned error and return ENOMEM".

Obviously this gets very painful at times (arguably because of
high-level design shortcomings). The old radix_tree_preload approach
was bulletproof, but was quite a lot of fuss.
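
For anyone who hasn't seen the pattern, here is a minimal userspace sketch of
the preload idea. It is not the radix tree code: struct preload,
preload_fill(), preload_take() and PRELOAD_MAX are invented for illustration,
and a plain linked list stands in for the tree. The fallible allocation
happens up front, where returning ENOMEM is easy, and the insert itself draws
only from the reserve.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdlib.h>

#define PRELOAD_MAX 4	/* assumed worst-case node count for one insert */

struct node {
	struct node *next;
	int key;
};

struct preload {
	struct node *free;	/* singly linked reserve of spare nodes */
	int count;
};

/* May fail. Call it where -ENOMEM can still be returned cleanly. */
static int preload_fill(struct preload *pl)
{
	while (pl->count < PRELOAD_MAX) {
		struct node *n = malloc(sizeof(*n));

		if (!n)
			return -ENOMEM;	/* nodes already reserved are kept */
		n->next = pl->free;
		pl->free = n;
		pl->count++;
	}
	return 0;
}

/* Cannot fail as long as the caller stays within PRELOAD_MAX nodes. */
static struct node *preload_take(struct preload *pl)
{
	struct node *n = pl->free;

	assert(n != NULL);
	pl->free = n->next;
	pl->count--;
	return n;
}

/* The "critical" step: no allocator call, hence no failure path. */
static void list_insert(struct node **head, struct preload *pl, int key)
{
	struct node *n = preload_take(pl);

	n->key = key;
	n->next = *head;
	*head = n;
}

/* Hand back whatever the insert did not consume. */
static void preload_drain(struct preload *pl)
{
	while (pl->free) {
		struct node *n = pl->free;

		pl->free = n->next;
		free(n);
	}
	pl->count = 0;
}
```

The fuss is visible even at this size: every caller has to fill, insert and
then deal with the leftovers.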

> Preallocating can also be useful but can be clumsy to implement. Maybe
> we should support a process preallocating a bunch of pages which can
> only be used by the process - and are auto-freed when the process
> returns to user-space. That might allow the "error paths" to be simple
> and early, and subsequent allocations that were GFP_USEPREALLOC would be
> safe.

Yes, I think something like that would be quite feasible. Need to
prevent interrupt code from stealing the task's page store.

It can be a drag calculating (by hand) what the max possible amount of
allocation will be and one can end up allocating and then not using a
lot of memory.
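
To make that concrete, a hedged userspace sketch of such a per-task store.
Nothing here exists in the kernel: struct task_reserve, reserve_prealloc(),
reserve_alloc(), reserve_release(), PAGE_SZ and RESERVE_MAX are all
hypothetical, and the interrupt-stealing problem is not modelled. The return
value of reserve_release() is the over-reservation, i.e. the pages that were
set aside and never used.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

#define PAGE_SZ		4096	/* assumed page size for the sketch */
#define RESERVE_MAX	16	/* arbitrary cap on the per-task store */

struct task_reserve {
	void *pages[RESERVE_MAX];
	int nr;			/* pages currently held in reserve */
};

/*
 * The early, simple error path: top the reserve up to 'want' pages.
 * On -ENOMEM the pages obtained so far stay in the reserve; the
 * caller drops them with reserve_release().
 */
static int reserve_prealloc(struct task_reserve *r, int want)
{
	if (want > RESERVE_MAX)
		return -EINVAL;
	while (r->nr < want) {
		void *p = malloc(PAGE_SZ);

		if (!p)
			return -ENOMEM;
		r->pages[r->nr++] = p;
	}
	return 0;
}

/* Later allocations draw on the reserve; NULL means it was undersized. */
static void *reserve_alloc(struct task_reserve *r)
{
	return r->nr ? r->pages[--r->nr] : NULL;
}

/*
 * Stand-in for "auto-freed when the process returns to user-space".
 * Returns how many reserved pages were never used.
 */
static int reserve_release(struct task_reserve *r)
{
	int unused = r->nr;

	while (r->nr)
		free(r->pages[--r->nr]);
	return unused;
}
```

With a worst-case estimate of 8 pages and 2 actually consumed,
reserve_release() reports 6 unused pages.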

I forget why radix_tree_preload used a cpu-local store rather than a
per-task one.

Plus a preallocation interface would have to answer "what order pages
would you like" and "on which node" and "in which zone", etc...