Re: [PATCH 1/2 v2] bcachefs: do not use PF_MEMALLOC_NORECLAIM

From: Kent Overstreet
Date: Sun Sep 01 2024 - 21:35:49 EST


On Mon, Sep 02, 2024 at 10:23:06AM GMT, Dave Chinner wrote:
> On Thu, Aug 29, 2024 at 09:32:52AM -0400, Kent Overstreet wrote:
> > And more than that, this is coming from you saying "We didn't have
> > to handle memory allocation failures in IRIX, why can't we be like
> > IRIX? All those error paths are a pain to test, why can't we get
> > rid of them?"
> >
> You're not listening, Kent. We are not eliding error paths because
> they aren't (or cannot be) tested.
>
> It's a choice between whether a transient error (e.g. ENOMEM) should
> be considered a fatal error or not. The architectural choice that
> was made for XFS back in the 1990s was that the filesystem should
> never fail when transient errors occur. The choice was to wait for
> the transient error to go away and then continue on. The rest of the
> filesystem was build around these fundamental behavioural choices.

<snipped>

> Hence failing an IO and shutting down the filesystem because there
> are transient errors occuring in either the storage or the OS was
> absolutely the wrong thing to be doing. It still is the wrong thing
> to be doing - we want to wait until the transient error has
> progressed to being classified as a permanent error before we take
> drastic action like denying service to the filesystem.

Two different things are getting conflated here.

I'm saying forcefully that __GFP_NOFAIL isn't a good design decision,
and it's bad for system behaviour when we're thrashing, because you,
willy and others have been talking openly about making something like
that the norm, and that concerns me.

I'm _not_ saying that you should have to rearchitect your codebase to
deal with transient -ENOMEMs... that's clearly not going to happen.

But I am saying that kmalloc(__GFP_NOFAIL) _should_ fail and return NULL
in the case of bugs, because that's going to be an improvement w.r.t.
system robustness, in exactly the same way we don't use BUG_ON() if it's
something that we can't guarantee won't happen in the wild - we WARN()
and try to handle the error as best we can.

If we could someday get PF_MEMALLOC_NORECLAIM annotated everywhere we're
not in a sleepable context (or more likely, teach kmalloc() about
preempt count) - that turns a whole class of "scheduling while atomic"
bugs into recoverable errors. That'd be a good thing.

In XFS's case, the only thing you'd ever want to do on failure of a
__GFP_NOFAIL allocation is do an emergency shutdown - the code is buggy,
and those bugs tend to be quickly found and fixed (even quicker when the
system stays usable and the user can collect a backtrace).

> > Except that's bullshit; at the very least any dynamically sized
> > allocation _definitely_ has to have an error path that's tested, and if
> > there's questions about the context a code path might run in, that
> > that's another reason.
>
> We know the context we run __GFP_NOFAIL allocations in - transaction
> context absolutely requires a task context because we take sleeping
> locks, submit and wait on IO, do blocking memory allocation, etc. We
> also know the size of the allocations because we've bounds checked
> everything before we do an allocation.

*nod*

> Hence this argument of "__GFP_NOFAIL aboslutely requires error
> checking because an invalid size or wonrg context might be used"
> is completely irrelevant to XFS. If you call into the filesytsem
> from an atomic context, you've lost long before we get to memory
> allocation because filesystems take sleeping locks....

Yeah, if you know the context of your allocation and you have bounds on
the allocation size that's all fine - that's your call.

_As a general policy_, I will say that even __GFP_NOFAIL allocations
should have error paths - becuase lots of allocations have sizes that
userspace can control, so it becomes a security issue and we need to be
careful what we tell people.

> > GFP_NOFAIL is the problem here, and if it's encouraging this brain
> > damaged "why can't we just get rid of error paths?" thinking, then it
> > should be removed.
> >
> > Error paths have to exist, and they have to be tested.
>
> __GFP_NOFAIL is *not the problem*, and we are not "avoiding error
> handling". Backing off, looping and trying again is a valid
> mechanism for handling transient failure conditions. Having a flag
> that tells the allocator to "backoff, loop and try again" is a
> perfectly good way of providing a generic error handling mechanism.

But we do have to differentiate between transient allocation failures
and failures that are a result of bugs. Because just looping means a
buggy allocation call becomes, at best, a hung task you can't kill.