Re: [PATCH 1/2 v2] bcachefs: do not use PF_MEMALLOC_NORECLAIM
From: Dave Chinner
Date: Sun Sep 01 2024 - 20:23:26 EST
On Thu, Aug 29, 2024 at 09:32:52AM -0400, Kent Overstreet wrote:
> On Thu, Aug 29, 2024 at 11:12:18PM GMT, Dave Chinner wrote:
> > On Thu, Aug 29, 2024 at 06:02:32AM -0400, Kent Overstreet wrote:
> > > On Wed, Aug 28, 2024 at 02:09:57PM GMT, Dave Chinner wrote:
> > > > On Tue, Aug 27, 2024 at 08:15:43AM +0200, Michal Hocko
> > > > wrote:
> > > > > From: Michal Hocko <mhocko@xxxxxxxx>
> > > > >
> > > > > bch2_new_inode relies on PF_MEMALLOC_NORECLAIM to try to
> > > > > allocate a new inode to achieve GFP_NOWAIT semantics while
> > > > > holding locks. If this allocation fails it will drop locks
> > > > > and use GFP_NOFS allocation context.
> > > > >
> > > > > We would like to drop PF_MEMALLOC_NORECLAIM because it is
> > > > > really dangerous to use if the caller doesn't control the
> > > > > full call chain with this flag set. E.g. if any function
> > > > > down the chain needs a GFP_NOFAIL request,
> > > > > PF_MEMALLOC_NORECLAIM would override it and cause an
> > > > > unexpected failure.
> > > > >
> > > > > While this is not the case in this particular instance,
> > > > > the scoped gfp semantics are not really needed because we
> > > > > can easily push the allocation context down the chain
> > > > > without too much clutter.
> > > > >
> > > > > Acked-by: Christoph Hellwig <hch@xxxxxx>
> > > > > Signed-off-by: Michal Hocko <mhocko@xxxxxxxx>
> > > >
> > > > Looks good to me.
> > > >
> > > > Reviewed-by: Dave Chinner <dchinner@xxxxxxxxxx>
> > >
> > > Reposting what I wrote in the other thread:
> >
> > I've read the thread. I've heard what you have had to say. Like
> > several other people, I think your position is just not
> > practical or reasonable.
> >
> > I don't care about the purity or the safety of the API - the
> > practical result of PF_MEMALLOC_NORECLAIM is that __GFP_NOFAIL
> > allocation can now fail and that will cause unexpected kernel
> > crashes. Keeping existing code and API semantics working
> > correctly (i.e. regression free) takes precedence over new
> > functionality or API features that people want to introduce.
> >
> > That's all there is to it. This is not a hill you need to die
> > on.
>
> And more than that, this is coming from you saying "We didn't have
> to handle memory allocation failures in IRIX, why can't we be like
> IRIX? All those error paths are a pain to test, why can't we get
> rid of them?"
>
You're not listening, Kent. We are not eliding error paths because
they aren't (or cannot be) tested.
It's a choice about whether a transient error (e.g. ENOMEM) should
be considered a fatal error or not. The architectural choice that
was made for XFS back in the 1990s was that the filesystem should
never fail when transient errors occur. The choice was to wait for
the transient error to go away and then continue on. The rest of the
filesystem was built around these fundamental behavioural choices.
This goes beyond memory allocation - we do it for IO errors, too.
e.g. metadata writeback keeps trying to write back the metadata
repeatedly on -EIO. On every EIO from a metadata write, we will
immediately attempt a rewrite without a backoff. If that rewrite
then fails, we requeue the write for later resubmission. That means
we back off and wait for up to 30s before attempting the next
rewrite.
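In rough pseudo-C, that policy has the following shape (an
illustrative sketch only; every identifier here is hypothetical,
not the real buffer cache code):

	/*
	 * Sketch of async metadata writeback error handling.
	 * All names are made up for clarity.
	 */
	void metadata_write_done(struct meta_buf *bp, int error)
	{
		if (!error) {
			bp->retries = 0;	/* success resets the count */
			return;
		}

		/* first -EIO: rewrite immediately, no backoff */
		if (error == -EIO && bp->retries++ == 0) {
			resubmit_write(bp);
			return;
		}

		/* later failures: requeue, retried within ~30s */
		if (bp->retries < bp->max_retries) {
			requeue_for_delayed_write(bp);
			return;
		}

		/* configurable retry limit exceeded: now it's permanent */
		shutdown_filesystem(bp->mount);
	}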
Hence -EIO on async metadata writeback won't fail or shut down the
filesystem until a (configurable) number of repeated failures occurs
or the filesystem unmounts before the metadata could be written back
successfully.
There's good reason for this "wait for transients to resolve" method
of error handling - go back to the late 1990s and early 2000s and
high-end multi-path FC SAN based storage was well known to have
transient path failures that could take minutes to resolve before a
secondary path takes over. That was the sort of storage environment
XFS was designed to operate in, and those users expected the
filesystem to be extremely tolerant of transient failure conditions.
Hence failing an IO and shutting down the filesystem because there
are transient errors occurring in either the storage or the OS was
absolutely the wrong thing to be doing. It still is the wrong thing
to be doing - we want to wait until the transient error has
progressed to being classified as a permanent error before we take
drastic action like denying service to the filesystem.
Memory allocation failure has always been considered a transient
error by XFS that the OS will resolve in one way or another in a
relatively short period of time. If we're prepared to wait minutes
for IO path failures to resolve, waiting a couple of seconds for
transient low memory situations to resolve isn't a big deal.
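Concretely, that policy reduces to a retry loop of this shape (a
simplified sketch in the spirit of the old XFS kmem_alloc() loop;
the real code also warned about potential allocation deadlocks):

	static void *
	kmem_alloc_retry(size_t size, gfp_t gfp)
	{
		void	*ptr;

		do {
			ptr = kmalloc(size, gfp);
			if (ptr)
				return ptr;
			/* transient low memory: back off briefly, retry */
			memalloc_retry_wait(gfp);
		} while (1);
	}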
Ranting about how we handle errors without understanding the error
model we are working within is not productive. bcachefs has a
different error handling model to almost every other filesystem out
there, but that doesn't mean every other filesystem must work the
same way that bcachefs does.
If changing this transient error handling model was as simple as
detecting an allocation failure and returning -ENOMEM, we would have
done that 20 years ago. But it isn't - the error handling model is
"block until transients resolve" so that the error handling paths
only need to handle fatal errors.
Therein lies the problem - those error handling paths need to be
substantially changed to be able to handle transient errors such as
ENOMEM. We'd need to either be able to back out of a dirty
transaction or restart the transaction in some way rather than
shutting down the filesystem.
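i.e. every transaction modification path would need to grow
something like this hypothetical shape, which the current log item
infrastructure simply cannot support (all names illustrative):

	/* Hypothetical back-out-and-restart loop; not how XFS works today. */
retry:
	error = start_transaction(mp, &tp);
	if (error)
		return error;

	error = make_modifications(tp);	/* could now fail with -ENOMEM */
	if (error == -ENOMEM) {
		/* would have to undo all dirty log items here */
		cancel_transaction(tp);
		memalloc_retry_wait(GFP_NOFS);
		goto retry;
	}
	if (error) {
		cancel_transaction(tp);
		return error;
	}
	return commit_transaction(tp);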
Put simply: reclassifying ENOMEM from a "wait for transient to
resolve" handler to a "back out and restart" mechanism like bcachefs
uses requires re-architecting the entire log item architecture for
metadata modification tracking and journal space management.
Even if I knew how to implement this right now, it would require
years' worth of engineering effort and resources before it would be
completed and ready for merge. Then it will take years more for all
the existing kernels to cycle out of production.
Further: this "ENOMEM is transient so retry" model has been used
without any significant issues in production systems for mission
critical infrastructure for the past 25+ years. There's a very
strong "if it ain't broke, don't fix it" argument to be made here.
The cost-benefit analysis comes out very strongly on the side of
keeping __GFP_NOFAIL semantics as they currently stand.
> Except that's bullshit; at the very least any dynamically sized
> allocation _definitely_ has to have an error path that's tested, and if
> there's questions about the context a code path might run in, then
> that's another reason.
We know the context we run __GFP_NOFAIL allocations in - transaction
context absolutely requires a task context because we take sleeping
locks, submit and wait on IO, do blocking memory allocation, etc. We
also know the size of the allocations because we've bounds checked
everything before we do an allocation.
Hence this argument of "__GFP_NOFAIL absolutely requires error
checking because an invalid size or wrong context might be used"
is completely irrelevant to XFS. If you call into the filesystem
from an atomic context, you've lost long before we get to memory
allocation because filesystems take sleeping locks....
> GFP_NOFAIL is the problem here, and if it's encouraging this brain
> damaged "why can't we just get rid of error paths?" thinking, then it
> should be removed.
>
> Error paths have to exist, and they have to be tested.
__GFP_NOFAIL is *not the problem*, and we are not "avoiding error
handling". Backing off, looping and trying again is a valid
mechanism for handling transient failure conditions. Having a flag
that tells the allocator to "backoff, loop and try again" is a
perfectly good way of providing a generic error handling mechanism.
IOWs, using __GFP_NOFAIL doesn't mean we are "not handling errors";
it simply means we have moved the error handling we were doing into
the allocator. And yes, we test the hell out of this error
handling path....
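Put another way, modulo details like OOM killer interaction, these
two fragments behave the same; __GFP_NOFAIL just centralises the
retry loop inside the allocator (sketch, hypothetical caller):

	/* open-coded "backoff, loop and try again" */
	do {
		ptr = kmalloc(size, GFP_NOFS);
		if (!ptr)
			memalloc_retry_wait(GFP_NOFS);
	} while (!ptr);

	/* the same policy, handled by the allocator */
	ptr = kmalloc(size, GFP_NOFS | __GFP_NOFAIL);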
-Dave.
--
Dave Chinner
david@xxxxxxxxxxxxx