Re: [PATCH] xfs: Fix file creation failure
From: Darrick J. Wong
Date: Tue Jun 04 2024 - 19:54:06 EST
On Wed, Jun 05, 2024 at 09:00:28AM +1000, Dave Chinner wrote:
> On Tue, Jun 04, 2024 at 03:11:21PM +0800, Zizhi Wo wrote:
> > We have an xfs image that contains only 2 AGs, the first AG is full and
> > the second AG is empty, then a concurrent file creation and little writing
> > could unexpectedly return -ENOSPC error since there is a race window that
> > the allocator could get the wrong agf->agf_longest.
> >
> > Write file process steps:
> > 1) Find the entry that best meets the conditions, then calculate the start
> > address and length of the remaining part of the entry after allocation.
> > 2) Delete this entry. Because the second AG is empty, the btree in its agf
> > has only one record, and agf->agf_longest will be set to 0 after deletion.
> > 3) Insert the remaining unused parts of this entry based on the
> > calculations in 1), and update the agf->agf_longest.
> >
> > Create file process steps:
> > 1) Check whether there are free inodes in the inode chunk.
> > 2) If there is no free inode, check whether there has space for creating
> > inode chunks, perform the no-lock judgment first.
> > 3) If the judgment succeeds, the judgment is performed again with agf lock
> > held. Otherwire, an error is returned directly.
> >
> > If the write process is in step 2) but not go to 3) yet, the create file
> > process goes to 2) at this time, it will be mistaken for no space,
> > resulting in the file system still has space but the file creation fails.
> >
> > Direct write Create file
> > xfs_file_write_iter
> > ...
> > xfs_direct_write_iomap_begin
> > xfs_iomap_write_direct
> > ...
> > xfs_alloc_ag_vextent_near
> > xfs_alloc_cur_finish
> > xfs_alloc_fixup_trees
> > xfs_btree_delete
> > xfs_btree_delrec
> > xfs_allocbt_update_lastrec
> > // longest = 0 because numrec == 0.
> > agf->agf_longest = len = 0
> > xfs_create
> > ...
> > xfs_dialloc
> > ...
> > xfs_alloc_fix_freelist
> > xfs_alloc_space_available
> > -> as longest=0, it will return
> > false, no space for inode alloc.
>
> Ok, so this is a another attempt to address the problem Ye Bin
> attempted to fix here:
>
> https://lore.kernel.org/linux-xfs/20240419061848.1032366-1-yebin10@xxxxxxxxxx/
>
> > Fix this issue by adding the bc_free_longest field to the xfs_btree_cur_t
> > structure to store the potential longest count that will be updated. The
> > assignment is done in xfs_alloc_fixup_trees() and xfs_free_ag_extent().
>
> I outlined how this should be fixed in the above thread:
>
> https://lore.kernel.org/linux-xfs/ZiWgRGWVG4aK1165@xxxxxxxxxxxxxxxxxxx/
>
> This is what I said:
>
> | What we actually want is for pag->pagf_longest not to change
> | transiently to zero in xfs_alloc_fixup_trees(). If the delrec that
> | zeroes the pagf_longest field is going to follow it up with an
> | insrec that will set it back to a valid value, we really should not
> | be doing the zeroing in the first place.
> |
> | Further, the only btree that tracks the right edge of the btree is
> | the by-count allocbt. This isn't "generic" functionality, even
> | though it is implemented through the generic btree code. If we lift
> | ->update_lastrec from the generic code and do it directly in
> | xfs_alloc.c whenever we are finished with a by-count tree update
> | and the cursor points to a record in the right-most leaf of the
> | tree, then we run the lastrec update directly at that point.
> | By decoupling the lastrec updates from the individual record
> | manipulations, we can make the transients disappear completely.
>
> I'm not sure if this patch is an attempt to implement this - there
> is no reference in the commit description to this previous attempt
> to fix the issue, nor is the any discussion of why this particular
> solution was chosen.
>
> In future, when you are trying to fix an issue that has previously
> been discussed/presented on the list, please reference it and
> provide a link to the previous discussions in the changelog for the
> new version of the patchset fixing the issue.
Aha, that's why this seemed oddly familiar.
--D
> > Reported by: Ye Bin <yebin10@xxxxxxxxxx>
> > Signed-off-by: Zizhi Wo <wozizhi@xxxxxxxxxx>
> > ---
> > fs/xfs/libxfs/xfs_alloc.c | 14 ++++++++++++++
> > fs/xfs/libxfs/xfs_alloc_btree.c | 9 ++++++++-
> > fs/xfs/libxfs/xfs_btree.h | 1 +
> > 3 files changed, 23 insertions(+), 1 deletion(-)
> >
> > diff --git a/fs/xfs/libxfs/xfs_alloc.c b/fs/xfs/libxfs/xfs_alloc.c
> > index 6c55a6e88eba..86ba873d57a8 100644
> > --- a/fs/xfs/libxfs/xfs_alloc.c
> > +++ b/fs/xfs/libxfs/xfs_alloc.c
> > @@ -577,6 +577,13 @@ xfs_alloc_fixup_trees(
> > nfbno2 = rbno + rlen;
> > nflen2 = (fbno + flen) - nfbno2;
> > }
> > +
> > + /*
> > + * Record the potential maximum free length in advance.
> > + */
> > + if (nfbno1 != NULLAGBLOCK || nfbno2 != NULLAGBLOCK)
> > + cnt_cur->bc_ag.bc_free_longest = XFS_EXTLEN_MAX(nflen1, nflen2);
> > +
>
> Why do we store the length of a random extent being freed here?
> nflen1/2 almost always have nothing to do with the longest free
> space extent in the tree, they are just the new free space extents
> we are insering into a random location in the free space tree.
>
> > /*
> > * Delete the entry from the by-size btree.
> > */
> > @@ -2044,6 +2051,13 @@ xfs_free_ag_extent(
> > * Now allocate and initialize a cursor for the by-size tree.
> > */
> > cnt_cur = xfs_cntbt_init_cursor(mp, tp, agbp, pag);
> > + /*
> > + * Record the potential maximum free length in advance.
> > + */
> > + if (haveleft)
> > + cnt_cur->bc_ag.bc_free_longest = ltlen;
> > + if (haveright)
> > + cnt_cur->bc_ag.bc_free_longest = gtlen;
>
> That doesn't look correct. At this point in the code, ltlen/gtlen
> are the sizes of the physically adjacent freespace extents that we
> are going to merge the newly freed extent with. i.e. the new
> freespace extent is going to have one of 4 possible values:
>
> no merge: len
> left merge: ltlen + len
> right merge: gtlen + len
> both merge: ltlen + gtlen + len
>
> So regardless of anything else, this code isn't setting the longest
> freespace extent in teh AGF to the lenght of the longest freespace
> extent in the filesystem.
>
> Which leads me to ask: how did you test this code? This bug should
> have been triggering verifier, repair and scrub failures quite
> quickly with fstests....
>
> > /*
> > * Have both left and right contiguous neighbors.
> > * Merge all three into a single free block.
> > diff --git a/fs/xfs/libxfs/xfs_alloc_btree.c b/fs/xfs/libxfs/xfs_alloc_btree.c
> > index 6ef5ddd89600..8e7d1e0f1a63 100644
> > --- a/fs/xfs/libxfs/xfs_alloc_btree.c
> > +++ b/fs/xfs/libxfs/xfs_alloc_btree.c
> > @@ -161,7 +161,14 @@ xfs_allocbt_update_lastrec(
> > rrp = XFS_ALLOC_REC_ADDR(cur->bc_mp, block, numrecs);
> > len = rrp->ar_blockcount;
> > } else {
> > - len = 0;
> > + /*
> > + * Update in advance to prevent file creation failure
> > + * for concurrent processes even though there is no
> > + * numrec currently.
> > + * And there's no need to worry as the value that no
> > + * less than bc_free_longest will be inserted later.
> > + */
> > + len = cpu_to_be32(cur->bc_ag.bc_free_longest);
> > }
>
> So this is in the LASTREC_DELREC case when the last record is
> removed from the btree. This is what causes the transient state
> as we do this when deleting a record to trim it and then re-insert
> the remainder back into the by-count btree.
>
> Writing some random transient value into the AGF *and journalling
> it* means we creating a transient on-disk format structure
> corruption and potentially writing it to persistent storage (i.e.
> the journal). The structure is, at least, not consistent in memory
> because the free space tree is empty at this point in time, whilst
> the agf->longest field says it has a free space available. This
> could trip verifiers, be flagged as corruption by xfs_scrub/repair,
> etc.
>
> Now, this *might be safe* because we *may* clean it up later in the
> transaction, but if this really is the last extent being removed
> from the btree and a cursor has previously been used to do other
> insert and free operations that set this field, then we trip over
> this stale inforamtion and write a corrupt structure to disk. That's
> not good.
>
> As I said above, this "last record tracking" needs to be ripped out
> of the generic btree code because only the by-count btree uses it.
> Then it can be updated at the end of the by-count btree update
> process in the allocation code (i.e. after all record manipulations
> are done in the transaction) and that avoids this transient caused
> by updating the last record on every btree record update that is
> done.
>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> david@xxxxxxxxxxxxx
>