Re: [PATCH -next v6 1/2] xfs: reserve blocks for truncating large realtime inode
From: Dave Chinner
Date: Sun Jun 30 2024 - 21:16:19 EST
On Tue, Jun 18, 2024 at 10:21:11PM +0800, Zhang Yi wrote:
> From: Zhang Yi <yi.zhang@xxxxxxxxxx>
>
> When unaligned truncate down a big realtime file, xfs_truncate_page()
> only zeros out the tail EOF block, __xfs_bunmapi() should split the tail
> written extent and convert the later one that beyond EOF block to
> unwritten, but it couldn't work as expected now since the reserved block
> is zero in xfs_setattr_size(), this could expose stale data just after
> commit '943bc0882ceb ("iomap: don't increase i_size if it's not a write
> operation")'.
>
> If we truncate file that contains a large enough written extent:
>
> |< rxext >|< rtext >|
> ...WWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWW
> ^ (new EOF) ^ old EOF
>
> Since we only zeros out the tail of the EOF block, and
> xfs_itruncate_extents()->..->__xfs_bunmapi() unmap the whole ailgned
> extents, it becomes this state:
>
> |< rxext >|
> ...WWWzWWWWWWWWWWWWW
> ^ new EOF
>
> Then if we do an extending write like this, the blocks in the previous
> tail extent becomes stale:
>
> |< rxext >|
> ...WWWzSSSSSSSSSSSSS..........WWWWWWWWWWWWWWWWW
> ^ old EOF ^ append start ^ new EOF
>
> Fix this by reserving XFS_DIOSTRAT_SPACE_RES blocks for big realtime
> inode.
This same problem is going to happen with force aligned allocations,
right? i.e. it is a result of having a allocation block size larger
than one filesystem block?
If so, then....
> Signed-off-by: Zhang Yi <yi.zhang@xxxxxxxxxx>
> Reviewed-by: Christoph Hellwig <hch@xxxxxx>
> ---
> fs/xfs/xfs_iops.c | 15 ++++++++++++++-
> 1 file changed, 14 insertions(+), 1 deletion(-)
>
> diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
> index ff222827e550..a00dcbc77e12 100644
> --- a/fs/xfs/xfs_iops.c
> +++ b/fs/xfs/xfs_iops.c
> @@ -17,6 +17,8 @@
> #include "xfs_da_btree.h"
> #include "xfs_attr.h"
> #include "xfs_trans.h"
> +#include "xfs_trans_space.h"
> +#include "xfs_bmap_btree.h"
> #include "xfs_trace.h"
> #include "xfs_icache.h"
> #include "xfs_symlink.h"
> @@ -811,6 +813,7 @@ xfs_setattr_size(
> struct xfs_trans *tp;
> int error;
> uint lock_flags = 0;
> + uint resblks = 0;
> bool did_zeroing = false;
>
> xfs_assert_ilocked(ip, XFS_IOLOCK_EXCL | XFS_MMAPLOCK_EXCL);
> @@ -917,7 +920,17 @@ xfs_setattr_size(
> return error;
> }
>
> - error = xfs_trans_alloc(mp, &M_RES(mp)->tr_itruncate, 0, 0, 0, &tp);
> + /*
> + * For realtime inode with more than one block rtextsize, we need the
> + * block reservation for bmap btree block allocations/splits that can
> + * happen since it could split the tail written extent and convert the
> + * right beyond EOF one to unwritten.
> + */
> + if (xfs_inode_has_bigrtalloc(ip))
> + resblks = XFS_DIOSTRAT_SPACE_RES(mp, 0);
.... should this be doing this generic check instead:
if (xfs_inode_alloc_unitsize(ip) > 1)
resblks = XFS_DIOSTRAT_SPACE_RES(mp, 0);
-Dave.
--
Dave Chinner
david@xxxxxxxxxxxxx