Re: [PATCH v3 4/9] xfs: convert delayed extents to unwritten when zeroing post eof blocks

From: Darrick J. Wong
Date: Tue Mar 19 2024 - 17:00:29 EST


On Tue, Mar 19, 2024 at 09:10:57AM +0800, Zhang Yi wrote:
> From: Zhang Yi <yi.zhang@xxxxxxxxxx>
>
> Current clone operation could be non-atomic if the destination of a file
> is beyond EOF, user could get a file with corrupted (zeroed) data on
> crash.
>
> The problem is about to pre-alloctions. If you write some data into a

"...is about preallocations."

> file [A, B) (the position letters are increased one by one), and xfs

I think it would help with understandability if you'd pasted the ascii
art from the previous thread(s) into this commit message:

"The problem is about preallocations. If you write some data into a
file:

[A...B)

and XFS decides to preallocate some post-eof blocks, then it can create
a delayed allocation reservation:

[A.........D)

The writeback path tries to convert delayed extents to real ones by
allocating blocks. If there aren't enough contiguous free space, we can
end up with two extents, the first real and the second still delalloc:

[A....C)[C.D)

After that, both the in-memory and the on-disk file sizes are still B.
If we clone into the range [E...F) from another file:

[A....C)[C.D) [E...F)

then xfs_reflink_zero_posteof calls iomap_zero_range to zero out the
range [B, E) beyond EOF and flush it. Since [C, D) is still a delalloc
extent, its pagecache will be zeroed and both the in-memory and on-disk
size will be updated to D after flushing but before cloning. This is
wrong, because the user can see the size change and read the zeroes
while the clone operation is ongoing."

> could pre-allocate some blocks, then we get a delayed extent [A, D).
> Then the writeback path allocate blocks and convert this delayed extent
> [A, C) since lack of enough contiguous physical blocks, so the extent
> [C, D) is still delayed. After that, both the in-memory and the on-disk
> file size are B. If we clone file range into [E, F) from another file,
> xfs_reflink_zero_posteof() would call iomap_zero_range() to zero out the
> range [B, E) beyond EOF and flush range. Since [C, D) is still a delayed
> extent, it will be zeroed and the file's in-memory && on-disk size will
> be updated to D after flushing and before doing the clone operation.
> This is wrong, because user can user can see the size change and read
> zeros in the middle of the clone operation.
>
> We need to keep the in-memory and on-disk size before the clone
> operation starts, so instead of writing zeroes through the page cache
> for delayed ranges beyond EOF, we convert these ranges to unwritten and
> invalidating any cached data over that range beyond EOF.

"...and invalidate any cached data..."

> Suggested-by: Dave Chinner <david@xxxxxxxxxxxxx>
> Signed-off-by: Zhang Yi <yi.zhang@xxxxxxxxxx>
> Reviewed-by: Christoph Hellwig <hch@xxxxxx>
> ---
> fs/xfs/xfs_iomap.c | 29 +++++++++++++++++++++++++++++
> 1 file changed, 29 insertions(+)
>
> diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
> index ccf83e72d8ca..1a6d05830433 100644
> --- a/fs/xfs/xfs_iomap.c
> +++ b/fs/xfs/xfs_iomap.c
> @@ -1035,6 +1035,24 @@ xfs_buffered_write_iomap_begin(
> }
>
> if (imap.br_startoff <= offset_fsb) {
> + /*
> + * For zeroing out delayed allocation extent, we trim it if
> + * it's partial beyonds EOF block, or convert it to unwritten
> + * extent if it's all beyonds EOF block.

"Trim a delalloc extent that extends beyond the EOF block. If it starts
beyond the EOF block, convert it to an unwritten extent."

> + */
> + if ((flags & IOMAP_ZERO) &&
> + isnullstartblock(imap.br_startblock)) {
> + xfs_fileoff_t eof_fsb = XFS_B_TO_FSB(mp, XFS_ISIZE(ip));
> +
> + if (offset_fsb >= eof_fsb)
> + goto convert_delay;
> + if (end_fsb > eof_fsb) {
> + end_fsb = eof_fsb;
> + xfs_trim_extent(&imap, offset_fsb,
> + end_fsb - offset_fsb);
> + }
> + }
> +
> /*
> * For reflink files we may need a delalloc reservation when
> * overwriting shared extents. This includes zeroing of
> @@ -1158,6 +1176,17 @@ xfs_buffered_write_iomap_begin(
> xfs_iunlock(ip, lockmode);
> return xfs_bmbt_to_iomap(ip, iomap, &imap, flags, 0, seq);
>
> +convert_delay:
> + xfs_iunlock(ip, lockmode);
> + truncate_pagecache(inode, offset);
> + error = xfs_bmapi_convert_delalloc(ip, XFS_DATA_FORK, offset,
> + iomap, NULL);

Either indent this two tabs (XFS style), or make the second line of
arguments line up with the start of the arguments on the first line
(kernel style).

With those improvements applied,
Reviewed-by: Darrick J. Wong <djwong@xxxxxxxxxx>

--D

> + if (error)
> + return error;
> +
> + trace_xfs_iomap_alloc(ip, offset, count, XFS_DATA_FORK, &imap);
> + return 0;
> +
> found_cow:
> seq = xfs_iomap_inode_sequence(ip, 0);
> if (imap.br_startoff <= offset_fsb) {
> --
> 2.39.2
>
>