Re: [PATCH v3 3/3] xfs: correct the zeroing truncate range

From: Zhang Yi
Date: Tue May 21 2024 - 21:57:39 EST


On 2024/5/21 10:38, Dave Chinner wrote:
> On Fri, May 17, 2024 at 07:13:55PM +0800, Zhang Yi wrote:
>> From: Zhang Yi <yi.zhang@xxxxxxxxxx>
>>
>> When truncating a realtime file to a shorter, unaligned size,
>> xfs_setattr_size() only flushes the EOF page before zeroing, and
>> xfs_truncate_page() only zeroes the EOF block. This could expose
>> stale data since 943bc0882ceb ("iomap: don't increase i_size if it's not
>> a write operation").
>>
>> If sb_rextsize is bigger than one block and a realtime inode contains
>> a long enough written extent, an unaligned truncate into the middle of
>> this extent means xfs_itruncate_extents() could split the extent and
>> align its tail to sb_rextsize, so there may be more than one block
>> between the new end of the file and the end of the extent. Since
>> xfs_truncate_page() only zeroes the trailing portion of the
>> i_blocksize() value, it may leave behind blocks containing stale data
>> that could be exposed if a later append write extends the file far
>> enough.
>>
>> xfs_truncate_page() should flush and zero out the entire rtextsize
>> range, and make sure the entire zeroed range has been flushed to disk
>> before updating the inode size.
>>
>> Fixes: 943bc0882ceb ("iomap: don't increase i_size if it's not a write operation")
>> Reported-by: Chandan Babu R <chandanbabu@xxxxxxxxxx>
>> Link: https://lore.kernel.org/linux-xfs/0b92a215-9d9b-3788-4504-a520778953c2@xxxxxxxxxxxxxxx
>> Signed-off-by: Zhang Yi <yi.zhang@xxxxxxxxxx>
>> ---
>> fs/xfs/xfs_iomap.c | 35 +++++++++++++++++++++++++++++++----
>> fs/xfs/xfs_iops.c | 10 ----------
>> 2 files changed, 31 insertions(+), 14 deletions(-)
>>
>> diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
>> index 4958cc3337bc..fc379450fe74 100644
>> --- a/fs/xfs/xfs_iomap.c
>> +++ b/fs/xfs/xfs_iomap.c
>> @@ -1466,12 +1466,39 @@ xfs_truncate_page(
>> loff_t pos,
>> bool *did_zero)
>> {
>> + struct xfs_mount *mp = ip->i_mount;
>> struct inode *inode = VFS_I(ip);
>> unsigned int blocksize = i_blocksize(inode);
>> + int error;
>> +
>> + if (XFS_IS_REALTIME_INODE(ip))
>> + blocksize = XFS_FSB_TO_B(mp, mp->m_sb.sb_rextsize);
>> +
>> + /*
>> + * iomap won't detect a dirty page over an unwritten block (or a
>> + * cow block over a hole) and subsequently skips zeroing the
>> + * newly post-EOF portion of the page. Flush the new EOF to
>> + * convert the block before the pagecache truncate.
>> + */
>> + error = filemap_write_and_wait_range(inode->i_mapping, pos,
>> + roundup_64(pos, blocksize));
>> + if (error)
>> + return error;
>>
>> if (IS_DAX(inode))
>> - return dax_truncate_page(inode, pos, blocksize, did_zero,
>> - &xfs_dax_write_iomap_ops);
>> - return iomap_truncate_page(inode, pos, blocksize, did_zero,
>> - &xfs_buffered_write_iomap_ops);
>> + error = dax_truncate_page(inode, pos, blocksize, did_zero,
>> + &xfs_dax_write_iomap_ops);
>> + else
>> + error = iomap_truncate_page(inode, pos, blocksize, did_zero,
>> + &xfs_buffered_write_iomap_ops);
>> + if (error)
>> + return error;
>> +
>> + /*
>> + * Write back path won't write dirty blocks post EOF folio,
>> + * flush the entire zeroed range before updating the inode
>> + * size.
>> + */
>> + return filemap_write_and_wait_range(inode->i_mapping, pos,
>> + roundup_64(pos, blocksize));
>> }
>
> Ok, this means we do -three- blocking writebacks through this path
> instead of one or maybe two.
>
> We already know that this existing blocking writeback case for dirty
> pages over unwritten extents is a significant performance issue for
> some workloads. I have a fix in progress for iomap to handle this
> case without requiring blocking writeback to be done to convert the
> extent to written before we do the truncate.
>
> Regardless, I think this whole "truncate is allocation unit size
> aware" algorithm is largely unworkable without a rewrite. What XFS
> needs to do on truncate *down* before we start the truncate
> transaction is pretty simple:
>
> - ensure that the new EOF extent tail contains zeroes
> - ensure that the range from the existing ip->i_disk_size to
> the new EOF is on disk so data vs metadata ordering is
> correct for crash recovery purposes.
>
> What this patch does to achieve that is:
>
> 1. blocking writeback to clean dirty unwritten/cow blocks at
> the new EOF.
> 2. iomap_truncate_page() writes zeroes into the page cache,
> which dirties the pages we just cleaned at the new EOF.
> 3. blocking writeback to clean the dirty blocks at the new
> EOF.
> 4. truncate_setsize() then writes zeros to partial folios at
> the new EOF, dirtying the EOF page again.
> 5. blocking writeback to clean dirty blocks from the current
> on-disk size to the new EOF.
>
> This is pretty crazy when you stop and think about it. We're writing
> the same EOF block -three- times. The first data write gets
> overwritten by zeroes on the second write, and the third write
> writes the same zeroes as the second write. There are two redundant
> *blocking* writes in this process.

Yes, this is indeed a performance disaster, and iomap_zero_range()
should be aware of dirty pages. I ran into the same problem when
developing the buffered iomap conversion for ext4.
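
To make the count concrete, here is a toy userspace model of the EOF
block (not kernel code). It assumes the block starts out dirty over an
unwritten extent, treats every flush of a dirty block as one blocking
write, and uses hypothetical helper names (flush(), zero_in_pagecache())
as stand-ins for filemap_write_and_wait_range() and the page cache
zeroing. The second function follows the ordering you propose, assuming
iomap_zero_range() can zero a dirty block in place:

```c
#include <stdbool.h>

/* Toy model of the EOF block's page cache state; not kernel code. */
struct eof_block {
	bool dirty;		/* page cache copy differs from disk */
	int blocking_writes;	/* blocking writebacks that wrote this block */
};

/* Stand-in for filemap_write_and_wait_range(): writes only if dirty. */
static void flush(struct eof_block *b)
{
	if (b->dirty) {
		b->blocking_writes++;
		b->dirty = false;
	}
}

/* Stand-in for zeroing through the page cache: always dirties the block. */
static void zero_in_pagecache(struct eof_block *b)
{
	b->dirty = true;
}

/* Order of operations in the patch, following the five steps quoted above. */
int writes_in_patch_order(void)
{
	struct eof_block b = { .dirty = true, .blocking_writes = 0 };

	flush(&b);		/* 1. clean the dirty unwritten/cow block */
	zero_in_pagecache(&b);	/* 2. iomap_truncate_page() zeroes the tail */
	flush(&b);		/* 3. flush the zeroed range */
	zero_in_pagecache(&b);	/* 4. truncate_setsize() zeroes the partial folio */
	flush(&b);		/* 5. flush from the on-disk size to the new EOF */
	return b.blocking_writes;
}

/* Proposed order: zero in the page cache first, then a single writeback. */
int writes_in_proposed_order(void)
{
	struct eof_block b = { .dirty = true, .blocking_writes = 0 };

	zero_in_pagecache(&b);	/* zero the dirty block directly in the page cache */
	flush(&b);		/* one writeback up to the end of the zeroed range */
	/* the i_size update and invalidation do not dirty the block again */
	return b.blocking_writes;
}
```

In this model writes_in_patch_order() returns 3 and
writes_in_proposed_order() returns 1, which matches the "three writes
versus one" description above.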

>
> We can do all this with a single writeback operation if we are a
> little bit smarter about the order of operations we perform and we
> are a little bit smarter in iomap about zeroing dirty pages in the
> page cache:
>
> 1. change iomap_zero_range() to do the right thing with
> dirty unwritten and cow extents (the patch I've been working
> on).
>
> 2. pass the range to be zeroed into iomap_truncate_page()
> (the fundamental change being made here).
>
> 3. zero the required range *through the page cache*
> (iomap_zero_range() already does this).
>
> 4. write back the XFS inode from ip->i_disk_size to the end
> of the range zeroed by iomap_truncate_page()
> (xfs_setattr_size() already does this).
>
> 5. i_size_write(newsize);
>
> 6. invalidate_inode_pages2_range(newsize, -1) to trash all
> the page cache beyond the new EOF without doing any zeroing
> as we've already done all the zeroing needed to the page
> cache through iomap_truncate_page().
>
>
> The patch I'm working on for step 1 is below. It still needs to be
> extended to handle the cow case, but I'm unclear on how to exercise
> that case so I haven't written the code to do it. The rest of it is
> just rearranging the code that we already use just to get the order
> of operations right. The only notable change in behaviour is using
> invalidate_inode_pages2_range() instead of truncate_pagecache(),
> because we don't want the EOF page to be dirtied again once we've
> already written zeroes to disk....
>

Indeed, this sounds like the best solution. Since Darrick recommended
that we fix the stale data exposure issue on realtime inodes by
converting the tail extent to unwritten, I suppose we could do this
after fixing the problem.
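
For reference, the range involved in the realtime case can be sketched
in userspace C. This is only an illustration: the helper names are
hypothetical, the sizes in the example are made up, and
XFS_FSB_TO_B(mp, mp->m_sb.sb_rextsize) is modeled as a plain
multiplication of block size by the extent size in blocks.

```c
#include <stdint.h>

/* Round x up to the next multiple of y (y > 0), in the spirit of roundup_64(). */
static uint64_t roundup_u64(uint64_t x, uint64_t y)
{
	return ((x + y - 1) / y) * y;
}

/* End of the zeroed range when only the EOF block is handled. */
uint64_t zero_end_block(uint64_t pos, uint32_t blocksize)
{
	return roundup_u64(pos, blocksize);
}

/* End of the zeroed range when the whole realtime extent is handled. */
uint64_t zero_end_rtext(uint64_t pos, uint32_t blocksize, uint32_t rextsize)
{
	return roundup_u64(pos, (uint64_t)blocksize * rextsize);
}

/* Bytes past the EOF block that keep stale data if only that block is zeroed. */
uint64_t unzeroed_bytes(uint64_t pos, uint32_t blocksize, uint32_t rextsize)
{
	return zero_end_rtext(pos, blocksize, rextsize) -
	       zero_end_block(pos, blocksize);
}
```

With a 4096-byte block, an rextsize of 4 blocks and a truncate to
pos = 5000, zero_end_block() gives 8192 while zero_end_rtext() gives
16384, so 8192 bytes of the retained extent tail would not be zeroed.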

Thanks,
Yi.