Re: [PATCH v3 3/3] xfs: correct the zeroing truncate range

From: Zhang Yi
Date: Sat May 18 2024 - 02:35:30 EST


On 2024/5/18 1:59, Darrick J. Wong wrote:
> On Fri, May 17, 2024 at 07:13:55PM +0800, Zhang Yi wrote:
>> From: Zhang Yi <yi.zhang@xxxxxxxxxx>
>>
>> When truncating a realtime file unaligned to a shorter size,
>> xfs_setattr_size() only flush the EOF page before zeroing out, and
>> xfs_truncate_page() also only zeros the EOF block. This could expose
>> stale data since 943bc0882ceb ("iomap: don't increase i_size if it's not
>> a write operation").
>>
>> If the sb_rextsize is bigger than one block, and we have a realtime
>> inode that contains a long enough written extent. If we unaligned
>> truncate into the middle of this extent, xfs_itruncate_extents() could
>> split the extent and align the it's tail to sb_rextsize, there maybe
>> have more than one blocks more between the end of the file. Since
>> xfs_truncate_page() only zeros the trailing portion of the i_blocksize()
>> value, so it may leftover some blocks contains stale data that could be
>> exposed if we append write it over a long enough distance later.
>
> IOWs, any time we truncate down, we need to zero every byte from the new
> EOF all the way to the end of the allocation unit, correct?

Yeah.

>
> Maybe pictures would be easier to reason with. Say you have
> rextsize=30 and a partially written rtextent; each 'W' is a written
> fsblock and 'u' is an unwritten fsblock:
>
> WWWWWWWWWWWWWWWWWWWWWuuuuuuuuu
> ^ old EOF
>
> Now you want to truncate down:
>
> WWWWWWWWWWWWWWWWWWWWWuuuuuuuuu
> ^ new EOF ^ old EOF
>
> Currently, iomap_truncate_blocks only zeroes up to the next i_blocksize,
> so the truncate leaves the file in this state:
>
> WWWWWzWWWWWWWWWWWWWWWuuuuuuuuu
> ^ new EOF ^ old EOF
>
> (where 'z' is a written block with zeroes after EOF)
>
> This is bad because the "W"s between the new and old EOF still contain
> old credit card info or whatever. Now if we mmap the file or whatever,
> we can access those old contents.
>
> So your new patch amends iomap_truncate_page so that it'll zero all the
> way to the end of the @blocksize parameter. That fixes the exposure by
> writing zeroes to the pagecache before we truncate down:
>
> WWWWWzzzzzzzzzzzzzzzzuuuuuuuuu
> ^ new EOF ^ old EOF
>
> Is that correct?
>

Yes, it's correct. However, not only write zeros to the pagecache, but
also flush to disk, please see below for details.

> If so, then why don't we make xfs_truncate_page convert the post-eof
> rtextent blocks back to unwritten status:
>
> WWWWWzuuuuuuuuuuuuuuuuuuuuuuuu
> ^ new EOF ^ old EOF
>
> If we can do that, then do we need the changes to iomap_truncate_page?
> Converting the mapping should be much faster than dirtying potentially
> a lot of data (rt extents can be 1GB in size).

Now that the exposed stale data range (should be zeroed) is only one
rtextsize unit, if we convert the post-eof rtextent blocks to unwritten,
it breaks the alignment of rtextent and the definition of "extsize is used
to specify the size of the blocks in the real-time section of the
filesystem", is it fine? And IIUC, the upcoming xfs force alignment
extent feature seems also need to follow this alignment, right?

>
>> xfs_truncate_page() should flush, zeros out the entire rtextsize range,
>> and make sure the entire zeroed range have been flushed to disk before
>> updating the inode size.
>>
>> Fixes: 943bc0882ceb ("iomap: don't increase i_size if it's not a write operation")
>> Reported-by: Chandan Babu R <chandanbabu@xxxxxxxxxx>
>> Link: https://lore.kernel.org/linux-xfs/0b92a215-9d9b-3788-4504-a520778953c2@xxxxxxxxxxxxxxx
>> Signed-off-by: Zhang Yi <yi.zhang@xxxxxxxxxx>
>> ---
>> fs/xfs/xfs_iomap.c | 35 +++++++++++++++++++++++++++++++----
>> fs/xfs/xfs_iops.c | 10 ----------
>> 2 files changed, 31 insertions(+), 14 deletions(-)
>>
>> diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
>> index 4958cc3337bc..fc379450fe74 100644
>> --- a/fs/xfs/xfs_iomap.c
>> +++ b/fs/xfs/xfs_iomap.c
>> @@ -1466,12 +1466,39 @@ xfs_truncate_page(
>> loff_t pos,
>> bool *did_zero)
>> {
>> + struct xfs_mount *mp = ip->i_mount;
>> struct inode *inode = VFS_I(ip);
>> unsigned int blocksize = i_blocksize(inode);
>> + int error;
>> +
>> + if (XFS_IS_REALTIME_INODE(ip))
>> + blocksize = XFS_FSB_TO_B(mp, mp->m_sb.sb_rextsize);
>
> Don't opencode xfs_inode_alloc_unitsize, please.

Ha, I missed the latest added helper, thanks for pointing this out.

>
>> +
>> + /*
>> + * iomap won't detect a dirty page over an unwritten block (or a
>> + * cow block over a hole) and subsequently skips zeroing the
>> + * newly post-EOF portion of the page. Flush the new EOF to
>> + * convert the block before the pagecache truncate.
>> + */
>> + error = filemap_write_and_wait_range(inode->i_mapping, pos,
>> + roundup_64(pos, blocksize));
>> + if (error)
>> + return error;pos_in_block
>
> Ok so this is hoisting the filemap_write_and_wait_range call from
> xfs_setattr_size. It's curious that we need to need to twiddle anything
> other than the EOF block itself though?

Since we planed to zero out the dirtied range which is ailgned to the
extsize instead of the blocksize, ensure one block is not unwritten is
not enough, we should also make sure that the range which is going to
zero out is not unwritten, or else the iomap_zero_iter() will skip
zeroing out the extra blocks.

For example:

before zeroing:
|<- extszie ->|
...dddddddddddddddddddd
...UUUUUUUUUUUUUUUUUUUU
^ ^
new EOF old EOF (where 'd' means the pagecache is dirty)

if we only flush the new EOF block, the result becomes:

|<- extszie ->|
zddddddddddddddddddd
ZUUUUUUUUUUUUUUUUUUU
^ ^
new EOF old EOF


then the dirty extent range that between new EOF block and the old EOF
block can't be zeroed sine it's still unwritten. So we have to flush the
whole range before zeroing out.

>
>>
>> if (IS_DAX(inode))
>> - return dax_truncate_page(inode, pos, blocksize, did_zero,
>> - &xfs_dax_write_iomap_ops);
>> - return iomap_truncate_page(inode, pos, blocksize, did_zero,
>> - &xfs_buffered_write_iomap_ops);
>> + error = dax_truncate_page(inode, pos, blocksize, did_zero,
>> + &xfs_dax_write_iomap_ops);
>> + else
>> + error = iomap_truncate_page(inode, pos, blocksize, did_zero,
>> + &xfs_buffered_write_iomap_ops);
>> + if (error)
>> + return error;
>> +
>> + /*
>> + * Write back path won't write dirty blocks post EOF folio,
>> + * flush the entire zeroed range before updating the inode
>> + * size.
>> + */
>> + return filemap_write_and_wait_range(inode->i_mapping, pos,
>> + roundup_64(pos, blocksize));
>
> ...but what is the purpose of the second filemap_write_and_wait_range
> call? Is that to flush the bytes between new and old EOF to disk before
> truncate_setsize invalidates the (zeroed) pagecache?
>

The second filemap_write_and_wait_range() call is used to make sure that
the zeroed data be flushed to disk before we updating i_size. If we don't
add this one, once the i_size is been changed, the zeroed data which
beyond the new EOF folio(block) couldn't be write back, because
iomap_writepage_map()->iomap_writepage_handle_eof() skip that range, so
the stale data problem is still there.

For example:

before zeroing:
|<- extszie ->|
wwwwwwwwwwwwwwwwwwww (pagecache)
...WWWWWWWWWWWWWWWWWWWW (disk)
^ ^
new EOF EOF (where 'w' means the pagecache contains data)

then iomap_truncate_page() zeroing out the pagecache:

|<- extszie ->|
zzzzzzzzzzzzzzzzzzzz (pagecache)
WWWWWWWWWWWWWWWWWWWW (disk)
^ ^
new EOF EOF

then update i_size, sync and drop cache:

|<- extszie ->|
ZWWWWWWWWWWWWWWWWWWW (disk)
^
EOF

Thanks,
Yi.

>
>> }
>> diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
>> index 66f8c47642e8..baeeddf4a6bb 100644
>> --- a/fs/xfs/xfs_iops.c
>> +++ b/fs/xfs/xfs_iops.c
>> @@ -845,16 +845,6 @@ xfs_setattr_size(
>> error = xfs_zero_range(ip, oldsize, newsize - oldsize,
>> &did_zeroing);
>> } else {
>> - /*
>> - * iomap won't detect a dirty page over an unwritten block (or a
>> - * cow block over a hole) and subsequently skips zeroing the
>> - * newly post-EOF portion of the page. Flush the new EOF to
>> - * convert the block before the pagecache truncate.
>> - */
>> - error = filemap_write_and_wait_range(inode->i_mapping, newsize,
>> - newsize);
>> - if (error)
>> - return error;
>> error = xfs_truncate_page(ip, newsize, &did_zeroing);
>> }
>>
>> --
>> 2.39.2
>>
>>