Re: [PATCH v3 3/3] xfs: correct the zeroing truncate range

From: Zhang Yi
Date: Mon May 20 2024 - 03:11:58 EST


On 2024/5/20 14:56, Zhang Yi wrote:
> On 2024/5/19 3:26, Darrick J. Wong wrote:
>> On Sat, May 18, 2024 at 02:35:02PM +0800, Zhang Yi wrote:
>>> On 2024/5/18 1:59, Darrick J. Wong wrote:
>>>> On Fri, May 17, 2024 at 07:13:55PM +0800, Zhang Yi wrote:
>>>>> From: Zhang Yi <yi.zhang@xxxxxxxxxx>
>>>>>
>>>>> When truncating a realtime file unaligned to a shorter size,
>>>>> xfs_setattr_size() only flush the EOF page before zeroing out, and
>>>>> xfs_truncate_page() also only zeros the EOF block. This could expose
>>>>> stale data since 943bc0882ceb ("iomap: don't increase i_size if it's not
>>>>> a write operation").
>>>>>
>>>>> If the sb_rextsize is bigger than one block, and we have a realtime
>>>>> inode that contains a long enough written extent. If we unaligned
>>>>> truncate into the middle of this extent, xfs_itruncate_extents() could
>>>>> split the extent and align the it's tail to sb_rextsize, there maybe
>>>>> have more than one blocks more between the end of the file. Since
>>>>> xfs_truncate_page() only zeros the trailing portion of the i_blocksize()
>>>>> value, so it may leftover some blocks contains stale data that could be
>>>>> exposed if we append write it over a long enough distance later.
>>
>> Hum. Is this an appending write into the next rtextent? For example,
>> if you start with a file like this:
>>
>> WWWWWWWWWWWWWWWWWWWWWuuuuuuuuu
>> ^ old EOF
>>
>> Then truncate it improperly like this:
>>
>> WWWWWzWWWWWWWWWWWWWWWuuuuuuuuu
>> ^ new EOF
>>
>> Then do an extending write like this:
>>
>> WWWWWzWWWWWWWWWWWWWWWuuuuuuuuuuuuuuuuuuuuuuuuuuuWWWuuuuuuuuu
>> ^ EOF ^ next rtx ^ append here
>>
>> And now the problem is that we've exposed stale data that should be
>> zeroes?
>>
>> WWWWWzWWWWWWWWWWWWWWWuuuuuuuuuuuuuuuuuuuuuuuuuuuWWWuuuuuuuuu
>> ^^^^^^^^^^^^^^^ ^ new EOF
>> should be zeroed
>>
>
> Yeah.
>
>>>>
>>>> IOWs, any time we truncate down, we need to zero every byte from the new
>>>> EOF all the way to the end of the allocation unit, correct?
>>>
>>> Yeah.
>>>
>>>>
>>>> Maybe pictures would be easier to reason with. Say you have
>>>> rextsize=30 and a partially written rtextent; each 'W' is a written
>>>> fsblock and 'u' is an unwritten fsblock:
>>>>
>>>> WWWWWWWWWWWWWWWWWWWWWuuuuuuuuu
>>>> ^ old EOF
>>>>
>>>> Now you want to truncate down:
>>>>
>>>> WWWWWWWWWWWWWWWWWWWWWuuuuuuuuu
>>>> ^ new EOF ^ old EOF
>>>>
>>>> Currently, iomap_truncate_blocks only zeroes up to the next i_blocksize,
>>>> so the truncate leaves the file in this state:
>>>>
>>>> WWWWWzWWWWWWWWWWWWWWWuuuuuuuuu
>>>> ^ new EOF ^ old EOF
>>>>
>>>> (where 'z' is a written block with zeroes after EOF)
>>>>
>>>> This is bad because the "W"s between the new and old EOF still contain
>>>> old credit card info or whatever. Now if we mmap the file or whatever,
>>>> we can access those old contents.
>>>>
>>>> So your new patch amends iomap_truncate_page so that it'll zero all the
>>>> way to the end of the @blocksize parameter. That fixes the exposure by
>>>> writing zeroes to the pagecache before we truncate down:
>>>>
>>>> WWWWWzzzzzzzzzzzzzzzzuuuuuuuuu
>>>> ^ new EOF ^ old EOF
>>>>
>>>> Is that correct?
>>>>
>>>
>>> Yes, it's correct. However, not only write zeros to the pagecache, but
>>> also flush to disk, please see below for details.
>>
>> <nod> iomap_truncate_page writes zeroes to any part of the pagecache
>> backed by written extents, and then xfs must call
>> filemap_write_and_wait_range to write the dirty (zeroed) cache out to
>> disk.
>>
>>>> If so, then why don't we make xfs_truncate_page convert the post-eof
>>>> rtextent blocks back to unwritten status:
>>>>
>>>> WWWWWzuuuuuuuuuuuuuuuuuuuuuuuu
>>>> ^ new EOF ^ old EOF
>>>>
>>>> If we can do that, then do we need the changes to iomap_truncate_page?
>>>> Converting the mapping should be much faster than dirtying potentially
>>>> a lot of data (rt extents can be 1GB in size).
>>>
>>> Now that the exposed stale data range (should be zeroed) is only one
>>> rtextsize unit, if we convert the post-eof rtextent blocks to unwritten,
>>> it breaks the alignment of rtextent and the definition of "extsize is used
>>> to specify the size of the blocks in the real-time section of the
>>> filesystem", is it fine?
>>
>> A written -> unwritten extent conversion doesn't change which physical
>> space extent is mapped to the file data extent; it merely marks the
>> mapping as unwritten.
>>
>> For example, if you start with this mapping:
>>
>> {startoff = 8, startblock 256, blockcount = 8, state = written}
>>
>> and then convert blocks 13-15 to unwritten, you get:
>>
>> {startoff = 8, startblock 256, blockcount = 5, state = written}
>> {startoff = 13, startblock 261, blockcount = 3, state = unwritten}
>>
>> File blocks 8-15 still map to physical space 256-263.
>
> Yeah, indeed.
>
>>
>> In xfs, the entire allocation unit is /always/ mapped to the file, even
>> if parts of it have to be unwritten. Hole punching on rt, for example,
>> converts the punched region to unwritten. This is (iirc) the key
>> difference between xfs rt and ext4 bigalloc. xfs doesn't have or need
>> (or want) the implied cluster allocation code that ext4 has.
>>
>
> I checked the xfs_file_fallocate(), it looks like hole punching on realtime
> inode is still follow the rtextsize alignment, i.e. if we punch hole on a
> file that only contains one written extent, it doesn't split it and convet
> the punched range to unwritten. Please take a look at
> xfs_file_fallocate()->xfs_free_file_space(), it aligned the freeing range
> and zeroing out the whole unligned range in one reextsize unit, and
> FALLOC_FL_ZERO_RANGE is the same.
>
> 836 /* We can only free complete realtime extents. */
> 837 if (xfs_inode_has_bigrtalloc(ip)) {
> 838 startoffset_fsb = xfs_rtb_roundup_rtx(mp, startoffset_fsb);
> 839 endoffset_fsb = xfs_rtb_rounddown_rtx(mp, endoffset_fsb);
> 840 }
> ...
> 864 error = xfs_zero_range(ip, offset, len, NULL);
>
> And I tested it on my machine, it's true that it doesn't do the convertion.
>
> # mkfs.xfs -f -rrtdev=/dev/nvme0n1 -f -m reflink=0,rmapbt=0, -d rtinherit=1 -r extsize=28k /dev/pmem2s
> # mount -ortdev=/dev/nvme0n1 /dev/pmem2s /mnt/scratch
> # xfs_io -f -c "pwrite 0 28k" -c "fsync" /mnt/scratch/foo
> # xfs_io -c "fpunch 4k 24k" /mnt/scratch/foo
> # umount /mnt/scratch
>
> # xfs_db -c "inode 131" -c "p u3.bmx" /dev/pmem2s
> u3.bmx[0] = [startoff,startblock,blockcount,extentflag]
> 0:[0,0,7,0]
>
> Am I missed something?
>
>> I can't tell if there's something that you see that I don't see such
>> that we really /do/ need to actually write zeroes to the entire tail of
>> the rtextent; or if you weren't sure that forcing all the post-eof
>> fsblocks in the rtextent to unwritten (and zapping the pagecache) would
>> actually preserve the rtextsize alignment.
>
> I haven't found any restrictions yet, and I also noticed that a simple
> write is not guaranteed to align the extent to rtextsize, since the write
> back path doesn't zeroing out the extra blocks that align to the
> rtextsize.
>
> # #extsize=28k
> # xfs_io -d -f -c "pwrite 0 4k" -c "fsync" /mnt/scratch/foo
> # xfs_db -c "inode 131" -c "p u3.bmx" /dev/pmem2s
> u3.bmx[0-1] = [startoff,startblock,blockcount,extentflag]
> 0:[0,0,1,0]
> 1:[1,1,6,1]
>
> So I guess convert the tail fsblocks of the rtextent to unwritten status
> could work. However, I'm a little confused, besides the write operation,
> other operations like above punch hold and zero range, they seem to be
> doing their best to follow the alignment rule since commit fe341eb151ec
> ("xfs: ensure that fpunch, fcollapse, and finsert operations are aligned
> to rt extent size") [1], it looks like this commit is to fix some issues,
> so I'm not sure that converting to unwritten would always preserve the
> rtextsize alignment.
>
> [1]. https://lore.kernel.org/linux-xfs/159950168085.582172.4254559621934598919.stgit@magnolia/
>
>>
>> (Or if there's something else?)
>>
>>> And IIUC, the upcoming xfs force alignment
>>> extent feature seems also need to follow this alignment, right?
>>
>> Yes.
>>
>>>>
>>>>> xfs_truncate_page() should flush, zeros out the entire rtextsize range,
>>>>> and make sure the entire zeroed range have been flushed to disk before
>>>>> updating the inode size.
>>>>>
>>>>> Fixes: 943bc0882ceb ("iomap: don't increase i_size if it's not a write operation")
>>>>> Reported-by: Chandan Babu R <chandanbabu@xxxxxxxxxx>
>>>>> Link: https://lore.kernel.org/linux-xfs/0b92a215-9d9b-3788-4504-a520778953c2@xxxxxxxxxxxxxxx
>>>>> Signed-off-by: Zhang Yi <yi.zhang@xxxxxxxxxx>
>>>>> ---
>>>>> fs/xfs/xfs_iomap.c | 35 +++++++++++++++++++++++++++++++----
>>>>> fs/xfs/xfs_iops.c | 10 ----------
>>>>> 2 files changed, 31 insertions(+), 14 deletions(-)
>>>>>
>>>>> diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
>>>>> index 4958cc3337bc..fc379450fe74 100644
>>>>> --- a/fs/xfs/xfs_iomap.c
>>>>> +++ b/fs/xfs/xfs_iomap.c
>>>>> @@ -1466,12 +1466,39 @@ xfs_truncate_page(
>>>>> loff_t pos,
>>>>> bool *did_zero)
>>>>> {
>>>>> + struct xfs_mount *mp = ip->i_mount;
>>>>> struct inode *inode = VFS_I(ip);
>>>>> unsigned int blocksize = i_blocksize(inode);
>>>>> + int error;
>>>>> +
>>>>> + if (XFS_IS_REALTIME_INODE(ip))
>>>>> + blocksize = XFS_FSB_TO_B(mp, mp->m_sb.sb_rextsize);
>>>>
>>>> Don't opencode xfs_inode_alloc_unitsize, please.
>>>
>>> Ha, I missed the latest added helper, thanks for pointing this out.
>>>
>>>>
>>>>> +
>>>>> + /*
>>>>> + * iomap won't detect a dirty page over an unwritten block (or a
>>>>> + * cow block over a hole) and subsequently skips zeroing the
>>>>> + * newly post-EOF portion of the page. Flush the new EOF to
>>>>> + * convert the block before the pagecache truncate.
>>>>> + */
>>>>> + error = filemap_write_and_wait_range(inode->i_mapping, pos,
>>>>> + roundup_64(pos, blocksize));
>>>>> + if (error)
>>>>> + return error;pos_in_block
>>>>
>>>> Ok so this is hoisting the filemap_write_and_wait_range call from
>>>> xfs_setattr_size. It's curious that we need to need to twiddle anything
>>>> other than the EOF block itself though?
>>>
>>> Since we planed to zero out the dirtied range which is ailgned to the
>>> extsize instead of the blocksize, ensure one block is not unwritten is
>>> not enough, we should also make sure that the range which is going to
>>> zero out is not unwritten, or else the iomap_zero_iter() will skip
>>> zeroing out the extra blocks.
>>>
>>> For example:
>>>
>>> before zeroing:
>>> |<- extszie ->|
>>> ...dddddddddddddddddddd
>>> ...UUUUUUUUUUUUUUUUUUUU
>>> ^ ^
>>> new EOF old EOF (where 'd' means the pagecache is dirty)
>>>
>>> if we only flush the new EOF block, the result becomes:
>>>
>>> |<- extszie ->|
>>> zddddddddddddddddddd
>>> ZUUUUUUUUUUUUUUUUUUU
>>> ^ ^
>>> new EOF old EOF
>>>
>>>
>>> then the dirty extent range that between new EOF block and the old EOF
>>> block can't be zeroed sine it's still unwritten. So we have to flush the
>>> whole range before zeroing out.
>>
>> "Z" on the second line of the second diagram is a written fsblock with
>> the tail zeroed, correct?
>
> Yeah,
>
>>
>> truncate_setsize -> truncate_pagecache unmaps all the pagecache after
>> the eof folio and unconditionally zeroes the tail of the eof folio
>> without regard to the mappings. Doesn't that cover us here? After the
>> truncate_setsize finishes, won't we end up in this state:
>>
>> |<- rextsize ->|
>> zzzzzzzz
>> ZUUUUUUUUUUUUUUUUUUU
>> ^ ^ ^
>> new EOF | old EOF
>> folio boundary
>>
>
> Yeah, this case is fine, but the below case is not fine.
>
> truncate write back
> xfs_setattr_size()
> xfs_truncate_page()
> filemap_write_and_wait_range(newsize, newsize) <- A
> iomap_zero_range() <- B
> flush dirty pages <- C
> truncate_setsize() <- D
>
> Please assume if a concurrent write back happenes just before
> truncate_setsize(), the state of the file changes as below:
>
> A:
> |<- extszie ->|
> wddddddddddddddddddd (pagecache)
> WUUUUUUUUUUUUUUUUUUU (disk)
> ^ ^
> (new EOF) old EOF (where 'd' means the pagecache is dirty)
> (where 'x' means the pagecache contianes user data)
^^^
w
>
> B:
> |<- extszie ->|
> zddddddddddddddddddd
> ZUUUUUUUUUUUUUUUUUUU
> ^ ^
> (new EOF) old EOF (where 'z' means the pagecache is zero)
>
> C:
> |<- extszie ->|
> zwwwwwwwwwwwwwwwwwww
> ZWWWWWWWWWWWWWWWWWWW
> ^ ^
> (new EOF) old EOF
>
> D:
> |<- extszie ->|
> zzzzzzzzz
> ZWWWWWWWWWWWWWWWWWWW
> ^ ^ ^
> new EOF | (old EOF)
> folio boundary
>
>
> Thanks,
> Yi.