Re: [PATCH 01/27] ext4: remove writable userspace mappings before truncating page cache

From: Zhang Yi
Date: Fri Dec 06 2024 - 03:00:07 EST


On 2024/12/4 19:13, Jan Kara wrote:
> I'm sorry for the huge delay here...
>
It's fine, I know you're probably been busy lately, and this series has
undergone significant modifications, which should require considerable
time for review. Thanks a lot for taking time to review this series!

> On Tue 22-10-24 19:10:32, Zhang Yi wrote:
>> From: Zhang Yi <yi.zhang@xxxxxxxxxx>
>>
>> When zeroing a range of folios on the filesystem which block size is
>> less than the page size, the file's mapped partial blocks within one
>> page will be marked as unwritten, we should remove writable userspace
>> mappings to ensure that ext4_page_mkwrite() can be called during
>> subsequent write access to these folios. Otherwise, data written by
>> subsequent mmap writes may not be saved to disk.
>>
>> $mkfs.ext4 -b 1024 /dev/vdb
>> $mount /dev/vdb /mnt
>> $xfs_io -t -f -c "pwrite -S 0x58 0 4096" -c "mmap -rw 0 4096" \
>> -c "mwrite -S 0x5a 2048 2048" -c "fzero 2048 2048" \
>> -c "mwrite -S 0x59 2048 2048" -c "close" /mnt/foo
>>
>> $od -Ax -t x1z /mnt/foo
>> 000000 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58
>> *
>> 000800 59 59 59 59 59 59 59 59 59 59 59 59 59 59 59 59
>> *
>> 001000
>>
>> $umount /mnt && mount /dev/vdb /mnt
>> $od -Ax -t x1z /mnt/foo
>> 000000 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58
>> *
>> 000800 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>> *
>> 001000
>>
>> Signed-off-by: Zhang Yi <yi.zhang@xxxxxxxxxx>
>
> This is a great catch! I think this may be source of the sporadic data
> corruption issues we observe with blocksize < pagesize.
>
>> +static inline void ext4_truncate_folio(struct inode *inode,
>> + loff_t start, loff_t end)
>> +{
>> + unsigned long blocksize = i_blocksize(inode);
>> + struct folio *folio;
>> +
>> + if (round_up(start, blocksize) >= round_down(end, blocksize))
>> + return;
>> +
>> + folio = filemap_lock_folio(inode->i_mapping, start >> PAGE_SHIFT);
>> + if (IS_ERR(folio))
>> + return;
>> +
>> + if (folio_mkclean(folio))
>> + folio_mark_dirty(folio);
>> + folio_unlock(folio);
>> + folio_put(folio);
>
> I don't think this is enough. In your example from the changelog, this would
> leave the page at index 0 dirty and still with 0x5a values in 2048-4096 range.
> Then truncate_pagecache_range() does nothing, ext4_alloc_file_blocks()
> converts blocks under 2048-4096 to unwritten state. But what handles
> zeroing of page cache in 2048-4096 range? ext4_zero_partial_blocks() zeroes
> only partial blocks, not full blocks. Am I missing something?
>

Sorry, I don't understand why truncate_pagecache_range() does nothing? In my
example, the variable 'start' is 2048, the variable 'end' is 4096, and the
call process truncate_pagecache_range(inode, 2048, 4096-1)->..->
truncate_inode_partial_folio()->folio_zero_range() does zeroing the 2048-4096
range. I also tested it below, it was zeroed.

xfs_io -t -f -c "pwrite -S 0x58 0 4096" -c "mmap -rw 0 4096" \
-c "mwrite -S 0x5a 2048 2048" \
-c "fzero 2048 2048" -c "close" /mnt/foo

od -Ax -t x1z /mnt/foo
000000 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58 >XXXXXXXXXXXXXXXX<
*
000800 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 >................<
*
001000

> If I'm right, I'd keep it simple and just writeout these partial folios with
> filemap_write_and_wait_range() and expand the range
> truncate_pagecache_range() removes to include these partial folios. The

What I mean is the truncate_pagecache_range() has already covered the partial
folios. right?

> overhead won't be big and it isn't like this is some very performance
> sensitive path.
>
>> +}
>> +
>> +/*
>> + * When truncating a range of folios, if the block size is less than the
>> + * page size, the file's mapped partial blocks within one page could be
>> + * freed or converted to unwritten. We should call this function to remove
>> + * writable userspace mappings so that ext4_page_mkwrite() can be called
>> + * during subsequent write access to these folios.
>> + */
>> +void ext4_truncate_folios_range(struct inode *inode, loff_t start, loff_t end)
>
> Maybe call this ext4_truncate_page_cache_block_range()? And assert that
> start & end are block aligned. Because this essentially prepares page cache
> for manipulation with a block range.

Ha, it's a good idea, I agree with you that move truncate_pagecache_range()
and the hunk of flushing in journal data mode into this function. But I don't
understand why assert that 'start & end' are block aligned? I think
ext4_truncate_page_cache_block_range() should allow passing unaligned input
parameters and aligned them itself, especially, after patch 04 and 05,
ext4_zero_range() and ext4_punch_hole() will passing offset and offset+len
directly, which may block unaligned.

Thanks,
Yi.

>
>> +{
>> + unsigned long blocksize = i_blocksize(inode);
>> +
>> + if (end > inode->i_size)
>> + end = inode->i_size;
>> + if (start >= end || blocksize >= PAGE_SIZE)
>> + return;
>> +
>> + ext4_truncate_folio(inode, start, min(round_up(start, PAGE_SIZE), end));
>> + if (end > round_up(start, PAGE_SIZE))
>> + ext4_truncate_folio(inode, round_down(end, PAGE_SIZE), end);
>> +}
>
> So I'd move the following truncate_pagecache_range() into
> ext4_truncate_folios_range(). And also the preceding:
>
> /*
> * For journalled data we need to write (and checkpoint) pages
> * before discarding page cache to avoid inconsitent data on
> * disk in case of crash before zeroing trans is committed.
> */
> if (ext4_should_journal_data(inode)) {
> ret = filemap_write_and_wait_range(mapping, start,
> end - 1);
> ...
>
> into this function. So that it can be self-contained "do the right thing
> with page cache to prepare for block range manipulations".
>
> Honza