Re: [PATCH v4 02/34] ext4: check the extent status again before inserting delalloc block

From: IBM
Date: Fri Apr 26 2024 - 08:57:59 EST


Ritesh Harjani (IBM) <ritesh.list@xxxxxxxxx> writes:

> Zhang Yi <yi.zhang@xxxxxxxxxxxxxxx> writes:
>
>> From: Zhang Yi <yi.zhang@xxxxxxxxxx>
>>
>> Now we lookup extent status entry without holding the i_data_sem before
>> inserting delalloc block, it works fine in buffered write path and
>> because it holds i_rwsem and folio lock, and the mmap path holds folio
>> lock, so the found extent locklessly couldn't be modified concurrently.
>> But it could be raced by fallocate since it allocate block whitout
>> holding i_rwsem and folio lock.
>>
>> ext4_page_mkwrite() ext4_fallocate()
>> block_page_mkwrite()
>> ext4_da_map_blocks()
>> //find hole in extent status tree
>> ext4_alloc_file_blocks()
>> ext4_map_blocks()
>> //allocate block and unwritten extent
>> ext4_insert_delayed_block()
>> ext4_da_reserve_space()
>> //reserve one more block
>> ext4_es_insert_delayed_block()
>> //drop unwritten extent and add delayed extent by mistake
>>
>> Then, the delalloc extent is wrong until writeback, the one more
>> reserved block can't be release any more and trigger below warning:
>>
>> EXT4-fs (pmem2): Inode 13 (00000000bbbd4d23): i_reserved_data_blocks(1) not cleared!
>>
>> Hold i_data_sem in write mode directly can fix the problem, but it's
>> expansive, we should keep the lockless check and check the extent again
>> once we need to add an new delalloc block.
>
> Hi Zhang,
>
> It's a nice finding. I was wondering if this was caught in any of the
> xfstests?
>
> I have reworded some of the commit message, feel free to use it if you
> think this version is better. The use of which path uses which locks was
> a bit confusing in the original commit message.
>
> <reworded from your original commit msg>
>
> ext4_da_map_blocks(), first looks up the extent status tree for any
> extent entry with i_data_sem held in read mode. It then unlocks
> i_data_sem, if it can't find an entry and take this lock in write
> mode for inserting a new da entry.

Sorry about this above paragraph. I messed this paragraph.
Here is the correct version of this.

ext4_da_map_blocks looks up for any extent entry in the extent status
tree (w/o i_data_sem) and then the looks up for any ondisk extent
mapping (with i_data_sem in read mode).

If it finds a hole in the extent status tree or if it couldn't find any
entry at all, it then takes the i_data_sem in write mode to add a da entry
into the extent status tree. This can actually race with page mkwrite
& fallocate path.

Note that this is ok between... <and the rest can remain same>

>
> This is ok between -
> 1. ext4 buffered-write path v/s ext4_page_mkwrite(), because of the
> folio lock
> 2. ext4 buffered write path v/s ext4 fallocate because of the inode
> lock.
>


> But this can race between ext4_page_mkwrite() & ext4 fallocate path -
>
> ext4_page_mkwrite() ext4_fallocate()
> block_page_mkwrite()
> ext4_da_map_blocks()
> //find hole in extent status tree
> ext4_alloc_file_blocks()
> ext4_map_blocks()
> //allocate block and unwritten extent
> ext4_insert_delayed_block()
> ext4_da_reserve_space()
> //reserve one more block
> ext4_es_insert_delayed_block()
> //drop unwritten extent and add delayed extent by mistake
>
> Then, the delalloc extent is wrong until writeback and the extra
> reserved block can't be released any more and it triggers below warning:
>
> EXT4-fs (pmem2): Inode 13 (00000000bbbd4d23): i_reserved_data_blocks(1) not cleared!
>
> This patch fixes the problem by looking up extent status tree again
> while the i_data_sem is held in write mode. If it still can't find
> any entry, then we insert a new da entry into the extent status tree.
>
>>
>> Cc: stable@xxxxxxxxxxxxxxx
>> Signed-off-by: Zhang Yi <yi.zhang@xxxxxxxxxx>
>> ---
>> fs/ext4/inode.c | 19 +++++++++++++++++++
>> 1 file changed, 19 insertions(+)
>>
>> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
>> index 6a41172c06e1..118b0497a954 100644
>> --- a/fs/ext4/inode.c
>> +++ b/fs/ext4/inode.c
>> @@ -1737,6 +1737,7 @@ static int ext4_da_map_blocks(struct inode *inode, sector_t iblock,
>> if (ext4_es_is_hole(&es))
>> goto add_delayed;
>>
>> +found:
>> /*
>> * Delayed extent could be allocated by fallocate.
>> * So we need to check it.
>> @@ -1781,6 +1782,24 @@ static int ext4_da_map_blocks(struct inode *inode, sector_t iblock,
>>
>> add_delayed:
>> down_write(&EXT4_I(inode)->i_data_sem);
>> + /*
>> + * Lookup extents tree again under i_data_sem, make sure this
>> + * inserting delalloc range haven't been delayed or allocated
>> + * whitout holding i_rwsem and folio lock.
>> + */
>
> page fault path (ext4_page_mkwrite does not take i_rwsem) and fallocate
> path (no folio lock) can race. Make sure we lookup the extent status
> tree here again while i_data_sem is held in write mode, before inserting
> a new da entry in the extent status tree.
>
>


-ritesh