Re: [PATCH v4 02/34] ext4: check the extent status again before inserting delalloc block

From: IBM
Date: Fri Apr 26 2024 - 12:39:44 EST


Zhang Yi <yi.zhang@xxxxxxxxxxxxxxx> writes:

> On 2024/4/26 20:57, Ritesh Harjani (IBM) wrote:
>> Ritesh Harjani (IBM) <ritesh.list@xxxxxxxxx> writes:
>>
>>> Zhang Yi <yi.zhang@xxxxxxxxxxxxxxx> writes:
>>>
>>>> From: Zhang Yi <yi.zhang@xxxxxxxxxx>
>>>>
>>>> Now we lookup extent status entry without holding the i_data_sem before
>>>> inserting delalloc block, it works fine in buffered write path and
>>>> because it holds i_rwsem and folio lock, and the mmap path holds folio
>>>> lock, so the found extent locklessly couldn't be modified concurrently.
>>>> But it could be raced by fallocate since it allocate block whitout
>>>> holding i_rwsem and folio lock.
>>>>
>>>> ext4_page_mkwrite() ext4_fallocate()
>>>> block_page_mkwrite()
>>>> ext4_da_map_blocks()
>>>> //find hole in extent status tree
>>>> ext4_alloc_file_blocks()
>>>> ext4_map_blocks()
>>>> //allocate block and unwritten extent
>>>> ext4_insert_delayed_block()
>>>> ext4_da_reserve_space()
>>>> //reserve one more block
>>>> ext4_es_insert_delayed_block()
>>>> //drop unwritten extent and add delayed extent by mistake
>>>>
>>>> Then, the delalloc extent is wrong until writeback, the one more
>>>> reserved block can't be release any more and trigger below warning:
>>>>
>>>> EXT4-fs (pmem2): Inode 13 (00000000bbbd4d23): i_reserved_data_blocks(1) not cleared!
>>>>
>>>> Hold i_data_sem in write mode directly can fix the problem, but it's
>>>> expansive, we should keep the lockless check and check the extent again
>>>> once we need to add an new delalloc block.
>>>
>>> Hi Zhang,
>>>
>>> It's a nice finding. I was wondering if this was caught in any of the
>>> xfstests?
>>>
>
> Hi, Ritesh
>
> I caught this issue when I tested my iomap series in generic/344 and
> generic/346. It's easy to reproduce because the iomap's buffered write path
> doesn't hold folio lock while inserting delalloc blocks, so it could be raced
> by the mmap page fault path. But the buffer_head's buffered write path can't
> trigger this problem,

ya right! That's the difference between how ->map_blocks() is called
between buffer_head v/s iomap path. In iomap the ->map_blocks() call
happens first to map a large extent and then it iterate over all the
locked folios covering the mapped extent for doing writes.
Whereas in buffer_head while iterating, we first instantiate/lock the
folio and then call ->map_blocks() to map an extent for the given folio.

.. So this opens up this window for a race between iomap buffered write
path v/s page mkwrite path for inserting delalloc blocks entries.

> the race between buffered write path and fallocate path
> was discovered while I was analyzing the code, so I'm not sure if it could
> be caught by xfstests now, at least I haven't noticed this problem so far.
>

Did you mean the race between page fault path and fallocate path here?
Because buffered write path and fallocate path should not have any race
since both takes the inode_lock. I guess you meant page fault path and
fallocate path for which you wrote this patch too :)

I am surprised, why we cannot see the this race between page mkwrite and
fallocate in fstests for inserting da entries to extent status cache.
Because the race you identified looks like a legitimate race and is
mostly happening since ext4_da_map_blocks() was not doing the right
thing.
.. looking at the src/holetest, it doesn't really excercise this path.
So maybe we can writing such fstest to trigger this race.


>>> I have reworded some of the commit message, feel free to use it if you
>>> think this version is better. The use of which path uses which locks was
>>> a bit confusing in the original commit message.
>>>
>
> Thanks for the message improvement, it looks more clear then mine, I will
> use it.
>

Glad, it was helpful.

-ritesh