Re: [PATCH 12/27] ext4: introduce seq counter for the extent status entry

From: Zhang Yi
Date: Wed Dec 11 2024 - 21:32:26 EST


On 2024/12/12 0:00, Jan Kara wrote:
> On Wed 11-12-24 15:59:51, Zhang Yi wrote:
>> On 2024/12/10 20:57, Jan Kara wrote:
>>> On Mon 09-12-24 16:32:41, Zhang Yi wrote:
>>> b) evict all page cache in the affected range -> should stop writeback -
>>> *but* currently there's one case which could be problematic. Assume we
>>> do punch hole 0..N and the page at N+1 is dirty. Punch hole does all of
>>> the above and starts removing blocks, needs to restart transaction so it
>>> drops i_data_sem. Writeback starts for page N+1, needs to load extent
>>> block into memory, ext4_cache_extents() now loads back some extents
>>> covering range 0..N into extent status tree.
>>
>> This completely confuses me. Do you mention the case below,
>>
>> There are many extent entries in the page range 0..N+1, for example,
>>
>> 0 N N+1
>> | |/
>> [www][wwwwww][wwwwwwww]...[wwwww][wwww]...
>> | |
>> N-a N-b
>>
>> Punch hole is removing each extent entries from N..0
>> (ext4_ext_remove_space() removes blocks from end to start), and could
>> drop i_data_sem just after manipulating(removing) the extent entry
>> [N-a,N-b], At the same time, a concurrent writeback start write back
>> page N+1 since the writeback only hold page lock, doesn't hold i_rwsem
>> and invalidate_lock. It may load back the extents 0..N-a into the
>> extent status tree again while finding extent that contains N+1?
>
> Yes, because when we load extents from extent tree, we insert all extents
> from the leaf of the extent tree into extent status tree. That's what
> ext4_cache_extents() call does.
>
>> Finally it may left some stale extent status entries after punch hole
>> is done?
>
> Yes, there may be stale extents in the extent status tree when
> ext4_ext_remove_space() returns. But punch hole in particular then does:
>
> ext4_es_insert_extent(inode, first_block, hole_len, ~0,
> EXTENT_STATUS_HOLE, 0);
>
> which overwrites these stale extents with appropriate information.
>

Yes, that's correct! I missed this insert yesterday. It looks fine now,
as it holds the i_rwsem and invalidate_lock, and has evicted the page
cache in this case. Thanks a lot for your detail explanation. I will
add these document in my next iteration.

Thanks!
Yi.

>> If my understanding is correct, isn't that a problem that exists now?
>> I mean without this patch series.
>
> Yes, the situation isn't really related to your patches. But with your
> patches we are starting to rely even more on extent status tree vs extent
> tree consistecy. So I wanted to spell out this situation to verify new
> problem isn't introduced and so that we create rules that handle this
> situation well.
>
>>> So the only protection
>>> against using freed blocks is that nobody should be mapping anything in
>>> the range 0..N because we hold those locks & have evicted page cache.
>>>
>>> So I think we need to also document, that anybody mapping blocks needs to
>>> hold i_rwsem or invalidate_lock or a page lock, ideally asserting that in
>>> ext4_map_blocks() to catch cases we missed. Asserting for page lock will
>>> not be really doable but luckily only page writeback needs that so that can
>>> get some extemption from the assert.
>>
>> In the case above, it seems that merely holding a page lock is
>> insufficient?
>
> Well, holding page lock(s) for the range you are operating on is enough to
> make sure there cannot be parallel operations on that range like truncate,
> punch hole or similar, because they always remove the page cache before
> starting their work and because they hold invalidate_lock, new pages cannot
> be created while they are working.
>
> Honza