Re: [PATCH v4 06/23] ext4: pass out extent seq counter when mapping da blocks
From: Zhang Yi
Date: Wed Jun 17 2026 - 21:43:43 EST
On 6/18/2026 4:29 AM, Jan Kara wrote:
> On Tue 16-06-26 20:37:00, Zhang Yi wrote:
>> On 6/16/2026 6:04 PM, Jan Kara wrote:
>>> On Mon 11-05-26 15:23:26, Zhang Yi wrote:
>>>> From: Zhang Yi <yi.zhang@xxxxxxxxxx>
>>>>
>>>> The iomap buffered write path does not hold any locks between querying
>>>> inode extent mapping information and performing buffered writes. It
>>>> relies on the sequence counter saved in the inode to detect stale
>>>> mappings.
>>>
>>> Now that I'm looking at it again, I've got a bit confused here. Buffered
>>> write path is holding i_rwsem between mapping blocks and using them so
>>> there shouldn't be races. Perhaps you mean buffered *writeback* path? But
>>> then ext4_da_map_blocks() should not ever get called in the writeback path
>>> because it is allocating delayed blocks... So this change looks unnecessary
>>> to me now. Am I missing something?
>>>
>>> Honza
>>>
>>
>> Hi Jan,
>>
>> Thanks for coming back to this series. Sorry for the inaccurate
>> description in the commit message. However, this change is still needed.
>>
>> As mentioned in the comment before the ->iomap_valid() callback in
>> iomap_write_begin(), consider the following scenario — a buffered write
>> to a dirty unwritten extent, with this concurrent race:
>>
>> Buffer write (holds i_rwsem) Writeback (no i_rwsem)
>> ext4_da_map_blocks()
>> // ext4_es_lookup_extent()
>> // finds UNWRITTEN extent
>> ext4_set_iomap()
>> // type = IOMAP_UNWRITTEN
>> // validity_cookie = m_seq
>> ext4_iomap_writepages()
>> // writeback for unwritten extent
>> // ext4_convert_unwritten_extents()
>> // extent tree: UNWRITTEN → WRITTEN
>> // advancing i_es_seq
>> __iomap_write_begin()
>> // ext4_iomap_valid(): cookie != i_es_seq
>> // iomap invalidated, re-maps
>> // gets type = IOMAP_MAPPED (WRITTEN)
>> // iomap_block_needs_zeroing(): FALSE
>>
>> Without passing out m_seq, the stale IOMAP_UNWRITTEN type from the iomap
>> would cause __iomap_write_begin()->iomap_block_needs_zeroing() to zero
>> out blocks that have already been written, corrupting data on partial
>> writes.
>
> Ah, right, thanks for explanation. So please update the description to
> explicitely mention that iomap doesn't hold folio lock between mapping the
> extent and copying data and thus can race with writeback modifying the
> extent type. Thanks!
>
> Honza
Sure, thanks for pointing this out.
Cheers,
Yi.
>
>>
>> Thanks,
>> Yi
>>
>>>>
>>>> Commit 07c440e8da8f ("ext4: pass out extent seq counter when mapping
>>>> blocks") added the m_seq field to ext4_map_blocks to pass out extent
>>>> sequence numbers, but it missed two callsites within
>>>> ext4_da_map_blocks(). These callsites are on the delayed allocation
>>>> path, which is also used in the iomap buffered write path. Pass out the
>>>> sequence counter to ensure stale mappings can be detected.
>>>>
>>>> Signed-off-by: Zhang Yi <yi.zhang@xxxxxxxxxx>
>>>> Reviewed-by: Jan Kara <jack@xxxxxxx>
>>>> ---
>>>> fs/ext4/inode.c | 4 ++--
>>>> 1 file changed, 2 insertions(+), 2 deletions(-)
>>>>
>>>> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
>>>> index 6c4d9137b279..39577a6b65b9 100644
>>>> --- a/fs/ext4/inode.c
>>>> +++ b/fs/ext4/inode.c
>>>> @@ -1929,7 +1929,7 @@ static int ext4_da_map_blocks(struct inode *inode, struct ext4_map_blocks *map)
>>>> ext4_check_map_extents_env(inode);
>>>> /* Lookup extent status tree firstly */
>>>> - if (ext4_es_lookup_extent(inode, map->m_lblk, NULL, &es, NULL)) {
>>>> + if (ext4_es_lookup_extent(inode, map->m_lblk, NULL, &es, &map->m_seq)) {
>>>> map->m_len = min_t(unsigned int, map->m_len,
>>>> es.es_len - (map->m_lblk - es.es_lblk));
>>>> @@ -1982,7 +1982,7 @@ static int ext4_da_map_blocks(struct inode *inode, struct ext4_map_blocks *map)
>>>> * is held in write mode, before inserting a new da entry in
>>>> * the extent status tree.
>>>> */
>>>> - if (ext4_es_lookup_extent(inode, map->m_lblk, NULL, &es, NULL)) {
>>>> + if (ext4_es_lookup_extent(inode, map->m_lblk, NULL, &es, &map->m_seq)) {
>>>> map->m_len = min_t(unsigned int, map->m_len,
>>>> es.es_len - (map->m_lblk - es.es_lblk));
>>>> --
>>>> 2.52.0
>>>>
>>