Re: [PATCH v4 08/23] ext4: implement buffered write path using iomap
From: Zhang Yi
Date: Tue Jun 02 2026 - 22:58:09 EST
On 6/2/2026 6:26 PM, Ojaswin Mujoo wrote:
> On Mon, May 11, 2026 at 03:23:28PM +0800, Zhang Yi wrote:
>> From: Zhang Yi <yi.zhang@xxxxxxxxxx>
>>
>> Introduce two new iomap_ops instances for ext4 buffered writes:
>>
>> - ext4_iomap_buffered_da_write_ops: for delayed allocation mode, using
>> ext4_da_map_blocks() to map delalloc extents.
>> - ext4_iomap_buffered_write_ops: for non-delayed allocation mode, using
>> ext4_iomap_get_blocks() to directly allocate blocks.
>>
>> Also add ext4_iomap_valid() for the iomap infrastructure to check extent
>> validity.
>>
>> Key changes and considerations:
>>
>> - Unwritten extents for new blocks (dioread_nolock always on)
>> Since data=ordered mode is not used to prevent stale data exposure in
>> the non-delayed allocation path, new blocks are always allocated as
>> unwritten extents.
>>
>> - Short write and write failure handling
>> a. Delalloc path: On short write or failure, the stale delalloc range
>> must be dropped and its space reservation released. Otherwise, a
>> clean folio may cover leftover delalloc extents, causing
>> inaccurate space reservation accounting.
>> b. Non-delalloc path: No cleanup of allocated blocks is needed on
>> short write.
>>
>> - Lock ordering reversal
>> The folio lock and transaction start ordering is reversed compared to
>> the buffer_head buffered write path. To handle this, the journal
>> handle must be stopped in iomap_begin() callbacks. The lock ordering
>> documentation in super.c has been updated accordingly.
>>
>> Signed-off-by: Zhang Yi <yi.zhang@xxxxxxxxxx>
>
> I went through this again and after our discussion the changes looks
> okay. Just a small quesiton below but otherwise feel free to add:
>
> Reviewed-by: Ojaswin Mujoo <ojaswin@xxxxxxxxxxxxx>
Thank you a lot for your careful review!
>
>> ---
>> fs/ext4/ext4.h | 4 ++
>> fs/ext4/file.c | 20 +++++-
>> fs/ext4/inode.c | 165 +++++++++++++++++++++++++++++++++++++++++++++++-
>> fs/ext4/super.c | 10 ++-
>> 4 files changed, 192 insertions(+), 7 deletions(-)
>>
>> diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
>> index 1e27d73d7427..4832e7f7db82 100644
>> --- a/fs/ext4/ext4.h
>> +++ b/fs/ext4/ext4.h
>> @@ -3057,6 +3057,7 @@ int ext4_walk_page_buffers(handle_t *handle,
>> int do_journal_get_write_access(handle_t *handle, struct inode *inode,
>> struct buffer_head *bh);
>> void ext4_set_inode_mapping_order(struct inode *inode);
>> +int ext4_nonda_switch(struct super_block *sb);
>> #define FALL_BACK_TO_NONDELALLOC 1
>> #define CONVERT_INLINE_DATA 2
>
> <snip>
>
>> + ext4_set_iomap(inode, iomap, &map, offset, length, flags);
>> + return 0;
>> +}
>> +
>> +static int ext4_iomap_buffered_write_begin(struct inode *inode,
>> + loff_t offset, loff_t length, unsigned int flags,
>> + struct iomap *iomap, struct iomap *srcmap)
>> +{
>> + return ext4_iomap_buffered_do_write_begin(inode, offset, length, flags,
>> + iomap, srcmap, false);
>> +}
>> +
>> +static int ext4_iomap_buffered_da_write_begin(struct inode *inode,
>> + loff_t offset, loff_t length, unsigned int flags,
>> + struct iomap *iomap, struct iomap *srcmap)
>> +{
>> + return ext4_iomap_buffered_do_write_begin(inode, offset, length, flags,
>> + iomap, srcmap, true);
>> +}
>> +
>> +/*
>> + * On write failure, drop the stale delayed allocation range and release
>> + * its reserved space for both start and end blocks. Otherwise, we may
>> + * leave a range of delayed extents covered by a clean folio, which can
>> + * result in inaccurate space reservation accounting.
>> + */
>> +static void ext4_iomap_punch_delalloc(struct inode *inode, loff_t offset,
>> + loff_t length, struct iomap *iomap)
>> +{
>> + down_write(&EXT4_I(inode)->i_data_sem);
>> + ext4_es_remove_extent(inode, offset >> inode->i_blkbits,
>> + DIV_ROUND_UP_ULL(length, EXT4_BLOCK_SIZE(inode->i_sb)));
>> + up_write(&EXT4_I(inode)->i_data_sem);
>> +}
>> +
>> +static int ext4_iomap_buffered_da_write_end(struct inode *inode, loff_t offset,
>> + loff_t length, ssize_t written,
>> + unsigned int flags,
>> + struct iomap *iomap)
>> +{
>> + loff_t start_byte, end_byte;
>> +
>> + /* If we didn't reserve the blocks, we're not allowed to punch them. */
>> + if (iomap->type != IOMAP_DELALLOC || !(iomap->flags & IOMAP_F_NEW))
>
> Will we ever get IOMAP_F_NEW here? I think the da_write_begin() call
> either creates a new IOMAP_DELALLOC extent or finds older ones which
> won't have EXT4_MAP_NEW set
>
Oops. This is a bug! In ext4_da_map_blocks(), when allocating a new
delalloc extent, the EXT4_MAP_NEW flag should be set. If this flag is
not set, then when a short write occurs, we cannot distinguish whether
an extent is a pre-existing delalloc extent or a newly allocated one.
This prevents the subsequent truncate operation from being executed,
leaving the newly allocated delalloc extent behind. I will fix this in
next iteration.
Thanks,
Yi.