Re: [PATCH v4 09/23] ext4: implement writeback path using iomap

From: Ojaswin Mujoo

Date: Wed May 27 2026 - 08:50:40 EST

On Mon, May 11, 2026 at 03:23:29PM +0800, Zhang Yi wrote:
> From: Zhang Yi <yi.zhang@xxxxxxxxxx>
>
> Add the iomap writeback path for ext4 buffered I/O. This introduces:
>
> - ext4_iomap_writepages(): the main writeback entry point.
> - ext4_writeback_ops: a new iomap_writeback_ops instance to handle
> block mapping and I/O submission.
> - A new end I/O worker for converting unwritten extents, updating file
> size, and handling DATA_ERR_ABORT after I/O completion.
>
> Core implementation details:
>
> - ->writeback_range() callback
> Calls ext4_iomap_map_writeback_range() to query the longest range of
> existing mapped extents. For performance, when a block range is not
> yet allocated, it allocates based on the writeback length and delalloc
> extent length, rather than allocating for a single folio at a time.
> The folio is then added to an iomap_ioend instance.
>
> - ->writeback_submit() callback
> Registers ext4_iomap_end_bio() as the end bio callback. This callback
> schedules a worker to handle:
> - Unwritten extent conversion.
> - i_disksize update after data is written back.
> - Journal abort on writeback I/O failure.

Hi Zhang, the changes look good. I have a few comments below:
>
> Key changes and considerations:
>
> - Append write and unwritten extents
> Since data=ordered mode is not used to prevent stale data exposure
> during append writebacks, new blocks are always allocated as unwritten
> extents (i.e. always enable dioread_nolock), and i_disksize update is
> postponed until I/O completion.

Makes sense.

> Additionally, the deadlock that the
> reserve handle was expected to resolve does not occur anymore.

I guess this is since we don't use ordered data so we can't block on
starting a txn in end io.

> Therefore, the end I/O worker can start a normal journal handle
> instead of a reserve handle when converting unwritten extents.
>
> - Lock ordering
> The ->writeback_range() callback runs under the folio lock, requiring
> the journal handle to be started under that same lock. This reverses
> the order compared to the buffer_head writeback path. The lock ordering
> documentation in super.c has been updated accordingly.
>
> Signed-off-by: Zhang Yi <yi.zhang@xxxxxxxxxx>
> ---
> fs/ext4/ext4.h | 4 +
> fs/ext4/inode.c | 208 +++++++++++++++++++++++++++++++++++++++++-
> fs/ext4/page-io.c | 126 +++++++++++++++++++++++++
> fs/ext4/super.c | 7 +-
> fs/iomap/ioend.c | 3 +-
> include/linux/iomap.h | 1 +
> 6 files changed, 346 insertions(+), 3 deletions(-)
>
> diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
> index 4832e7f7db82..078feda47e36 100644
> --- a/fs/ext4/ext4.h
> +++ b/fs/ext4/ext4.h
> @@ -1173,6 +1173,8 @@ struct ext4_inode_info {
> */
> struct list_head i_rsv_conversion_list;
> struct work_struct i_rsv_conversion_work;
> + struct list_head i_iomap_ioend_list;
> + struct work_struct i_iomap_ioend_work;
>
> /*
> * Transactions that contain inode's metadata needed to complete
> @@ -3870,6 +3872,8 @@ int ext4_bio_write_folio(struct ext4_io_submit *io, struct folio *page,
> size_t len);
> extern struct ext4_io_end_vec *ext4_alloc_io_end_vec(ext4_io_end_t *io_end);
> extern struct ext4_io_end_vec *ext4_last_io_end_vec(ext4_io_end_t *io_end);
> +extern void ext4_iomap_end_io(struct work_struct *work);
> +extern void ext4_iomap_end_bio(struct bio *bio);
>
> /* mmp.c */
> extern int ext4_multi_mount_protect(struct super_block *, ext4_fsblk_t);
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index 1ae7d3f4a1c8..a80195bd6f20 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -44,6 +44,7 @@
> #include <linux/iversion.h>
>
> #include "ext4_jbd2.h"
> +#include "ext4_extents.h"
> #include "xattr.h"
> #include "acl.h"
> #include "truncate.h"
> @@ -4120,10 +4121,215 @@ static void ext4_iomap_readahead(struct readahead_control *rac)
> iomap_bio_readahead(rac, &ext4_iomap_buffered_read_ops);
> }
>
> +static int ext4_iomap_map_one_extent(struct inode *inode,
> + struct ext4_map_blocks *map)
> +{
> + struct extent_status es;
> + handle_t *handle = NULL;
> + int credits, map_flags;
> + int retval;
> +
> + credits = ext4_chunk_trans_blocks(inode, map->m_len);
> + handle = ext4_journal_start(inode, EXT4_HT_WRITE_PAGE, credits);
> + if (IS_ERR(handle))
> + return PTR_ERR(handle);
> +
> + map->m_flags = 0;
> + /*
> + * It is necessary to look up extent and map blocks under i_data_sem
> + * in write mode, otherwise, the delalloc extent may become stale
> + * during concurrent truncate operations.
> + */
> + ext4_fc_track_inode(handle, inode);
> + down_write(&EXT4_I(inode)->i_data_sem);
> + if (ext4_es_lookup_extent(inode, map->m_lblk, NULL, &es, &map->m_seq)) {
> + retval = es.es_len - (map->m_lblk - es.es_lblk);
> + map->m_len = min_t(unsigned int, retval, map->m_len);
> +
> + if (ext4_es_is_delayed(&es)) {

I understand that it is okay for us to rely on extent status ==
delayed here because we never reclaim delayed es entries and hence we
are sure to not skip any delayed block allocations here.

> + map->m_flags |= EXT4_MAP_DELAYED;
> + trace_ext4_da_write_pages_extent(inode, map);
> + /*
> + * Call ext4_map_create_blocks() to allocate any
> + * delayed allocation blocks. It is possible that
> + * we're going to need more metadata blocks, however
> + * we must not fail because we're in writeback and
> + * there is nothing we can do so it might result in
> + * data loss. So use reserved blocks to allocate
> + * metadata if possible.
> + */
> + map_flags = EXT4_GET_BLOCKS_CREATE_UNWRIT_EXT |
> + EXT4_GET_BLOCKS_METADATA_NOFAIL |
> + EXT4_EX_NOCACHE;
> +
> + retval = ext4_map_create_blocks(handle, inode, map,
> + map_flags);
> + if (retval > 0)
> + ext4_fc_track_range(handle, inode, map->m_lblk,
> + map->m_lblk + map->m_len - 1);
> + goto out;
> + } else if (unlikely(ext4_es_is_hole(&es)))

Now that you've fixed the partial invalidate in iomap (patch 12/23)
can we still hit this hole case?

> + goto out;
> +
> + /* Found written or unwritten extent. */
> + map->m_pblk = ext4_es_pblock(&es) + map->m_lblk - es.es_lblk;
> + map->m_flags = ext4_es_is_written(&es) ?
> + EXT4_MAP_MAPPED : EXT4_MAP_UNWRITTEN;
> + goto out;
> + }
> +
> + retval = ext4_map_query_blocks(handle, inode, map, EXT4_EX_NOCACHE);
> +out:
> + up_write(&EXT4_I(inode)->i_data_sem);
> + ext4_journal_stop(handle);
> + return retval < 0 ? retval : 0;
> +}
> +
> +static int ext4_iomap_map_writeback_range(struct iomap_writepage_ctx *wpc,
> + loff_t offset, unsigned int dirty_len)
> +{
> + struct inode *inode = wpc->inode;
> + struct super_block *sb = inode->i_sb;
> + struct journal_s *journal = EXT4_SB(sb)->s_journal;
> + struct ext4_map_blocks map;
> + unsigned int blkbits = inode->i_blkbits;
> + unsigned int index = offset >> blkbits;
> + unsigned int blk_end, blk_len;
> + int ret;
> +
> + ret = ext4_emergency_state(sb);
> + if (unlikely(ret))
> + return ret;
> +
> + /* Check validity of the cached writeback mapping. */
> + if (offset >= wpc->iomap.offset &&
> + offset < wpc->iomap.offset + wpc->iomap.length &&
> + ext4_iomap_valid(inode, &wpc->iomap))
> + return 0;
> +
> + blk_len = dirty_len >> blkbits;
> + blk_end = min_t(unsigned int, (wpc->wbc->range_end >> blkbits),
> + (UINT_MAX - 1));

This is an interesting idea. I'm just a bit worried when we have
range_end == LLONG_MAX (bg flush) and we will always be trying to allocate
MAX_WRITEPAGES, incase of a slightly fragmented FS, we might keep
falling into slower mballoc criterias and might waste a lot of time
scanning the groups.

> + if (blk_end > index + blk_len)
> + blk_len = blk_end - index + 1;
> +
> +retry:
> + map.m_lblk = index;
> + map.m_len = min_t(unsigned int, MAX_WRITEPAGES_EXTENT_LEN, blk_len);
> + ret = ext4_map_blocks(NULL, inode, &map,
> + EXT4_GET_BLOCKS_IO_SUBMIT | EXT4_EX_NOCACHE);

Do we really need the IO_SUBMIT flag here now that we are:
1. Not using ordered data
2. We anyways don't use it in ext4_iomap_map_one_extent().

I think we can drop it.

> + if (ret < 0)
> + return ret;
> +
> + /*
> + * The map is not a delalloc extent, it must either be a hole
> + * or an extent which have already been allocated.
> + */
> + if (!(map.m_flags & EXT4_MAP_DELAYED))
> + goto out;
> +
> + /* Map one delalloc extent. */
> + ret = ext4_iomap_map_one_extent(inode, &map);
> + if (ret < 0) {
> + if (ext4_emergency_state(sb))
> + return ret;
> +
> + /*
> + * Retry transient ENOSPC errors, if
> + * ext4_count_free_blocks() is non-zero, a commit
> + * should free up blocks.
> + */
> + if (ret == -ENOSPC && journal && ext4_count_free_clusters(sb)) {
> + jbd2_journal_force_commit_nested(journal);
> + goto retry;
> + }
> +
> + ext4_msg(sb, KERN_CRIT,
> + "Delayed block allocation failed for inode %llu at logical offset %llu with max blocks %u with error %d",
> + inode->i_ino, (unsigned long long)map.m_lblk,
> + (unsigned int)map.m_len, -ret);
> + ext4_msg(sb, KERN_CRIT,
> + "This should not happen!! Data will be lost\n");
> + if (ret == -ENOSPC)
> + ext4_print_free_blocks(inode);
> + return ret;
> + }
> +out:
> + ext4_set_iomap(inode, &wpc->iomap, &map, offset, dirty_len, 0);
> + return 0;
> +}
> +

<snip>
>