Re: [PATCH v4 09/23] ext4: implement writeback path using iomap

From: Ojaswin Mujoo

Date: Fri May 29 2026 - 11:58:42 EST

On Fri, May 29, 2026 at 10:12:12PM +0800, Zhang Yi wrote:
> Hi, Ojaswin!
>
> On 5/27/2026 8:49 PM, Ojaswin Mujoo wrote:
> > On Mon, May 11, 2026 at 03:23:29PM +0800, Zhang Yi wrote:
> > > From: Zhang Yi <yi.zhang@xxxxxxxxxx>
> > >
> > > Add the iomap writeback path for ext4 buffered I/O. This introduces:
> > >
> > > - ext4_iomap_writepages(): the main writeback entry point.
> > > - ext4_writeback_ops: a new iomap_writeback_ops instance to handle
> > > block mapping and I/O submission.
> > > - A new end I/O worker for converting unwritten extents, updating file
> > > size, and handling DATA_ERR_ABORT after I/O completion.
> > >
> > > Core implementation details:
> > >
> > > - ->writeback_range() callback
> > > Calls ext4_iomap_map_writeback_range() to query the longest range of
> > > existing mapped extents. For performance, when a block range is not
> > > yet allocated, it allocates based on the writeback length and delalloc
> > > extent length, rather than allocating for a single folio at a time.
> > > The folio is then added to an iomap_ioend instance.
> > >
> > > - ->writeback_submit() callback
> > > Registers ext4_iomap_end_bio() as the end bio callback. This callback
> > > schedules a worker to handle:
> > > - Unwritten extent conversion.
> > > - i_disksize update after data is written back.
> > > - Journal abort on writeback I/O failure.
> >
> > Hi Zhang, the changes look good. I have a few comments below:
> > >
> > > Key changes and considerations:
> > >
> > > - Append write and unwritten extents
> > > Since data=ordered mode is not used to prevent stale data exposure
> > > during append writebacks, new blocks are always allocated as unwritten
> > > extents (i.e. always enable dioread_nolock), and i_disksize update is
> > > postponed until I/O completion.
> >
> > Makes sense.
> >
> > > Additionally, the deadlock that the
> > > reserve handle was expected to resolve does not occur anymore.
> >
> > I guess this is since we don't use ordered data so we can't block on
> > starting a txn in end io.
>
> Yep.
>
> >
> > > Therefore, the end I/O worker can start a normal journal handle
> > > instead of a reserve handle when converting unwritten extents.
> > >
> > > - Lock ordering
> > > The ->writeback_range() callback runs under the folio lock, requiring
> > > the journal handle to be started under that same lock. This reverses
> > > the order compared to the buffer_head writeback path. The lock ordering
> > > documentation in super.c has been updated accordingly.
> > >
> > > Signed-off-by: Zhang Yi <yi.zhang@xxxxxxxxxx>
> > > ---
> > > fs/ext4/ext4.h | 4 +
> > > fs/ext4/inode.c | 208 +++++++++++++++++++++++++++++++++++++++++-
> > > fs/ext4/page-io.c | 126 +++++++++++++++++++++++++
> > > fs/ext4/super.c | 7 +-
> > > fs/iomap/ioend.c | 3 +-
> > > include/linux/iomap.h | 1 +
> > > 6 files changed, 346 insertions(+), 3 deletions(-)
> > >
> > > diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
> > > index 4832e7f7db82..078feda47e36 100644
> > > --- a/fs/ext4/ext4.h
> > > +++ b/fs/ext4/ext4.h
> > > @@ -1173,6 +1173,8 @@ struct ext4_inode_info {
> > > */
> > > struct list_head i_rsv_conversion_list;
> > > struct work_struct i_rsv_conversion_work;
> > > + struct list_head i_iomap_ioend_list;
> > > + struct work_struct i_iomap_ioend_work;
> > > /*
> > > * Transactions that contain inode's metadata needed to complete
> > > @@ -3870,6 +3872,8 @@ int ext4_bio_write_folio(struct ext4_io_submit *io, struct folio *page,
> > > size_t len);
> > > extern struct ext4_io_end_vec *ext4_alloc_io_end_vec(ext4_io_end_t *io_end);
> > > extern struct ext4_io_end_vec *ext4_last_io_end_vec(ext4_io_end_t *io_end);
> > > +extern void ext4_iomap_end_io(struct work_struct *work);
> > > +extern void ext4_iomap_end_bio(struct bio *bio);
> > > /* mmp.c */
> > > extern int ext4_multi_mount_protect(struct super_block *, ext4_fsblk_t);
> > > diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> > > index 1ae7d3f4a1c8..a80195bd6f20 100644
> > > --- a/fs/ext4/inode.c
> > > +++ b/fs/ext4/inode.c
> > > @@ -44,6 +44,7 @@
> > > #include <linux/iversion.h>
> > > #include "ext4_jbd2.h"
> > > +#include "ext4_extents.h"
> > > #include "xattr.h"
> > > #include "acl.h"
> > > #include "truncate.h"
> > > @@ -4120,10 +4121,215 @@ static void ext4_iomap_readahead(struct readahead_control *rac)
> > > iomap_bio_readahead(rac, &ext4_iomap_buffered_read_ops);
> > > }
> > > +static int ext4_iomap_map_one_extent(struct inode *inode,
> > > + struct ext4_map_blocks *map)
> > > +{
> > > + struct extent_status es;
> > > + handle_t *handle = NULL;
> > > + int credits, map_flags;
> > > + int retval;
> > > +
> > > + credits = ext4_chunk_trans_blocks(inode, map->m_len);
> > > + handle = ext4_journal_start(inode, EXT4_HT_WRITE_PAGE, credits);
> > > + if (IS_ERR(handle))
> > > + return PTR_ERR(handle);
> > > +
> > > + map->m_flags = 0;
> > > + /*
> > > + * It is necessary to look up extent and map blocks under i_data_sem
> > > + * in write mode, otherwise, the delalloc extent may become stale
> > > + * during concurrent truncate operations.
> > > + */
> > > + ext4_fc_track_inode(handle, inode);
> > > + down_write(&EXT4_I(inode)->i_data_sem);
> > > + if (ext4_es_lookup_extent(inode, map->m_lblk, NULL, &es, &map->m_seq)) {
> > > + retval = es.es_len - (map->m_lblk - es.es_lblk);
> > > + map->m_len = min_t(unsigned int, retval, map->m_len);
> > > +
> > > + if (ext4_es_is_delayed(&es)) {
> >
> > I understand that it is okay for us to rely on extent status ==
> > delayed here because we never reclaim delayed es entries and hence we
> > are sure to not skip any delayed block allocations here.
>
> Yeah, right.
>
> >
> > > + map->m_flags |= EXT4_MAP_DELAYED;
> > > + trace_ext4_da_write_pages_extent(inode, map);
> > > + /*
> > > + * Call ext4_map_create_blocks() to allocate any
> > > + * delayed allocation blocks. It is possible that
> > > + * we're going to need more metadata blocks, however
> > > + * we must not fail because we're in writeback and
> > > + * there is nothing we can do so it might result in
> > > + * data loss. So use reserved blocks to allocate
> > > + * metadata if possible.
> > > + */
> > > + map_flags = EXT4_GET_BLOCKS_CREATE_UNWRIT_EXT |
> > > + EXT4_GET_BLOCKS_METADATA_NOFAIL |
> > > + EXT4_EX_NOCACHE;
> > > +
> > > + retval = ext4_map_create_blocks(handle, inode, map,
> > > + map_flags);
> > > + if (retval > 0)
> > > + ext4_fc_track_range(handle, inode, map->m_lblk,
> > > + map->m_lblk + map->m_len - 1);
> > > + goto out;
> > > + } else if (unlikely(ext4_es_is_hole(&es)))
> >
> > Now that you've fixed the partial invalidate in iomap (patch 12/23)
> > can we still hit this hole case?
>
> Theoretically, no hole should be encountered; this is just defensive
> programming.
>
> >
> > > + goto out;
> > > +
> > > + /* Found written or unwritten extent. */
> > > + map->m_pblk = ext4_es_pblock(&es) + map->m_lblk - es.es_lblk;
> > > + map->m_flags = ext4_es_is_written(&es) ?
> > > + EXT4_MAP_MAPPED : EXT4_MAP_UNWRITTEN;
> > > + goto out;
> > > + }
> > > +
> > > + retval = ext4_map_query_blocks(handle, inode, map, EXT4_EX_NOCACHE);
> > > +out:
> > > + up_write(&EXT4_I(inode)->i_data_sem);
> > > + ext4_journal_stop(handle);
> > > + return retval < 0 ? retval : 0;
> > > +}
> > > +
> > > +static int ext4_iomap_map_writeback_range(struct iomap_writepage_ctx *wpc,
> > > + loff_t offset, unsigned int dirty_len)
> > > +{
> > > + struct inode *inode = wpc->inode;
> > > + struct super_block *sb = inode->i_sb;
> > > + struct journal_s *journal = EXT4_SB(sb)->s_journal;
> > > + struct ext4_map_blocks map;
> > > + unsigned int blkbits = inode->i_blkbits;
> > > + unsigned int index = offset >> blkbits;
> > > + unsigned int blk_end, blk_len;
> > > + int ret;
> > > +
> > > + ret = ext4_emergency_state(sb);
> > > + if (unlikely(ret))
> > > + return ret;
> > > +
> > > + /* Check validity of the cached writeback mapping. */
> > > + if (offset >= wpc->iomap.offset &&
> > > + offset < wpc->iomap.offset + wpc->iomap.length &&
> > > + ext4_iomap_valid(inode, &wpc->iomap))
> > > + return 0;
> > > +
> > > + blk_len = dirty_len >> blkbits;
> > > + blk_end = min_t(unsigned int, (wpc->wbc->range_end >> blkbits),
> > > + (UINT_MAX - 1));
> >
> > This is an interesting idea. I'm just a bit worried when we have
> > range_end == LLONG_MAX (bg flush) and we will always be trying to allocate
> > MAX_WRITEPAGES, incase of a slightly fragmented FS, we might keep
> > falling into slower mballoc criterias and might waste a lot of time
> > scanning the groups.
>
> Actually, MAX_WRITEPAGES is not allocated every time; the allocated
> length also depends on the length of data that has already been delayed
> for writing, and the smaller value is taken. If the user has indeed

Hmm so we take the blk_end based on range_end (which is LLONG_MAX for bg
flusher) and then our blk_len will be set accordingly and would become a
large number as well. Then we will set map.m_len based on this blk_len
and MAX_WRITEPAGES. Am I missing something that clamps our m_len?

> performed delalloc writes on data of up to MAX_WRITEPAGES in length,
> then regardless of how fragmented the file system is, I will need to
> allocate blocks of that length. Reducing the number of calls is always
> beneficial.
>
> >
> > > + if (blk_end > index + blk_len)
> > > + blk_len = blk_end - index + 1;
> > > +
> > > +retry:
> > > + map.m_lblk = index;
> > > + map.m_len = min_t(unsigned int, MAX_WRITEPAGES_EXTENT_LEN, blk_len);
> > > + ret = ext4_map_blocks(NULL, inode, &map,
> > > + EXT4_GET_BLOCKS_IO_SUBMIT | EXT4_EX_NOCACHE);
> >
> > Do we really need the IO_SUBMIT flag here now that we are:
> > 1. Not using ordered data
> > 2. We anyways don't use it in ext4_iomap_map_one_extent().
> >
> > I think we can drop it.
>
> We can't drop it, because IO_SUBMIT is also used to avoid the check of
> ext4_check_map_extents_env() in ext4_map_blocks() under the writeback
> path.

Yes, noted :)

Thanks,
Ojaswin

>
> Cheers,
> Yi.
>
> >
> > > + if (ret < 0)
> > > + return ret;
> > > +
> > > + /*
> > > + * The map is not a delalloc extent, it must either be a hole
> > > + * or an extent which have already been allocated.
> > > + */
> > > + if (!(map.m_flags & EXT4_MAP_DELAYED))
> > > + goto out;
> > > +
> > > + /* Map one delalloc extent. */
> > > + ret = ext4_iomap_map_one_extent(inode, &map);
> > > + if (ret < 0) {
> > > + if (ext4_emergency_state(sb))
> > > + return ret;
> > > +
> > > + /*
> > > + * Retry transient ENOSPC errors, if
> > > + * ext4_count_free_blocks() is non-zero, a commit
> > > + * should free up blocks.
> > > + */
> > > + if (ret == -ENOSPC && journal && ext4_count_free_clusters(sb)) {
> > > + jbd2_journal_force_commit_nested(journal);
> > > + goto retry;
> > > + }
> > > +
> > > + ext4_msg(sb, KERN_CRIT,
> > > + "Delayed block allocation failed for inode %llu at logical offset %llu with max blocks %u with error %d",
> > > + inode->i_ino, (unsigned long long)map.m_lblk,
> > > + (unsigned int)map.m_len, -ret);
> > > + ext4_msg(sb, KERN_CRIT,
> > > + "This should not happen!! Data will be lost\n");
> > > + if (ret == -ENOSPC)
> > > + ext4_print_free_blocks(inode);
> > > + return ret;
> > > + }
> > > +out:
> > > + ext4_set_iomap(inode, &wpc->iomap, &map, offset, dirty_len, 0);
> > > + return 0;
> > > +}
> > > +
> >
> > <snip>
> > >
>