Re: [PATCH v4 14/23] ext4: implement partial block zero range path using iomap

From: Brian Foster

Date: Wed Jun 17 2026 - 07:29:47 EST

On Wed, Jun 17, 2026 at 04:14:40PM +0800, Zhang Yi wrote:
> On 6/16/2026 8:28 PM, Jan Kara wrote:
> > On Mon 11-05-26 15:23:34, Zhang Yi wrote:
> >> From: Zhang Yi <yi.zhang@xxxxxxxxxx>
> >>
> >> Introduce a new iomap_ops instance, ext4_iomap_zero_ops, along with
> >> ext4_iomap_block_zero_range() to implement block zeroing via the iomap
> >> infrastructure for ext4.
> >>
> >> ext4_iomap_block_zero_range() calls iomap_zero_range() with
> >> ext4_iomap_zero_begin() as the callback. The callback locates and zeros
> >> out either a mapped partial block or a dirty, unwritten partial block.
> >>
> >> Important constraints:
> >>
> >> Zeroing out under an active journal handle can cause deadlock, because
> >> the order of acquiring the folio lock and starting a handle is
> >> inconsistent with the iomap writeback path.
> >>
> >> Therefore, ext4_iomap_block_zero_range():
> >> - Must NOT be called under an active handle.
> >> - Cannot rely on data=ordered mode to ensure zeroed data persistence
> >> before updating i_disksize (for the cases of post-EOF append write,
> >> post-EOF fallocate, and truncate up). In subsequent patches, we will
> >> address this by synchronizing commit I/O but doesn't waiting for
> >> completion, and updating i_disksize to i_size only after the zeroed
> >> data has been written back.
> >>
> >> Signed-off-by: Zhang Yi <yi.zhang@xxxxxxxxxx>
> >> ---
> >> fs/ext4/inode.c | 92 +++++++++++++++++++++++++++++++++++++++++++++++++
> >> 1 file changed, 92 insertions(+)
> >>
> >> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> >> index c6fe42d012fc..e0dae2501292 100644
> >> --- a/fs/ext4/inode.c
> >> +++ b/fs/ext4/inode.c
> >> @@ -4101,6 +4101,51 @@ static int ext4_iomap_buffered_da_write_end(struct inode *inode, loff_t offset,
> >> return 0;
> >> }
> >>
> >> +static int ext4_iomap_zero_begin(struct inode *inode,
> >> + loff_t offset, loff_t length, unsigned int flags,
> >> + struct iomap *iomap, struct iomap *srcmap)
> >> +{
> >> + struct iomap_iter *iter = container_of(iomap, struct iomap_iter, iomap);
> >
> > This looks like a layering violation to me. I don't think you can safely
> > assume the iomap you're passed is a part of iomap_iter...
> >
> >> + struct ext4_map_blocks map;
> >> + u8 blkbits = inode->i_blkbits;
> >> + unsigned int iomap_flags = 0;
> >> + int ret;
> >> +
> >> + ret = ext4_emergency_state(inode->i_sb);
> >> + if (unlikely(ret))
> >> + return ret;
> >> +
> >> + if (WARN_ON_ONCE(!(flags & IOMAP_ZERO)))
> >> + return -EINVAL;
> >> +
> >> + ret = ext4_iomap_map_blocks(inode, offset, length, NULL, &map);
> >> + if (ret < 0)
> >> + return ret;
> >> +
> >> + /*
> >> + * Look up dirty folios for unwritten mappings within EOF. Providing
> >> + * this bypasses the flush iomap uses to trigger extent conversion
> >> + * when unwritten mappings have dirty pagecache in need of zeroing.
> >> + */
> >> + if (map.m_flags & EXT4_MAP_UNWRITTEN) {
> >> + loff_t start = ((loff_t)map.m_lblk) << blkbits;
> >> + loff_t end = ((loff_t)map.m_lblk + map.m_len) << blkbits;
> >> +
> >> + iomap_fill_dirty_folios(iter, &start, end, &iomap_flags);
> >> + if ((start >> blkbits) < map.m_lblk + map.m_len)
> >> + map.m_len = (start >> blkbits) - map.m_lblk;
> >> + }
> >
> > ... and you need access to iter only for this which seems to be really a
> > hack that's trying to outsmart the iomap code. I have to admit I don't
> > fully understand what you are trying to achieve here. Are you trying to
> > avoid flushing of the range that will be zeroed out?
>
> This logic is copied from the XFS and iomap infrastructure. Its primary
> purpose is to optimize the zeroing operations on dirty written extents.
> It was introduced by Brian in [1].
>
> The history as I understand it: originally, the iomap infrastructure
> could not zero dirty unwritten extents during zero range processing,
> which led to stale data exposure. XFS had to flush dirty ranges itself
> before zeroing — a workaround that was not generic.
>
> In c5c810b94cf ("iomap: fix handling of dirty folios over unwritten
> extents"), Brian added an unconditional flush in the iomap
> infrastructure, ensuring that by the time zeroing runs the extent has
> already been converted to written so the zero can proceed correctly.
> However, this flush was too heavy and introduced noticeable performance
> overhead.
>
> This was then optimized in 7d9b474ee4cc3 ("iomap: make zero range flush
> conditional on unwritten mappings"), which restricts flushing to only
> dirty pagecache over unwritten or hole mappings.
>
> Brian later proposed a different approach: rather than relying on flush
> to convert the extent type, find dirty folios ahead of the zero range
> and zero the dirty unwritten extents directly. In [1] he added this
> lookup logic. The filesystem now supplies a folio batch (a collection of
> dirty folios) via the iomap begin callback, and zero range iterates over
> these dirty folios to perform zeroing. Clean regions not covered by the
> batch are simply skipped. This entirely eliminates the need to flush.
>
> [1] https://lore.kernel.org/linux-xfs/20251003134642.604736-1-bfoster@xxxxxxxxxx/
>
> If I understand correctly, the current approach is a compromise, and
> Brian is still working on this. Perhaps ext4 and XFS could work together
> on improvements in the future?
>

I think that about covers it!

I do agree wrt to the iomap_iter thing in that it doesn't seem like the
most elegant thing. I considered that a bit of a roadblock when first
hacking on the batch stuff, but IIRC somebody pointed out that there was
precedent already so I didn't think too hard about it after that. Indeed
if you poke around, other filesystems use a similar pattern to access
iter->private for whatever private context is carried around.

FWIW, one of the longer term thoughts for the dirty folio stuff was to
eventually lift it out of the callback and just have iomap do it for the
fs. That would eliminate this particular pattern and probably clean
things up a bit, but there were also some other caveats with that that
aren't top of mind atm (IIRC, things like dealing with map trimming,
etc., but I haven't had a chance to think about it in a while).

Also note that this isn't necessarily a hard requirement. It's an
optional optimization. iomap will flush and retry in the dirty
pagecache+unwritten extent case if the fs hasn't otherwised provided
folios to make sure it zeroes properly, it's just that performance of
that may or may not be acceptable for your use case.

Brian

> >
> >> + ret = iomap_zero_range(inode, from, length, did_zero,
> >> + &ext4_iomap_zero_ops, &ext4_iomap_write_ops,
> >> + NULL);
> >> + if (ret)
> >> + return ret;
> >> +
> >> + /*
> >> + * TODO: The iomap does not distinguish between different types of
> >> + * zeroing and always sets zero_written if a zeroing operation is
> >> + * performed, which may result in unnecessary order operations.
> >> + */
> >
> > Is this still true after your fix to did_zero handling?
>
> Yeah. Currently, iomap_zero_range() can only report whether a zeroing
> operation has occurred through did_zero parameter, but it cannot
> distinguish whether the zeroed range is a written extent that already
> exists on disk. That is, even if the zeroing is performed on a delalloc
> extent, did_zero will still return true.
>
> Thanks,
> Yi.
>
> >
> >> + if (did_zero && zero_written)
> >> + *zero_written = *did_zero;
> >> +
> >> + return 0;
> >> +}
> >> +
> >> /*
> >> * Zeros out a mapping of length 'length' starting from file offset
> >> * 'from'. The range to be zero'd must be contained with in one block.
> >
> > Honza
>