Re: [PATCH v4 14/23] ext4: implement partial block zero range path using iomap

From: Zhang Yi

Date: Wed Jun 17 2026 - 09:27:04 EST


On 6/17/2026 6:56 PM, Brian Foster wrote:
On Wed, Jun 17, 2026 at 04:14:40PM +0800, Zhang Yi wrote:
On 6/16/2026 8:28 PM, Jan Kara wrote:
On Mon 11-05-26 15:23:34, Zhang Yi wrote:
From: Zhang Yi <yi.zhang@xxxxxxxxxx>

Introduce a new iomap_ops instance, ext4_iomap_zero_ops, along with
ext4_iomap_block_zero_range() to implement block zeroing via the iomap
infrastructure for ext4.

ext4_iomap_block_zero_range() calls iomap_zero_range() with
ext4_iomap_zero_begin() as the callback. The callback locates and zeros
out either a mapped partial block or a dirty, unwritten partial block.

Important constraints:

Zeroing out under an active journal handle can cause deadlock, because
the order of acquiring the folio lock and starting a handle is
inconsistent with the iomap writeback path.

Therefore, ext4_iomap_block_zero_range():
- Must NOT be called under an active handle.
- Cannot rely on data=ordered mode to ensure zeroed data persistence
before updating i_disksize (for the cases of post-EOF append write,
post-EOF fallocate, and truncate up). In subsequent patches, we will
address this by synchronizing commit I/O but doesn't waiting for
completion, and updating i_disksize to i_size only after the zeroed
data has been written back.

Signed-off-by: Zhang Yi <yi.zhang@xxxxxxxxxx>
---
fs/ext4/inode.c | 92 +++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 92 insertions(+)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index c6fe42d012fc..e0dae2501292 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -4101,6 +4101,51 @@ static int ext4_iomap_buffered_da_write_end(struct inode *inode, loff_t offset,
return 0;
}
+static int ext4_iomap_zero_begin(struct inode *inode,
+ loff_t offset, loff_t length, unsigned int flags,
+ struct iomap *iomap, struct iomap *srcmap)
+{
+ struct iomap_iter *iter = container_of(iomap, struct iomap_iter, iomap);

This looks like a layering violation to me. I don't think you can safely
assume the iomap you're passed is a part of iomap_iter...

+ struct ext4_map_blocks map;
+ u8 blkbits = inode->i_blkbits;
+ unsigned int iomap_flags = 0;
+ int ret;
+
+ ret = ext4_emergency_state(inode->i_sb);
+ if (unlikely(ret))
+ return ret;
+
+ if (WARN_ON_ONCE(!(flags & IOMAP_ZERO)))
+ return -EINVAL;
+
+ ret = ext4_iomap_map_blocks(inode, offset, length, NULL, &map);
+ if (ret < 0)
+ return ret;
+
+ /*
+ * Look up dirty folios for unwritten mappings within EOF. Providing
+ * this bypasses the flush iomap uses to trigger extent conversion
+ * when unwritten mappings have dirty pagecache in need of zeroing.
+ */
+ if (map.m_flags & EXT4_MAP_UNWRITTEN) {
+ loff_t start = ((loff_t)map.m_lblk) << blkbits;
+ loff_t end = ((loff_t)map.m_lblk + map.m_len) << blkbits;
+
+ iomap_fill_dirty_folios(iter, &start, end, &iomap_flags);
+ if ((start >> blkbits) < map.m_lblk + map.m_len)
+ map.m_len = (start >> blkbits) - map.m_lblk;
+ }

... and you need access to iter only for this which seems to be really a
hack that's trying to outsmart the iomap code. I have to admit I don't
fully understand what you are trying to achieve here. Are you trying to
avoid flushing of the range that will be zeroed out?

This logic is copied from the XFS and iomap infrastructure. Its primary
purpose is to optimize the zeroing operations on dirty written extents.
It was introduced by Brian in [1].

The history as I understand it: originally, the iomap infrastructure
could not zero dirty unwritten extents during zero range processing,
which led to stale data exposure. XFS had to flush dirty ranges itself
before zeroing — a workaround that was not generic.

In c5c810b94cf ("iomap: fix handling of dirty folios over unwritten
extents"), Brian added an unconditional flush in the iomap
infrastructure, ensuring that by the time zeroing runs the extent has
already been converted to written so the zero can proceed correctly.
However, this flush was too heavy and introduced noticeable performance
overhead.

This was then optimized in 7d9b474ee4cc3 ("iomap: make zero range flush
conditional on unwritten mappings"), which restricts flushing to only
dirty pagecache over unwritten or hole mappings.

Brian later proposed a different approach: rather than relying on flush
to convert the extent type, find dirty folios ahead of the zero range
and zero the dirty unwritten extents directly. In [1] he added this
lookup logic. The filesystem now supplies a folio batch (a collection of
dirty folios) via the iomap begin callback, and zero range iterates over
these dirty folios to perform zeroing. Clean regions not covered by the
batch are simply skipped. This entirely eliminates the need to flush.

[1] https://lore.kernel.org/linux-xfs/20251003134642.604736-1-bfoster@xxxxxxxxxx/

If I understand correctly, the current approach is a compromise, and
Brian is still working on this. Perhaps ext4 and XFS could work together
on improvements in the future?


I think that about covers it!

I do agree wrt to the iomap_iter thing in that it doesn't seem like the
most elegant thing. I considered that a bit of a roadblock when first
hacking on the batch stuff, but IIRC somebody pointed out that there was
precedent already so I didn't think too hard about it after that. Indeed
if you poke around, other filesystems use a similar pattern to access
iter->private for whatever private context is carried around.

FWIW, one of the longer term thoughts for the dirty folio stuff was to
eventually lift it out of the callback and just have iomap do it for the
fs. That would eliminate this particular pattern and probably clean
things up a bit, but there were also some other caveats with that that
aren't top of mind atm (IIRC, things like dealing with map trimming,
etc., but I haven't had a chance to think about it in a while).

Thanks for the additional information. This looks like a promising
direction. The usage on the ext4 side is relatively simple at the
moment, so the main concern is whether this would cause any issues for
XFS and other filesystems. This can be investigated later when time
permits — it's not urgent at this point.


Also note that this isn't necessarily a hard requirement. It's an
optional optimization. iomap will flush and retry in the dirty
pagecache+unwritten extent case if the fs hasn't otherwised provided
folios to make sure it zeroes properly, it's just that performance of
that may or may not be acceptable for your use case.

Brian

Yes, we will try to adopt the folio batch approach as much as possible,
hoping to reduce unnecessary performance overhead through it.

Thanks,
Yi.



+ ret = iomap_zero_range(inode, from, length, did_zero,
+ &ext4_iomap_zero_ops, &ext4_iomap_write_ops,
+ NULL);
+ if (ret)
+ return ret;
+
+ /*
+ * TODO: The iomap does not distinguish between different types of
+ * zeroing and always sets zero_written if a zeroing operation is
+ * performed, which may result in unnecessary order operations.
+ */

Is this still true after your fix to did_zero handling?

Yeah. Currently, iomap_zero_range() can only report whether a zeroing
operation has occurred through did_zero parameter, but it cannot
distinguish whether the zeroed range is a written extent that already
exists on disk. That is, even if the zeroing is performed on a delalloc
extent, did_zero will still return true.

Thanks,
Yi.


+ if (did_zero && zero_written)
+ *zero_written = *did_zero;
+
+ return 0;
+}
+
/*
* Zeros out a mapping of length 'length' starting from file offset
* 'from'. The range to be zero'd must be contained with in one block.

Honza