Re: [PATCH v4 14/23] ext4: implement partial block zero range path using iomap

From: Zhang Yi

Date: Wed Jun 17 2026 - 09:02:41 EST

On 6/17/2026 6:50 PM, Jan Kara wrote:

On Wed 17-06-26 16:14:40, Zhang Yi wrote:

On 6/16/2026 8:28 PM, Jan Kara wrote:

On Mon 11-05-26 15:23:34, Zhang Yi wrote:

From: Zhang Yi <yi.zhang@xxxxxxxxxx>

Introduce a new iomap_ops instance, ext4_iomap_zero_ops, along with
ext4_iomap_block_zero_range() to implement block zeroing via the iomap
infrastructure for ext4.

ext4_iomap_block_zero_range() calls iomap_zero_range() with
ext4_iomap_zero_begin() as the callback. The callback locates and zeros
out either a mapped partial block or a dirty, unwritten partial block.

Important constraints:

Zeroing out under an active journal handle can cause deadlock, because
the order of acquiring the folio lock and starting a handle is
inconsistent with the iomap writeback path.

Therefore, ext4_iomap_block_zero_range():
- Must NOT be called under an active handle.
- Cannot rely on data=ordered mode to ensure zeroed data persistence
before updating i_disksize (for the cases of post-EOF append write,
post-EOF fallocate, and truncate up). In subsequent patches, we will
address this by synchronizing commit I/O without waiting for
completion, and updating i_disksize to i_size only after the zeroed
data has been written back.

Signed-off-by: Zhang Yi <yi.zhang@xxxxxxxxxx>
---
fs/ext4/inode.c | 92 +++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 92 insertions(+)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index c6fe42d012fc..e0dae2501292 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -4101,6 +4101,51 @@ static int ext4_iomap_buffered_da_write_end(struct inode *inode, loff_t offset,
return 0;
}
+static int ext4_iomap_zero_begin(struct inode *inode,
+ loff_t offset, loff_t length, unsigned int flags,
+ struct iomap *iomap, struct iomap *srcmap)
+{
+ struct iomap_iter *iter = container_of(iomap, struct iomap_iter, iomap);

This looks like a layering violation to me. I don't think you can safely
assume the iomap you're passed is a part of iomap_iter...

+ struct ext4_map_blocks map;
+ u8 blkbits = inode->i_blkbits;
+ unsigned int iomap_flags = 0;
+ int ret;
+
+ ret = ext4_emergency_state(inode->i_sb);
+ if (unlikely(ret))
+ return ret;
+
+ if (WARN_ON_ONCE(!(flags & IOMAP_ZERO)))
+ return -EINVAL;
+
+ ret = ext4_iomap_map_blocks(inode, offset, length, NULL, &map);
+ if (ret < 0)
+ return ret;
+
+ /*
+ * Look up dirty folios for unwritten mappings within EOF. Providing
+ * this bypasses the flush iomap uses to trigger extent conversion
+ * when unwritten mappings have dirty pagecache in need of zeroing.
+ */
+ if (map.m_flags & EXT4_MAP_UNWRITTEN) {
+ loff_t start = ((loff_t)map.m_lblk) << blkbits;
+ loff_t end = ((loff_t)map.m_lblk + map.m_len) << blkbits;
+
+ iomap_fill_dirty_folios(iter, &start, end, &iomap_flags);
+ if ((start >> blkbits) < map.m_lblk + map.m_len)
+ map.m_len = (start >> blkbits) - map.m_lblk;
+ }

... and you need access to iter only for this which seems to be really a
hack that's trying to outsmart the iomap code. I have to admit I don't
fully understand what you are trying to achieve here. Are you trying to
avoid flushing of the range that will be zeroed out?

This logic is copied from the XFS and iomap infrastructure. Its primary
purpose is to optimize the zeroing operations on dirty unwritten extents.
It was introduced by Brian in [1].
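
For anyone following the arithmetic in the quoted hunk, here is a minimal
userspace sketch of the trimming step that follows the dirty-folio lookup.
It is only a model: struct map_model and trim_map_to_lookup_end() are
hypothetical names, not kernel symbols, and it assumes (as the hunk
implies) that the lookup advances 'start' to the byte offset where it
stopped.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the two ext4_map_blocks fields used here. */
struct map_model {
	uint32_t m_lblk;	/* first logical block of the mapping */
	uint32_t m_len;		/* length of the mapping in blocks */
};

/*
 * Model of the trimming step: 'start' is the byte offset at which the
 * dirty-folio lookup stopped (assumption). If that offset falls inside
 * the mapping, shorten the mapping so that it ends there.
 */
void trim_map_to_lookup_end(struct map_model *map, int64_t start,
			    unsigned int blkbits)
{
	uint64_t stop_blk = (uint64_t)start >> blkbits;

	/* The lookup begins at the mapping start and only moves forward. */
	assert(stop_blk >= map->m_lblk);

	if (stop_blk < (uint64_t)map->m_lblk + map->m_len)
		map->m_len = (uint32_t)(stop_blk - map->m_lblk);
}
```

With 4 KiB blocks (blkbits = 12), a mapping of blocks 10..17 and a lookup
that stops at the start of block 14 leaves a 4-block mapping; a lookup
that reaches the end of the mapping leaves it unchanged.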

Ah, I see. I still find it hacky but apparently it is an established hack
in iomap :). Fair.

The history as I understand it: originally, the iomap infrastructure
could not zero dirty unwritten extents during zero range processing,
which led to stale data exposure. XFS had to flush dirty ranges itself
before zeroing — a workaround that was not generic.

In c5c810b94cf ("iomap: fix handling of dirty folios over unwritten
extents"), Brian added an unconditional flush in the iomap
infrastructure, ensuring that by the time zeroing runs the extent has
already been converted to written so the zero can proceed correctly.
However, this flush was too heavy and introduced noticeable performance
overhead.

This was then optimized in 7d9b474ee4cc3 ("iomap: make zero range flush
conditional on unwritten mappings"), which restricts flushing to only
dirty pagecache over unwritten or hole mappings.

Brian later proposed a different approach: rather than relying on flush
to convert the extent type, find dirty folios ahead of the zero range
and zero the dirty unwritten extents directly. In [1] he added this
lookup logic. The filesystem now supplies a folio batch (a collection of
dirty folios) via the iomap begin callback, and zero range iterates over
these dirty folios to perform zeroing. Clean regions not covered by the
batch are simply skipped. This entirely eliminates the need to flush.

[1] https://lore.kernel.org/linux-xfs/20251003134642.604736-1-bfoster@xxxxxxxxxx/
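
The batch-based approach described above can be illustrated with a small
userspace model. This is a sketch under simplifying assumptions, not the
iomap implementation: struct folio_model, MODEL_FOLIO_SIZE and
zero_range_dirty_only() are hypothetical, and a plain array with a dirty
flag stands in for the folio batch supplied by the filesystem.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define MODEL_FOLIO_SIZE 8	/* tiny "folio" to keep the example readable */

/* Hypothetical model of a cached folio over an unwritten extent. */
struct folio_model {
	bool dirty;
	unsigned char data[MODEL_FOLIO_SIZE];
};

/*
 * Zero bytes [pos, pos + len) without flushing anything first. Dirty
 * folios are zeroed in place; clean folios are skipped, since the
 * unwritten extent underneath already reads back as zeroes.
 * Returns true if any byte was actually zeroed.
 */
bool zero_range_dirty_only(struct folio_model *folios, size_t nr_folios,
			   size_t pos, size_t len)
{
	size_t end = pos + len;
	bool did_zero = false;

	assert(end <= nr_folios * MODEL_FOLIO_SIZE);

	while (pos < end) {
		size_t idx = pos / MODEL_FOLIO_SIZE;
		size_t off = pos % MODEL_FOLIO_SIZE;
		size_t n = MODEL_FOLIO_SIZE - off;

		if (n > end - pos)
			n = end - pos;

		if (folios[idx].dirty) {
			memset(folios[idx].data + off, 0, n);
			did_zero = true;
		}
		pos += n;
	}
	return did_zero;
}
```

In this model a range that covers only clean folios returns false and
touches no data, which corresponds to the "clean regions are simply
skipped" case.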

Thanks for the summary! I was confused because I somehow thought this was
about fallocate(FALLOC_FL_ZERO_RANGE), so I was wondering why we cannot
just evict the page cache and be done with it. Only after reading
everything again did I realize this is about zeroing partial blocks on hole
punch etc. And we may need to really handle multiple folios because XFS
also uses this mechanism to implement FALLOC_FL_ZERO_RANGE for zoned
storage. Ugh. OK, anyway for now this looks like your patch is following
how things are expected to be done so feel free to add:

Reviewed-by: Jan Kara <jack@xxxxxxx>

+ /*
+ * TODO: The iomap does not distinguish between different types of
+ * zeroing and always sets zero_written if a zeroing operation is
+ * performed, which may result in unnecessary order operations.
+ */

Is this still true after your fix to did_zero handling?

Yeah. Currently, iomap_zero_range() can only report whether a zeroing
operation has occurred through the did_zero parameter, but it cannot
distinguish whether the zeroed range is a written extent that already
exists on disk. That is, even if the zeroing is performed on a delalloc
extent, did_zero will still return true.

So maybe state explicitly in the comment that this may result in
unnecessary flushing of folios if zeroing happened in
delayed-not-yet-allocated blocks?

Honza

Sure, I'll include it in the next iteration. :)

Thanks,
Yi.