Re: [PATCH v4 18/23] ext4: wait for ordered I/O in the iomap buffered I/O path

From: Zhang Yi

Date: Mon Jun 01 2026 - 23:31:12 EST

On 6/2/2026 2:33 AM, Ojaswin Mujoo wrote:

On Sat, May 30, 2026 at 04:24:24PM +0800, Zhang Yi wrote:

On 5/30/2026 3:22 PM, Zhang Yi wrote:

Hi, Ojaswin!

On 5/27/2026 11:58 PM, Ojaswin Mujoo wrote:

On Mon, May 11, 2026 at 03:23:38PM +0800, Zhang Yi wrote:

From: Zhang Yi <yi.zhang@xxxxxxxxxx>

For append writes, wait for ordered I/O to complete before updating
i_disksize. This ensures that zeroed data is flushed to disk before the
metadata update, preventing stale data from being exposed during
unaligned post-EOF append writes.

Suggested-by: Jan Kara <jack@xxxxxxx>
Signed-off-by: Zhang Yi <yi.zhang@xxxxxxxxxx>
---
fs/ext4/ext4.h | 11 +++++++
fs/ext4/inode.c | 80 ++++++++++++++++++++++++++++++++++++++++++-----
fs/ext4/page-io.c | 60 +++++++++++++++++++++++++++++++++++
fs/ext4/super.c | 23 ++++++++++----
4 files changed, 161 insertions(+), 13 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 078feda47e36..9ce2128eea3e 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -1195,6 +1195,15 @@ struct ext4_inode_info {
#ifdef CONFIG_FS_ENCRYPTION
struct fscrypt_inode_info *i_crypt_info;
#endif
+
+ /*
+ * Track ordered zeroed data during post-EOF append writes, fallocate,
+ * and truncate-up operations. These parameters are used only in the
+ * iomap buffered I/O path.
+ */
+ ext4_lblk_t i_ordered_lblk;
+ ext4_lblk_t i_ordered_len;
+ wait_queue_head_t i_ordered_wq;
};
/*
@@ -3858,6 +3867,8 @@ extern int ext4_move_extents(struct file *o_filp, struct file *d_filp,
__u64 len, __u64 *moved_len);
/* page-io.c */
+#define EXT4_IOMAP_IOEND_ORDER_IO 1UL /* This I/O is an ordered one */
+
extern int __init ext4_init_pageio(void);
extern void ext4_exit_pageio(void);
extern ext4_io_end_t *ext4_init_io_end(struct inode *inode, gfp_t flags);
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index e013aeb03d7b..11fb369efeb1 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -4345,6 +4345,7 @@ static int ext4_iomap_writeback_submit(struct iomap_writepage_ctx *wpc,
{
struct iomap_ioend *ioend = wpc->wb_ctx;
struct ext4_inode_info *ei = EXT4_I(ioend->io_inode);
+ ext4_lblk_t start, end, order_lblk, order_len;
/*
* After I/O completion, a worker needs to be scheduled when:
@@ -4357,6 +4358,30 @@ static int ext4_iomap_writeback_submit(struct iomap_writepage_ctx *wpc,
test_opt(ioend->io_inode->i_sb, DATA_ERR_ABORT))
ioend->io_bio.bi_end_io = ext4_iomap_end_bio;
+ /*
+ * Mark the I/O as ordered. Ordered I/O requires separate endio
+ * handling and must not be merged with regular I/O operations.
+ */
+ order_len = READ_ONCE(ei->i_ordered_len);
+ if (order_len) {
+ /*
+ * Pair with smp_store_release() in ext4_block_zero_eof().
+ * Ensure we see the updated i_ordered_lblk that was written
+ * before the release store to i_ordered_len.
+ */
+ smp_rmb();
+ order_lblk = READ_ONCE(ei->i_ordered_lblk);
+ start = ioend->io_offset >> ioend->io_inode->i_blkbits;
+ end = EXT4_B_TO_LBLK(ioend->io_inode,
+ ioend->io_offset + ioend->io_size);
+
+ if (start <= order_lblk && end >= order_lblk + order_len) {

Hi Zhang,

I guess this check is enough cause ordered_lblk and ordered_len will
always be contained in a single block.

Yeah.

+ ioend->io_bio.bi_end_io = ext4_iomap_end_bio;
+ ioend->io_private = (void *)EXT4_IOMAP_IOEND_ORDER_IO;
+ ioend->io_flags |= IOMAP_IOEND_BOUNDARY;

FWIU, we are wanting the ordered IO to not be merged and submitted asap
since we want to wake up the waiters. Is there any other reason?

My original intention was to prevent the loss of the
EXT4_IOMAP_IOEND_ORDER_IO flag during worker processing triggered by I/O
completion, which could be caused by merging an ordered ioend with a
normal ioend. In patch 19, we need to determine the flag to update
i_disksize to the correct position.

Ahh okay, we don't want the flag to be lost.

Adding the boundary in ->writeback_submit() only affects
iomap_ioend_can_merge() which happens after we have woken up the waiters
and deferred the IO to the wq. We ideally want it affect
iomap_can_add_to_ioend() ie we need to add IOMAP_F_BOUNDARY in
->writeback_range().

IIUC, merging into the same ioend during the submission stage doesn't
seem to cause any problems.

Got it since the flag is set later. I was thinking we want to quickly
issue the ordered IO to wake up the waiters and not waste time trying to
merge it and hence we wanted to use that flag.

Secondly, I don't think boundary is the right flag here. It ensures
that everything before the ordered iomap gets submitted and the ordered
iomap starts a new ioend. This can still keep getting merged with the
newer ioends untils we decide to submit the IO, which can delay waking
up the waiters. If we really want the "no merge" behavior, we'll have to
do something like [1] (Check the 2 NOMERGE flag patches).

Yeah, IOMAP_IOEND_BOUNDARY appears to be just a one-way barrier and
still cannot prevent merging. I missed this, thank you for pointing this
out. However, I think perhaps we should change iomap_ioend_can_merge()
to check the iomap_ioend->io_private. Something like:

if (ioend->io_private || next->io_private)

^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ioend->io_private != next->io_private

I guess if the purpose is to just not lose the flag, then boundary seems
to work for because we only lose the flag if ordered ioend backward
merges to a prev one. Flag is retained if we forward merge. Which
boundary seems to take care of.

Yes, IOMAP_IOEND_BOUNDARY is indeed worked currently as it prevents flag
loss. However, from the perspective of the iomap infrastructure, I
believe it is still necessary to add the ioend->io_private !=
next->io_private check. Because ioends with different io_private values
should not be merged, as this carries the risk of potentially losing
io_private or even memory leaks. With this check in iomap, we would no
longer need IOMAP_IOEND_BOUNDARY.

However, if we want to avoid merges so we can quickly issue IO and wake
up the waiters then the above change looks good. Also, if this is the
reason we'd also want to have this during submission stage so the flag
setting will probs have to move to ->wirteback_range()

Yes. Issuing ordered I/O as soon as possible is beneficial as it reduces
the latency of sync file range. Suppose when we are syncing data beyond
the ordered range, the background writeback process has already started
committing and bundled the ordered range into a large ioend (up to
IOEND_BATCH_SIZE folios), then this sync operation will indeed
experience significant latency. However, for other non-sync scenarios,
there should be little benefit.

But I'm not sure if this is strictly necessary, because in the existing
implementation, issuing ordered I/O via data=ordered mode works the same
way — it also doesn't issue ordered I/O as soon as possible, and still
has to wait when encountering concurrent background writeback. So I
think we can keep the current implementation for now and see user
feedback to decide whether further optimization is needed.

Cheers,
Yi.