Re: [PATCH v4 18/23] ext4: wait for ordered I/O in the iomap buffered I/O path

From: Ojaswin Mujoo

Date: Mon Jun 01 2026 - 14:37:57 EST


On Sat, May 30, 2026 at 04:24:24PM +0800, Zhang Yi wrote:
> On 5/30/2026 3:22 PM, Zhang Yi wrote:
> > Hi, Ojaswin!
> >
> > On 5/27/2026 11:58 PM, Ojaswin Mujoo wrote:
> >> On Mon, May 11, 2026 at 03:23:38PM +0800, Zhang Yi wrote:
> >>> From: Zhang Yi <yi.zhang@xxxxxxxxxx>
> >>>
> >>> For append writes, wait for ordered I/O to complete before updating
> >>> i_disksize. This ensures that zeroed data is flushed to disk before the
> >>> metadata update, preventing stale data from being exposed during
> >>> unaligned post-EOF append writes.
> >>>
> >>> Suggested-by: Jan Kara <jack@xxxxxxx>
> >>> Signed-off-by: Zhang Yi <yi.zhang@xxxxxxxxxx>
> >>> ---
> >>> fs/ext4/ext4.h | 11 +++++++
> >>> fs/ext4/inode.c | 80 ++++++++++++++++++++++++++++++++++++++++++-----
> >>> fs/ext4/page-io.c | 60 +++++++++++++++++++++++++++++++++++
> >>> fs/ext4/super.c | 23 ++++++++++----
> >>> 4 files changed, 161 insertions(+), 13 deletions(-)
> >>>
> >>> diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
> >>> index 078feda47e36..9ce2128eea3e 100644
> >>> --- a/fs/ext4/ext4.h
> >>> +++ b/fs/ext4/ext4.h
> >>> @@ -1195,6 +1195,15 @@ struct ext4_inode_info {
> >>> #ifdef CONFIG_FS_ENCRYPTION
> >>> struct fscrypt_inode_info *i_crypt_info;
> >>> #endif
> >>> +
> >>> + /*
> >>> + * Track ordered zeroed data during post-EOF append writes, fallocate,
> >>> + * and truncate-up operations. These parameters are used only in the
> >>> + * iomap buffered I/O path.
> >>> + */
> >>> + ext4_lblk_t i_ordered_lblk;
> >>> + ext4_lblk_t i_ordered_len;
> >>> + wait_queue_head_t i_ordered_wq;
> >>> };
> >>>
> >>> /*
> >>> @@ -3858,6 +3867,8 @@ extern int ext4_move_extents(struct file *o_filp, struct file *d_filp,
> >>> __u64 len, __u64 *moved_len);
> >>>
> >>> /* page-io.c */
> >>> +#define EXT4_IOMAP_IOEND_ORDER_IO 1UL /* This I/O is an ordered one */
> >>> +
> >>> extern int __init ext4_init_pageio(void);
> >>> extern void ext4_exit_pageio(void);
> >>> extern ext4_io_end_t *ext4_init_io_end(struct inode *inode, gfp_t flags);
> >>> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> >>> index e013aeb03d7b..11fb369efeb1 100644
> >>> --- a/fs/ext4/inode.c
> >>> +++ b/fs/ext4/inode.c
> >>> @@ -4345,6 +4345,7 @@ static int ext4_iomap_writeback_submit(struct iomap_writepage_ctx *wpc,
> >>> {
> >>> struct iomap_ioend *ioend = wpc->wb_ctx;
> >>> struct ext4_inode_info *ei = EXT4_I(ioend->io_inode);
> >>> + ext4_lblk_t start, end, order_lblk, order_len;
> >>>
> >>> /*
> >>> * After I/O completion, a worker needs to be scheduled when:
> >>> @@ -4357,6 +4358,30 @@ static int ext4_iomap_writeback_submit(struct iomap_writepage_ctx *wpc,
> >>> test_opt(ioend->io_inode->i_sb, DATA_ERR_ABORT))
> >>> ioend->io_bio.bi_end_io = ext4_iomap_end_bio;
> >>>
> >>> + /*
> >>> + * Mark the I/O as ordered. Ordered I/O requires separate endio
> >>> + * handling and must not be merged with regular I/O operations.
> >>> + */
> >>> + order_len = READ_ONCE(ei->i_ordered_len);
> >>> + if (order_len) {
> >>> + /*
> >>> + * Pair with smp_store_release() in ext4_block_zero_eof().
> >>> + * Ensure we see the updated i_ordered_lblk that was written
> >>> + * before the release store to i_ordered_len.
> >>> + */
> >>> + smp_rmb();
> >>> + order_lblk = READ_ONCE(ei->i_ordered_lblk);
> >>> + start = ioend->io_offset >> ioend->io_inode->i_blkbits;
> >>> + end = EXT4_B_TO_LBLK(ioend->io_inode,
> >>> + ioend->io_offset + ioend->io_size);
> >>> +
> >>> + if (start <= order_lblk && end >= order_lblk + order_len) {
> >>
> >> Hi Zhang,
> >>
> >> I guess this check is enough cause ordered_lblk and ordered_len will
> >> always be contained in a single block.
> >
> > Yeah.
> >
> >>
> >>> + ioend->io_bio.bi_end_io = ext4_iomap_end_bio;
> >>> + ioend->io_private = (void *)EXT4_IOMAP_IOEND_ORDER_IO;
> >>> + ioend->io_flags |= IOMAP_IOEND_BOUNDARY;
> >>
> >> FWIU, we are wanting the ordered IO to not be merged and submitted asap
> >> since we want to wake up the waiters. Is there any other reason?
> >
> > My original intention was to prevent the loss of the
> > EXT4_IOMAP_IOEND_ORDER_IO flag during worker processing triggered by I/O
> > completion, which could be caused by merging an ordered ioend with a
> > normal ioend. In patch 19, we need to determine the flag to update
> > i_disksize to the correct position.

Ahh okay, we don't want the flag to be lost.

> >
> >>
> >> Adding the boundary in ->writeback_submit() only affects
> >> iomap_ioend_can_merge() which happens after we have woken up the waiters
> >> and deferred the IO to the wq. We ideally want it affect
> >> iomap_can_add_to_ioend() ie we need to add IOMAP_F_BOUNDARY in
> >> ->writeback_range().
> >
> > IIUC, merging into the same ioend during the submission stage doesn't
> > seem to cause any problems.

Got it since the flag is set later. I was thinking we want to quickly
issue the ordered IO to wake up the waiters and not waste time trying to
merge it and hence we wanted to use that flag.

> >
> >>
> >> Secondly, I don't think boundary is the right flag here. It ensures
> >> that everything before the ordered iomap gets submitted and the ordered
> >> iomap starts a new ioend. This can still keep getting merged with the
> >> newer ioends untils we decide to submit the IO, which can delay waking
> >> up the waiters. If we really want the "no merge" behavior, we'll have to
> >> do something like [1] (Check the 2 NOMERGE flag patches).
> >
> > Yeah, IOMAP_IOEND_BOUNDARY appears to be just a one-way barrier and
> > still cannot prevent merging. I missed this, thank you for pointing this
> > out. However, I think perhaps we should change iomap_ioend_can_merge()
> > to check the iomap_ioend->io_private. Something like:
> >
> > if (ioend->io_private || next->io_private)
> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> ioend->io_private != next->io_private

I guess if the purpose is to just not lose the flag, then boundary seems
to work for because we only lose the flag if ordered ioend backward
merges to a prev one. Flag is retained if we forward merge. Which
boundary seems to take care of.

However, if we want to avoid merges so we can quickly issue IO and wake
up the waiters then the above change looks good. Also, if this is the
reason we'd also want to have this during submission stage so the flag
setting will probs have to move to ->wirteback_range()

Regards,
Ojaswin


>
>
> > return false;
> >
> > What do you think?
> >
> > Thanks,
> > Yi.
>