Re: [PATCH v4 18/23] ext4: wait for ordered I/O in the iomap buffered I/O path
From: Zhang Yi
Date: Sat May 30 2026 - 04:24:53 EST
On 5/30/2026 3:22 PM, Zhang Yi wrote:
> Hi, Ojaswin!
>
> On 5/27/2026 11:58 PM, Ojaswin Mujoo wrote:
>> On Mon, May 11, 2026 at 03:23:38PM +0800, Zhang Yi wrote:
>>> From: Zhang Yi <yi.zhang@xxxxxxxxxx>
>>>
>>> For append writes, wait for ordered I/O to complete before updating
>>> i_disksize. This ensures that zeroed data is flushed to disk before the
>>> metadata update, preventing stale data from being exposed during
>>> unaligned post-EOF append writes.
>>>
>>> Suggested-by: Jan Kara <jack@xxxxxxx>
>>> Signed-off-by: Zhang Yi <yi.zhang@xxxxxxxxxx>
>>> ---
>>> fs/ext4/ext4.h | 11 +++++++
>>> fs/ext4/inode.c | 80 ++++++++++++++++++++++++++++++++++++++++++-----
>>> fs/ext4/page-io.c | 60 +++++++++++++++++++++++++++++++++++
>>> fs/ext4/super.c | 23 ++++++++++----
>>> 4 files changed, 161 insertions(+), 13 deletions(-)
>>>
>>> diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
>>> index 078feda47e36..9ce2128eea3e 100644
>>> --- a/fs/ext4/ext4.h
>>> +++ b/fs/ext4/ext4.h
>>> @@ -1195,6 +1195,15 @@ struct ext4_inode_info {
>>> #ifdef CONFIG_FS_ENCRYPTION
>>> struct fscrypt_inode_info *i_crypt_info;
>>> #endif
>>> +
>>> + /*
>>> + * Track ordered zeroed data during post-EOF append writes, fallocate,
>>> + * and truncate-up operations. These parameters are used only in the
>>> + * iomap buffered I/O path.
>>> + */
>>> + ext4_lblk_t i_ordered_lblk;
>>> + ext4_lblk_t i_ordered_len;
>>> + wait_queue_head_t i_ordered_wq;
>>> };
>>>
>>> /*
>>> @@ -3858,6 +3867,8 @@ extern int ext4_move_extents(struct file *o_filp, struct file *d_filp,
>>> __u64 len, __u64 *moved_len);
>>>
>>> /* page-io.c */
>>> +#define EXT4_IOMAP_IOEND_ORDER_IO 1UL /* This I/O is an ordered one */
>>> +
>>> extern int __init ext4_init_pageio(void);
>>> extern void ext4_exit_pageio(void);
>>> extern ext4_io_end_t *ext4_init_io_end(struct inode *inode, gfp_t flags);
>>> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
>>> index e013aeb03d7b..11fb369efeb1 100644
>>> --- a/fs/ext4/inode.c
>>> +++ b/fs/ext4/inode.c
>>> @@ -4345,6 +4345,7 @@ static int ext4_iomap_writeback_submit(struct iomap_writepage_ctx *wpc,
>>> {
>>> struct iomap_ioend *ioend = wpc->wb_ctx;
>>> struct ext4_inode_info *ei = EXT4_I(ioend->io_inode);
>>> + ext4_lblk_t start, end, order_lblk, order_len;
>>>
>>> /*
>>> * After I/O completion, a worker needs to be scheduled when:
>>> @@ -4357,6 +4358,30 @@ static int ext4_iomap_writeback_submit(struct iomap_writepage_ctx *wpc,
>>> test_opt(ioend->io_inode->i_sb, DATA_ERR_ABORT))
>>> ioend->io_bio.bi_end_io = ext4_iomap_end_bio;
>>>
>>> + /*
>>> + * Mark the I/O as ordered. Ordered I/O requires separate endio
>>> + * handling and must not be merged with regular I/O operations.
>>> + */
>>> + order_len = READ_ONCE(ei->i_ordered_len);
>>> + if (order_len) {
>>> + /*
>>> + * Pair with smp_store_release() in ext4_block_zero_eof().
>>> + * Ensure we see the updated i_ordered_lblk that was written
>>> + * before the release store to i_ordered_len.
>>> + */
>>> + smp_rmb();
>>> + order_lblk = READ_ONCE(ei->i_ordered_lblk);
>>> + start = ioend->io_offset >> ioend->io_inode->i_blkbits;
>>> + end = EXT4_B_TO_LBLK(ioend->io_inode,
>>> + ioend->io_offset + ioend->io_size);
>>> +
>>> + if (start <= order_lblk && end >= order_lblk + order_len) {
>>
>> Hi Zhang,
>>
>> I guess this check is enough cause ordered_lblk and ordered_len will
>> always be contained in a single block.
>
> Yeah.
>
>>
>>> + ioend->io_bio.bi_end_io = ext4_iomap_end_bio;
>>> + ioend->io_private = (void *)EXT4_IOMAP_IOEND_ORDER_IO;
>>> + ioend->io_flags |= IOMAP_IOEND_BOUNDARY;
>>
>> FWIU, we are wanting the ordered IO to not be merged and submitted asap
>> since we want to wake up the waiters. Is there any other reason?
>
> My original intention was to prevent the loss of the
> EXT4_IOMAP_IOEND_ORDER_IO flag during worker processing triggered by I/O
> completion, which could be caused by merging an ordered ioend with a
> normal ioend. In patch 19, we need to determine the flag to update
> i_disksize to the correct position.
>
>>
>> Adding the boundary in ->writeback_submit() only affects
>> iomap_ioend_can_merge() which happens after we have woken up the waiters
>> and deferred the IO to the wq. We ideally want it affect
>> iomap_can_add_to_ioend() ie we need to add IOMAP_F_BOUNDARY in
>> ->writeback_range().
>
> IIUC, merging into the same ioend during the submission stage doesn't
> seem to cause any problems.
>
>>
>> Secondly, I don't think boundary is the right flag here. It ensures
>> that everything before the ordered iomap gets submitted and the ordered
>> iomap starts a new ioend. This can still keep getting merged with the
>> newer ioends untils we decide to submit the IO, which can delay waking
>> up the waiters. If we really want the "no merge" behavior, we'll have to
>> do something like [1] (Check the 2 NOMERGE flag patches).
>
> Yeah, IOMAP_IOEND_BOUNDARY appears to be just a one-way barrier and
> still cannot prevent merging. I missed this, thank you for pointing this
> out. However, I think perhaps we should change iomap_ioend_can_merge()
> to check the iomap_ioend->io_private. Something like:
>
> if (ioend->io_private || next->io_private)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ioend->io_private != next->io_private
> return false;
>
> What do you think?
>
> Thanks,
> Yi.