Re: [PATCH v4 18/23] ext4: wait for ordered I/O in the iomap buffered I/O path

From: Jan Kara

Date: Thu Jun 25 2026 - 05:46:23 EST

On Mon 22-06-26 11:32:16, Zhang Yi wrote:
> On 6/18/2026 9:48 PM, Jan Kara wrote:
> > On Mon 11-05-26 15:23:38, Zhang Yi wrote:
> > > From: Zhang Yi <yi.zhang@xxxxxxxxxx>
> > >
> > > For append writes, wait for ordered I/O to complete before updating
> > > i_disksize. This ensures that zeroed data is flushed to disk before the
> > > metadata update, preventing stale data from being exposed during
> > > unaligned post-EOF append writes.
> > >
> > > Suggested-by: Jan Kara <jack@xxxxxxx>
> > > Signed-off-by: Zhang Yi <yi.zhang@xxxxxxxxxx>
> >
> > Frankly, this all looks too complex to me. Plus your are adding 32-bytes to
> > struct ext4_inode_info which isn't great either. Why don't you just do
> > filemap_fdatawait() for the byte at old i_disksize and be done with it?
> >
> > I believe we have to simplify this. All this complexity (and thus
> > maintenance burden) across several patches for the corner case of zeroing
> > tail block on extention is in my opinion difficult to justify.
> >
> > Honza
>
> Hi, Jan!
>
> Thanks for the review. I understand the concern about complexity and the
> 32-byte increase to ext4_inode_info. I tried using
> filemap_fdatawait_range() as you suggested, but found two issues where
> this solution doesn't work.
>
> 1. ioend worker deadlock
>
> Since worker concurrency resources are limited, we cannot wait for
> another ioend worker to complete within one ioend worker with the same
> work_struct. If the worker calls
> filemap_fdatawait_range(byte_at_old_disksize) to wait for the zeroed
> block's folio writeback to complete, it sleeps holding the only worker
> slot. If the folio contains blocks requiring extent conversion, its
> writeback bit is cleared by iomap_finish_ioends() running inside
> another worker -- which can only run after the current worker finishes
> its batch.
>
> Concretely:
> - Worker W1 processes ioend A, calls filemap_fdatawait_range() on
> the old EOF byte, sleeps.
> - The zeroed data is in ioend B. bio_endio defers it to
> i_iomap_ioend_list and calls queue_work().
> - queue_work() on i_iomap_ioend_work is idempotent: it returns false
> because the work is currently executing (even though sleeping).
> - ioend B sits in the list, never gets processed.
> - The folio writeback bit is only cleared by processing ioend B.
> - W1 sleeps forever -> deadlock.

Yes, good point. We cannot wait for folio writeback completion from end_io
processing for another folio.

> Therefore, I think we have to put the wakeup logic in
> ext4_iomap_end_bio() that runs in interrupt context without consuming
> a worker thread. The ordered range tracking and wait queue are what
> make that possible.
>
> 2. Truncate-up needs an accurate state query
>
> In the follow-up patch 19, ext4_set_inode_size() must make a precise
> decision when updating i_disksize during truncate up.
>
> This needs a state query: "is there ordered zero I/O in flight right
> now?" If yes, the i_disksize update is deferred to
> ext4_iomap_wb_update_disksize(is_ordered=true), which advances
> i_disksize to i_size when the ordered I/O completes. If no, we must
> advance i_disksize immediately, otherwise we will lose the updating
> forever.
>
> Therefore, we need to track the state of the ordered range. Simply
> using filemap_fdatawait_range() doesn't work. i_ordered_len serves as
> a maintained state flag that both the ioend completion path and the
> setattr path can read atomically without sleeping.
>
> Suggestions?

I see. Thanks for explanation. I went back to our discussion back from
February to remind myself about the constraints on the tail block zeroing
and the i_disksize update mechanism. And in the light of complexity of the
current mechanism, I think we've discarded the following possibility too
easily:

* On file extend / truncate up just zero tail folio in the page cache, mark
it dirty, keep i_disksize at old value, update i_size to the new value,
add inode to orphan list.
If the i_disksize was block aligned (and so we skip zeroing), we just
update i_disksize rightaway.

* In io end processing if the folio for which we end io has a block which
straddles i_disksize, we update i_disksize to current i_size. We defer
removing inode from orphan list e.g. to file close time (doing it from
end_io processing is problematic locking wise as we need i_rwsem for it).

This is a very simple scheme with very good performace. It makes sure stale
data in the tail block cannot be seen on disk after a crash.

Potential problems with this:

* We need to do i_disksize update from io end processing which means starting a
transaction. So this mechanism has to be restricted to buffered IO iomap
path due to locking constraints. If the inode uses old ordered mode, it
has to keep using it also for handling of the tail block zeroing.
I think that's acceptable as we want to transition everything to iomap
anyway so this duplicity will eventually go away.

* After a crash we can see i_disksize already updated but file content will
show zeros. This is not breaking any guarantees, it just changes how ext4
behaves after a crash. Again, I think this is acceptable tradeoff for the
simplicity.

What do you think? Any problems I have missed? I'm sorry for poking into
this mechanism again and again but I want to keep the code as simple as
possible to make our life easier in the future...

> Regarding the bloat of ext4_inode_info, perhaps we can drop the
> wait_queue_head_t (24 bytes) and use wait_var_event()/ wake_up_var()
> instead. Would this be acceptable?

Yes, that would be a good way to reduce the bloat if we still need this.

Honza
--
Jan Kara <jack@xxxxxxxx>
SUSE Labs, CR