Re: [PATCH v4 07/23] ext4: do not use data=ordered mode for inodes using buffered iomap path

From: Ojaswin Mujoo

Date: Wed May 20 2026 - 09:19:48 EST

On Wed, May 20, 2026 at 04:18:48PM +0800, Zhang Yi wrote:
> On 5/19/2026 9:31 PM, Ojaswin Mujoo wrote:
> > On Tue, May 19, 2026 at 04:11:30PM +0530, Ojaswin Mujoo wrote:
> >> On Mon, May 11, 2026 at 03:23:27PM +0800, Zhang Yi wrote:
> >>> From: Zhang Yi <yi.zhang@xxxxxxxxxx>
> >>>
> >>> The data=ordered mode introduces two fundamental conflicts with the
> >>> iomap buffered write path, leading to potential deadlocks.
> >>>
> >>> 1) Lock ordering conflict
> >>> In the iomap writeback path, each folio is processed sequentially:
> >>> the folio lock is acquired first, followed by starting a transaction
> >>> to create block mappings. In data=ordered mode, writeback triggered
> >>> by the journal commit process may attempt to acquire a folio lock
> >>> that is already held by iomap. Meanwhile, iomap, under that same
> >>> folio lock, may start a new transaction and wait for the currently
> >>> committing transaction to finish, resulting in a deadlock.
> >>
> >> Right, makes sense.
> >>>
> >>> 2) Partial folio submission not supported
> >>> When block size is smaller than folio size, a folio may contain both
> >>> mapped and unmapped blocks. In data=ordered mode, if the journal
> >>> waits for such a folio to be written back while the regular writeback
> >>> process has already started committing it (with the writeback flag
> >>> set), mapping the remaining unmapped blocks can deadlock. This is
> >>> because the writeback flag is cleared only after the entire folio is
> >>> processed and committed.
> >>
> >> Okay so IIUC, if we do end up using iomap with ordered data, there are 2
> >> codepaths with issues here:
> >>
> >> txn_commit
> >> ordered data writeback (say it goes via iomap)
> >> folio_lock
> >> iomap_writeback_folio
> >> folio_start_writeback
> >> iomap_writeback_range
> >> ext4_map_block
> >> txn_start
> >> wait for tnx commit - DEADLOCK
> >>
> >> Currently we avoid this by having ext4_normal_submit_inode_buffers()
> >> pass can_map = 0 so journal flush makese sure not to start any txn.
> >>
> >> Then we have
> >>
> >> txn_commit background writeback (via iomap)
> >>
> >> folio_lock()
> >> ordered data writeback
> >> folio_lock
> >>
> >> iomap_writeback_folio
> >> folio_start_writeback
> >> iomap_writeback_range
> >> ext4_map_block
> >> txn_start
> >> wait for txn commit - DEADLOCK
> >
> > Sorry I forget to remove tabs
> >
> > this is what I meant:
> >
> > txn_commit
> > ordered data writeback (say it goes via iomap)
> > folio_lock
> > iomap_writeback_folio
> > folio_start_writeback
> > iomap_writeback_range
> > ext4_map_block
> > txn_start
> > wait for tnx commit - DEADLOCK
> >
> > Currently we avoid this by having ext4_normal_submit_inode_buffers()
> > pass can_map = 0 so journal flush makese sure not to start any txn.
>
> Yeah, but we can also solve this problem by adding similar tags. This
> is not the most difficult part.
>
> >
> > Then we have
> >
> > txn_commit background writeback (via iomap)
> >
> > folio_lock()
> > ordered data writeback
> > folio_lock
> >
> > iomap_writeback_folio
> > folio_start_writeback
> > iomap_writeback_range
> > ext4_map_block
> > txn_start
> > wait for txn commit - DEADLOCK
> >
> >
> >>
> >> Currently, this is taken care because we try to start the txn before
> >> taking any folio locks/starting writeback, and hence we cannot deadlock.
>
> Yeah. You are right! Actually, this deadlock scenario should essentially
> belong to the first category: "Lock ordering conflict". This is not the
> scenario I want to describe here. The problematic scenario is as
> follows:
>
> T0: Assume we have a folio contains four blocks, from front to back,
> they are A, B, C, D. The last block D is written in delalloc mode
> (the block is not allocated yet).
>
> T1: The writeback process starts to write back data, set writeback flag
> on the folio, allocates block D, and adds it to transaction N's
> order list of jbd2 in JI_WAIT_DATA mode.
>
> T2: This folio completes the writeback and clears the writeback flag.
>
> T3: Before transaction N commit, we punch block B and C, and overwrite
> A-C,
>
> T4: Transaction N commit and folio writeback are running concurrently.
>
> Transaction N commit folio writeback(iomap)
>
> iomap_writeback_folio()
> folio_start_writeback() -- set writeback
>
> jbd2_journal_finish_inode_data_buffers()
> __filemap_fdatawait_range()
> -- wait writeback flag to clear
> iomap_writeback_range()
> ext4_map_block() -- map block B and C
> start handle
> wait for transaction N commit
> - DEADLOCK
>
> IOMAP does not support submitting partial folios during writeback.
> Therefore, the writeback flag is cleared only after the entire folio
> has been submitted. As a result, the commit of transaction N would never
> wait for this flag to be cleared if we need to map some blocks in this
> folio.

Hey Zhang,

thanks for the traces, it makes it so much clearer.

>
> Currently, this is handled by ext4_bio_write_folio(), which supports
> writing back partial folios. The writeback flag is only set after the
> block has been mapped and before the bio is actually issued. There are
> no other limitations that would block this flag from being cleared
> after the I/O is completed.

So IIUC the issue with iomap is that we set the writeback before
we start a txn handle and this way a txn handle can wait for a commit
which is waiting for writeback on the same folio to be cleared.

ext4_do_writepages currently starts the txn first, allocates and then
sets writeback and starts submitting buffers. This sequence ensures that
we dont get into the above deadlock because we never wait for commit
with a folio lock/writeback held.

Thanks,
Ojaswin
>
> >>
> >> If the above description makes sense, do you think it'd be good to add
> >> them to the commit message. The reason is that although these paths seem
> >> obvious when we look at them a lot, it took me a good bit of time to
> >> understand what deadlocks you are talking about here :p
> >>
> >> Having the code traces like above makes it very clear.
>
> Indeed, these problematic cases are complicated and subtle. I also spent
> some time recalling this scene. I can add these code traces in my next
> iteration.
>
> Thanks,
> Yi.
>
> >>>
> >>> To support data=ordered mode, the iomap core would need two invasive
> >>> changes:
> >>> - Acquire the transaction handle before locking any folio for
> >>> writeback.
> >>> - Support partial folio submission.
> >>>
> >>> Both changes are complicated and risk performance regressions.
> >>> Therefore, we must avoid using data=ordered mode when converting to the
> >>> iomap path.
> >>>
> >>> Currently, data=ordered mode is used in three scenarios:
> >>> - Append write
> >>> - Post-EOF partial block truncate-up followed by append write
> >>> - Online defragmentation
> >>>
> >>> We can address the first two without data=ordered mode:
> >>> - For append write: always allocate unwritten blocks (i.e. always
> >>> enable dioread_nolock), preserving the behavior of current
> >>> extent-type inodes.
> >>> - For post-EOF truncate-up + append write: postpone updating i_disksize
> >>> until after the zeroed partial block has been written back.
> >>
> >> I'm still going through how we are addressing no data=ordered so will
> >> get back on this in some time.
> >>
> >> Thanks,
> >> Ojaswin
> >>
> >>>
> >>> Online defragmentation does not yet support iomap; this can be resolved
> >>> separately in the future.
> >>>
> >>> Signed-off-by: Zhang Yi <yi.zhang@xxxxxxxxxx>
> >>> ---
> >>> fs/ext4/ext4_jbd2.h | 7 ++++++-
> >>> 1 file changed, 6 insertions(+), 1 deletion(-)
> >>>
> >>> diff --git a/fs/ext4/ext4_jbd2.h b/fs/ext4/ext4_jbd2.h
> >>> index 63d17c5201b5..26999f173870 100644
> >>> --- a/fs/ext4/ext4_jbd2.h
> >>> +++ b/fs/ext4/ext4_jbd2.h
> >>> @@ -383,7 +383,12 @@ static inline int ext4_should_journal_data(struct inode *inode)
> >>>
> >>> static inline int ext4_should_order_data(struct inode *inode)
> >>> {
> >>> - return ext4_inode_journal_mode(inode) & EXT4_INODE_ORDERED_DATA_MODE;
> >>> + /*
> >>> + * inodes using the iomap buffered I/O path do not use the
> >>> + * data=ordered mode.
> >>> + */
> >>> + return !ext4_inode_buffered_iomap(inode) &&
> >>> + (ext4_inode_journal_mode(inode) & EXT4_INODE_ORDERED_DATA_MODE);
> >>> }
> >>>
> >>> static inline int ext4_should_writeback_data(struct inode *inode)
> >>> --
> >>> 2.52.0
> >>>
>