Re: [PATCH v3] ext4: drop s_writepages_rwsem around inline data handling in writepages

From: Jan Kara

Date: Thu Jun 18 2026 - 10:24:27 EST


On Mon 15-06-26 14:10:15, Yun Zhou wrote:
> ext4_do_writepages() calls ext4_destroy_inline_data() which acquires
> xattr_sem while s_writepages_rwsem is held (read). This creates a
> circular lock dependency:
>
> CPU0 CPU1
> ---- ----
> ext4_writepages()
> ext4_writepages_down_read()
> [holds s_writepages_rwsem]
> ext4_evict_inode()
> __ext4_mark_inode_dirty()
> ext4_expand_extra_isize_ea()
> ext4_xattr_block_set()
> [holds xattr_sem]
> iput(old_bh inode)
> write_inode_now()
> ext4_writepages()
> ext4_writepages_down_read()
> [BLOCKED on s_writepages_rwsem]
> ext4_do_writepages()
> ext4_destroy_inline_data()
> down_write(xattr_sem)
> [BLOCKED on xattr_sem]

You have fixed this differently (not expanding extra isize from
ext4_evict_inode()) and furthermore this scenario is really impossible
because you cannot be inside ext4_writepages() on inode that's undergoing
eviction. SO let's discard this patch.

Honza

>
> Fix by temporarily dropping s_writepages_rwsem for the entire inline
> data handling block, including the journal handle start/stop. The
> rwsem must be dropped before ext4_journal_start() -- not between
> journal_start and journal_stop -- to avoid a secondary deadlock with
> ext4_change_inode_journal_flag() which takes rwsem (write) and then
> calls jbd2_journal_lock_updates() waiting for active handles to stop.
>
> This is safe because:
>
> - This code runs before any block mapping or IO submission, so no
> writepages state depends on the rwsem being held at this point.
>
> - Inline data destruction is a one-way format transition (once cleared,
> EXT4_INODE_INLINE_DATA is never set again). The rwsem is
> re-acquired after journal_stop, ensuring format stability for the
> remainder of writepages.
>
> - The can_map flag identifies the ext4_writepages() path (holds rwsem)
> vs ext4_normal_submit_inode_data_buffers() (does not), so the
> drop/reacquire is skipped when the rwsem is not held.
>
> Also check the return value of ext4_destroy_inline_data() to avoid
> proceeding with an inconsistent inode format on failure.
>
> Reported-by: syzbot+bb2455d02bda0b5701e3@xxxxxxxxxxxxxxxxxxxxxxxxx
> Closes: https://syzkaller.appspot.com/bug?extid=bb2455d02bda0b5701e3
> Fixes: c8585c6fcaf2 ("ext4: fix races between changing inode journal mode and ext4_writepages")
> Signed-off-by: Yun Zhou <yun.zhou@xxxxxxxxxxxxx>
> ---
> v3: Drop s_writepages_rwsem before ext4_journal_start() and reacquire
> after ext4_journal_stop(), instead of dropping between journal_start
> and journal_stop as in v2. This avoids two issues identified in v2
> review:
> - memalloc_nofs_restore() in ext4_writepages_up_read() would clear
> PF_MEMALLOC_NOFS while the jbd2 handle is active.
> - Reacquiring s_writepages_rwsem while holding a handle creates an
> ABBA deadlock with ext4_change_inode_journal_flag() which takes
> the rwsem (write) then calls jbd2_journal_lock_updates().
>
> v2: Instead of moving inline data handling to ext4_writepages(),
> temporarily drop s_writepages_rwsem around ext4_destroy_inline_data()
> in ext4_do_writepages(). The move approach had a race where concurrent
> writes could create dirty pages with inline data after the early check,
> and unconditional destruction without dirty pages would lose data.
>
> v1: Moved inline data cleanup from ext4_do_writepages() to
> ext4_writepages() before acquiring s_writepages_rwsem.
>
> fs/ext4/inode.c | 31 ++++++++++++++++++++++++++-----
> 1 file changed, 26 insertions(+), 5 deletions(-)
>
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index c2c2d6ac7f3d..cd7588a3fa45 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -1694,6 +1694,9 @@ struct mpage_da_data {
> struct writeback_control *wbc;
> unsigned int can_map:1; /* Can writepages call map blocks? */
>
> + /* Saved memalloc context from ext4_writepages_down_read() */
> + int alloc_ctx;
> +
> /* These are internal state of ext4_do_writepages() */
> loff_t start_pos; /* The start pos to write */
> loff_t next_pos; /* Current pos to examine */
> @@ -2816,16 +2819,35 @@ static int ext4_do_writepages(struct mpage_da_data *mpd)
> * we'd better clear the inline data here.
> */
> if (ext4_has_inline_data(inode)) {
> - /* Just inode will be modified... */
> + /*
> + * Temporarily drop s_writepages_rwsem because
> + * ext4_destroy_inline_data() acquires xattr_sem, which has
> + * a higher lock ordering rank. Holding both would create a
> + * circular dependency with ext4_xattr_block_set() -> iput()
> + * -> ext4_writepages() -> s_writepages_rwsem.
> + *
> + * Drop the rwsem before starting the journal handle to also
> + * avoid a deadlock with ext4_change_inode_journal_flag(),
> + * which takes rwsem (write) then jbd2_journal_lock_updates().
> + */
> + if (mpd->can_map)
> + ext4_writepages_up_read(inode->i_sb, mpd->alloc_ctx);
> handle = ext4_journal_start(inode, EXT4_HT_INODE, 1);
> if (IS_ERR(handle)) {
> + if (mpd->can_map)
> + mpd->alloc_ctx =
> + ext4_writepages_down_read(inode->i_sb);
> ret = PTR_ERR(handle);
> goto out_writepages;
> }
> BUG_ON(ext4_test_inode_state(inode,
> EXT4_STATE_MAY_INLINE_DATA));
> - ext4_destroy_inline_data(handle, inode);
> + ret = ext4_destroy_inline_data(handle, inode);
> ext4_journal_stop(handle);
> + if (mpd->can_map)
> + mpd->alloc_ctx = ext4_writepages_down_read(inode->i_sb);
> + if (ret)
> + goto out_writepages;
> }
>
> /*
> @@ -3032,13 +3054,12 @@ static int ext4_writepages(struct address_space *mapping,
> .can_map = 1,
> };
> int ret;
> - int alloc_ctx;
>
> ret = ext4_emergency_state(sb);
> if (unlikely(ret))
> return ret;
>
> - alloc_ctx = ext4_writepages_down_read(sb);
> + mpd.alloc_ctx = ext4_writepages_down_read(sb);
> ret = ext4_do_writepages(&mpd);
> /*
> * For data=journal writeback we could have come across pages marked
> @@ -3047,7 +3068,7 @@ static int ext4_writepages(struct address_space *mapping,
> */
> if (!ret && mpd.journalled_more_data)
> ret = ext4_do_writepages(&mpd);
> - ext4_writepages_up_read(sb, alloc_ctx);
> + ext4_writepages_up_read(sb, mpd.alloc_ctx);
>
> return ret;
> }
> --
> 2.43.0
>
--
Jan Kara <jack@xxxxxxxx>
SUSE Labs, CR