Re: [PATCH v2 2/3] ext4: avoid journaling sb update on error if journal is destroying
From: Zhang Yi
Date: Fri Mar 07 2025 - 03:43:48 EST
On 2025/3/7 16:13, Ojaswin Mujoo wrote:
> On Fri, Mar 07, 2025 at 12:04:26PM +0530, Ojaswin Mujoo wrote:
>> On Fri, Mar 07, 2025 at 10:49:28AM +0800, Zhang Yi wrote:
>>> On 2025/3/6 22:28, Ojaswin Mujoo wrote:
>>>> Presently we always BUG_ON if trying to start a transaction on a journal marked
>>>> with JBD2_UNMOUNT, since this should never happen. However, while running LTP
>>>> stress tests, it was observed that in some error handling paths it is possible
>>>> for update_super_work to start a transaction after the journal is destroyed,
>>>> eg:
>>>>
>>>> (umount)
>>>> ext4_kill_sb
>>>>   kill_block_super
>>>>     generic_shutdown_super
>>>>       sync_filesystem /* commits all txns */
>>>>       evict_inodes /* might start a new txn */
>>>>       ext4_put_super
>>>>         flush_work(&sbi->s_sb_upd_work) /* flush the workqueue */
>>>>         jbd2_journal_destroy
>>>>           journal_kill_thread
>>>>             journal->j_flags |= JBD2_UNMOUNT;
>>>>           jbd2_journal_commit_transaction
>>>>             jbd2_journal_get_descriptor_buffer
>>>>               jbd2_journal_bmap
>>>>                 ext4_journal_bmap
>>>>                   ext4_map_blocks
>>>>                     ...
>>>>                     ext4_inode_error
>>>>                       ext4_handle_error
>>>>                         schedule_work(&sbi->s_sb_upd_work)
>>>>
>>>> /* work queue kicks in */
>>>> update_super_work
>>>>   jbd2_journal_start
>>>>     start_this_handle
>>>>       BUG_ON(journal->j_flags & JBD2_UNMOUNT)
>>>>
>>>> Hence, introduce a new sbi flag s_journal_destroying to indicate the journal is
>>>> being destroyed, and only do a journaled (and deferred) update of the sb if this
>>>> flag is not set. Otherwise, just fall back to an un-journaled commit.
>>>>
>>>> We set sbi->s_journal_destroying = true only after all the FS updates are done
>>>> during ext4_put_super() (except a running transaction that will get committed
>>>> during jbd2_journal_destroy()). After this point, it is safe to commit the sb
>>>> outside the journal as it won't race with a journaled update (refer
>>>> 2d01ddc86606).
>>>>
>>>> Also, we don't need a similar check in ext4_grp_locked_error since it is only
>>>> called from mballoc and AFAICT it would always be valid to schedule work here.
>>>>
>>>> Fixes: 2d01ddc86606 ("ext4: save error info to sb through journal if available")
>>>> Reported-by: Mahesh Kumar <maheshkumar657g@xxxxxxxxx>
>>>> Suggested-by: Jan Kara <jack@xxxxxxx>
>>>> Signed-off-by: Ojaswin Mujoo <ojaswin@xxxxxxxxxxxxx>
>>>> ---
>>>> fs/ext4/ext4.h | 2 ++
>>>> fs/ext4/ext4_jbd2.h | 8 ++++++++
>>>> fs/ext4/super.c | 4 +++-
>>>> 3 files changed, 13 insertions(+), 1 deletion(-)
>>>>
>>>> diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
>>>> index 2b7d781bfcad..d48e93bd5690 100644
>>>> --- a/fs/ext4/ext4.h
>>>> +++ b/fs/ext4/ext4.h
>>>> @@ -1728,6 +1728,8 @@ struct ext4_sb_info {
>>>> */
>>>> struct work_struct s_sb_upd_work;
>>>>
>>>> + bool s_journal_destorying;
>>>> +
>>>> /* Atomic write unit values in bytes */
>>>> unsigned int s_awu_min;
>>>> unsigned int s_awu_max;
>>>> diff --git a/fs/ext4/ext4_jbd2.h b/fs/ext4/ext4_jbd2.h
>>>> index 9b3c9df02a39..6bd3ca84410d 100644
>>>> --- a/fs/ext4/ext4_jbd2.h
>>>> +++ b/fs/ext4/ext4_jbd2.h
>>>> @@ -437,6 +437,14 @@ static inline int ext4_journal_destroy(struct ext4_sb_info *sbi, journal_t *jour
>>>> {
>>>> int err = 0;
>>>>
>>>> + /*
>>>> + * At this point all pending FS updates should be done except a possible
>>>> + * running transaction (which will commit in jbd2_journal_destroy). It
>>>> + * is now safe for any new errors to directly commit superblock rather
>>>> + * than going via journal.
>>>> + */
>>>> + sbi->s_journal_destorying = true;
>>>> +
>>>
>>> Hi, Ojaswin!
>>>
>>> I'm afraid you still need to flush the superblock update work here,
>>> otherwise I guess the race condition you mentioned in v1 could still
>>> occur.
>>>
>>> ext4_put_super()
>>>   flush_work(&sbi->s_sb_upd_work)
>>>
>>> **kjournald2**
>>>   jbd2_journal_commit_transaction()
>>>     ...
>>>     ext4_inode_error()
>>>       /* JBD2_UNMOUNT not set */
>>>       schedule_work(s_sb_upd_work)
>>>
>>> **workqueue**
>>>   update_super_work
>>>     /* s_journal_destorying is not set */
>>>     if (journal && !s_journal_destorying)
>>>
>>> ext4_journal_destroy()
>>>   /* set s_journal_destorying */
>>>   sbi->s_journal_destorying = true;
>>>   jbd2_journal_destroy()
>>>     journal->j_flags |= JBD2_UNMOUNT;
>>>
>>> jbd2_journal_start()
>>>   start_this_handle()
>>>     BUG_ON(JBD2_UNMOUNT)
>>>
>>> Thanks,
>>> Yi.
>> Hi Yi,
>>
>> Yes, you are right, I somehow missed this edge case :(
>>
>> Alright then, we have to move setting sbi->s_journal_destroying outside the
>> helper. Just wondering if I should still keep it in ext4_journal_destroy and
>> just add an extra s_journal_destroying = false before
>> schedule_work(s_sb_upd_work), because it makes sense to keep it there.
>>
>> Okay let me give it some thought but thanks for pointing this out!
>>
>> Regards,
>> ojaswin
>
> Okay so, thinking about it a bit more: I see you also suggested flushing the
> work after marking sbi->s_journal_destroying. But will that solve it?
>
> ext4_put_super()
>   flush_work(&sbi->s_sb_upd_work)
>
> **kjournald2**
>   jbd2_journal_commit_transaction()
>     ...
>     ext4_inode_error()
>       /* JBD2_UNMOUNT not set */
>       schedule_work(s_sb_upd_work)
>
> **workqueue**
>   update_super_work
>     /* s_journal_destorying is not set */
>     if (journal && !s_journal_destorying)
>
> ext4_journal_destroy()
>   /* set s_journal_destorying */
>   sbi->s_journal_destorying = true;
>   flush_work(&sbi->s_sb_upd_work)
>   schedule_work()
    ^^^^^^^^^^^^^^^
Where does this come from?
After this flush_work, we can guarantee that the running s_sb_upd_work
finishes before we set JBD2_UNMOUNT. Additionally, the journal will
not commit a transaction or call schedule_work() again, because it has
been aborted due to the previous error. Am I missing something?
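A rough sketch of the ordering described above, layered on the
ext4_journal_destroy() helper from the patch (untested and for
illustration only; it reuses the s_journal_destorying field exactly as
spelled in the diff):

```c
static inline int ext4_journal_destroy(struct ext4_sb_info *sbi, journal_t *journal)
{
	int err;

	/*
	 * From here on, ext4_handle_error() commits the superblock
	 * directly instead of scheduling s_sb_upd_work.
	 */
	sbi->s_journal_destorying = true;

	/*
	 * A racing ext4_handle_error() may have sampled the flag as
	 * false and already scheduled the work, so wait for it to
	 * finish before JBD2_UNMOUNT is set. Once the journal has
	 * been aborted by the earlier error, no new commit can
	 * schedule the work again.
	 */
	flush_work(&sbi->s_sb_upd_work);

	err = jbd2_journal_destroy(journal);
	sbi->s_journal = NULL;

	return err;
}
```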
Thanks,
Yi.
> jbd2_journal_destroy()
> journal->j_flags |= JBD2_UNMOUNT;
>
> jbd2_journal_start()
> start_this_handle()
> BUG_ON(JBD2_UNMOUNT)
>
>
> Seems like these edge cases keep sprouting up :)
>
> As for the fix, how about we do something like this:
>
> ext4_put_super()
>
>   flush_work(&sbi->s_sb_upd_work)
>   destroy_workqueue(sbi->rsv_conversion_wq);
>
>   ext4_journal_destroy()
>     /* set s_journal_destorying */
>     sbi->s_journal_destorying = true;
>
>     /* trigger a commit and wait for it to complete */
>     flush_work(&sbi->s_sb_upd_work)
>
>     jbd2_journal_destroy()
>       journal->j_flags |= JBD2_UNMOUNT;
>
> jbd2_journal_start()
>   start_this_handle()
>     BUG_ON(JBD2_UNMOUNT)
>
> Still giving this codepath some thought, but it seems like this might just
> be enough to fix the race. Thoughts on this?
>
> Regards,
> ojaswin
>
>>>
>>>> err = jbd2_journal_destroy(journal);
>>>> sbi->s_journal = NULL;
>>>>
>>>> diff --git a/fs/ext4/super.c b/fs/ext4/super.c
>>>> index 8ad664d47806..31552cf0519a 100644
>>>> --- a/fs/ext4/super.c
>>>> +++ b/fs/ext4/super.c
>>>> @@ -706,7 +706,7 @@ static void ext4_handle_error(struct super_block *sb, bool force_ro, int error,
>>>> * constraints, it may not be safe to do it right here so we
>>>> * defer superblock flushing to a workqueue.
>>>> */
>>>> - if (continue_fs && journal)
>>>> + if (continue_fs && journal && !EXT4_SB(sb)->s_journal_destorying)
>>>> schedule_work(&EXT4_SB(sb)->s_sb_upd_work);
>>>> else
>>>> ext4_commit_super(sb);
>>>> @@ -5311,6 +5311,8 @@ static int __ext4_fill_super(struct fs_context *fc, struct super_block *sb)
>>>> spin_lock_init(&sbi->s_error_lock);
>>>> INIT_WORK(&sbi->s_sb_upd_work, update_super_work);
>>>>
>>>> + sbi->s_journal_destorying = false;
>>>> +
>>>> err = ext4_group_desc_init(sb, es, logical_sb_block, &first_not_zeroed);
>>>> if (err)
>>>> goto failed_mount3;
>>>
>