Re: [PATCH v2 2/3] ext4: avoid journaling sb update on error if journal is destroying

From: Zhang Yi
Date: Fri Mar 07 2025 - 21:57:34 EST


On 2025/3/8 1:26, Ojaswin Mujoo wrote:
> On Fri, Mar 07, 2025 at 08:36:08PM +0800, Zhang Yi wrote:
>> On 2025/3/7 18:27, Ojaswin Mujoo wrote:
>>> On Fri, Mar 07, 2025 at 04:43:24PM +0800, Zhang Yi wrote:
>>>> On 2025/3/7 16:13, Ojaswin Mujoo wrote:
>>>>> On Fri, Mar 07, 2025 at 12:04:26PM +0530, Ojaswin Mujoo wrote:
>>>>>> On Fri, Mar 07, 2025 at 10:49:28AM +0800, Zhang Yi wrote:
>>>>>>> On 2025/3/6 22:28, Ojaswin Mujoo wrote:
>>>>>>>> Presently we always BUG_ON if trying to start a transaction on a journal marked
>>>>>>>> with JBD2_UNMOUNT, since this should never happen. However, while ltp running
>>>>>>>> stress tests, it was observed that in case of some error handling paths, it is
>>>>>>>> possible for update_super_work to start a transaction after the journal is
>>>>>>>> destroyed eg:
>>>>>>>>
>>>>>>>> (umount)
>>>>>>>> ext4_kill_sb
>>>>>>>> kill_block_super
>>>>>>>> generic_shutdown_super
>>>>>>>> sync_filesystem /* commits all txns */
>>>>>>>> evict_inodes
>>>>>>>> /* might start a new txn */
>>>>>>>> ext4_put_super
>>>>>>>> flush_work(&sbi->s_sb_upd_work) /* flush the workqueue */
>>>>>>>> jbd2_journal_destroy
>>>>>>>> journal_kill_thread
>>>>>>>> journal->j_flags |= JBD2_UNMOUNT;
>>>>>>>> jbd2_journal_commit_transaction
>>>>>>>> jbd2_journal_get_descriptor_buffer
>>>>>>>> jbd2_journal_bmap
>>>>>>>> ext4_journal_bmap
>>>>>>>> ext4_map_blocks
>>>>>>>> ...
>>>>>>>> ext4_inode_error
>>>>>>>> ext4_handle_error
>>>>>>>> schedule_work(&sbi->s_sb_upd_work)
>>>>>>>>
>>>>>>>> /* work queue kicks in */
>>>>>>>> update_super_work
>>>>>>>> jbd2_journal_start
>>>>>>>> start_this_handle
>>>>>>>> BUG_ON(journal->j_flags &
>>>>>>>> JBD2_UNMOUNT)
>>>>>>>>
>>>>>>>> Hence, introduce a new sbi flag s_journal_destroying to indicate journal is
>>>>>>>> destroying only do a journaled (and deferred) update of sb if this flag is not
>>>>>>>> set. Otherwise, just fallback to an un-journaled commit.
>>>>>>>>
>>>>>>>> We set sbi->s_journal_destroying = true only after all the FS updates are done
>>>>>>>> during ext4_put_super() (except a running transaction that will get commited
>>>>>>>> during jbd2_journal_destroy()). After this point, it is safe to commit the sb
>>>>>>>> outside the journal as it won't race with a journaled update (refer
>>>>>>>> 2d01ddc86606).
>>>>>>>>
>>>>>>>> Also, we don't need a similar check in ext4_grp_locked_error since it is only
>>>>>>>> called from mballoc and AFAICT it would be always valid to schedule work here.
>>>>>>>>
>>>>>>>> Fixes: 2d01ddc86606 ("ext4: save error info to sb through journal if available")
>>>>>>>> Reported-by: Mahesh Kumar <maheshkumar657g@xxxxxxxxx>
>>>>>>>> Suggested-by: Jan Kara <jack@xxxxxxx>
>>>>>>>> Signed-off-by: Ojaswin Mujoo <ojaswin@xxxxxxxxxxxxx>
>>>>>>>> ---
>>>>>>>> fs/ext4/ext4.h | 2 ++
>>>>>>>> fs/ext4/ext4_jbd2.h | 8 ++++++++
>>>>>>>> fs/ext4/super.c | 4 +++-
>>>>>>>> 3 files changed, 13 insertions(+), 1 deletion(-)
>>>>>>>>
>>>>>>>> diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
>>>>>>>> index 2b7d781bfcad..d48e93bd5690 100644
>>>>>>>> --- a/fs/ext4/ext4.h
>>>>>>>> +++ b/fs/ext4/ext4.h
>>>>>>>> @@ -1728,6 +1728,8 @@ struct ext4_sb_info {
>>>>>>>> */
>>>>>>>> struct work_struct s_sb_upd_work;
>>>>>>>>
>>>>>>>> + bool s_journal_destorying;
>>>>>>>> +
>>>>>>>> /* Atomic write unit values in bytes */
>>>>>>>> unsigned int s_awu_min;
>>>>>>>> unsigned int s_awu_max;
>>>>>>>> diff --git a/fs/ext4/ext4_jbd2.h b/fs/ext4/ext4_jbd2.h
>>>>>>>> index 9b3c9df02a39..6bd3ca84410d 100644
>>>>>>>> --- a/fs/ext4/ext4_jbd2.h
>>>>>>>> +++ b/fs/ext4/ext4_jbd2.h
>>>>>>>> @@ -437,6 +437,14 @@ static inline int ext4_journal_destroy(struct ext4_sb_info *sbi, journal_t *jour
>>>>>>>> {
>>>>>>>> int err = 0;
>>>>>>>>
>>>>>>>> + /*
>>>>>>>> + * At this point all pending FS updates should be done except a possible
>>>>>>>> + * running transaction (which will commit in jbd2_journal_destroy). It
>>>>>>>> + * is now safe for any new errors to directly commit superblock rather
>>>>>>>> + * than going via journal.
>>>>>>>> + */
>>>>>>>> + sbi->s_journal_destorying = true;
>>>>>>>> +
>>>>>>>
>>>>>>> Hi, Ojaswin!
>>>>>>>
>>>>>>> I'm afraid you still need to flush the superblock update work here,
>>>>>>> otherwise I guess the race condition you mentioned in v1 could still
>>>>>>> occur.
>>>>>>>
>>>>>>> ext4_put_super()
>>>>>>> flush_work(&sbi->s_sb_upd_work)
>>>>>>>
>>>>>>> **kjournald2**
>>>>>>> jbd2_journal_commit_transaction()
>>>>>>> ...
>>>>>>> ext4_inode_error()
>>>>>>> /* JBD2_UNMOUNT not set */
>>>>>>> schedule_work(s_sb_upd_work)
>>>>>>>
>>>>>>> **workqueue**
>>>>>>> update_super_work
>>>>>>> /* s_journal_destorying is not set */
>>>>>>> if (journal && !s_journal_destorying)
>>>>>>>
>>>>>>> ext4_journal_destroy()
>>>>>>> /* set s_journal_destorying */
>>>>>>> sbi->s_journal_destorying = true;
>>>>>>> jbd2_journal_destroy()
>>>>>>> journal->j_flags |= JBD2_UNMOUNT;
>>>>>>>
>>>>>>> jbd2_journal_start()
>>>>>>> start_this_handle()
>>>>>>> BUG_ON(JBD2_UNMOUNT)
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Yi.
>>>>>> Hi Yi,
>>>>>>
>>>>>> Yes you are right, somehow missed this edge case :(
>>>>>>
>>>>>> Alright then, we have to move out sbi->s_journal_destroying outside the
>>>>>> helper. Just wondering if I should still let it be in
>>>>>> ext4_journal_destroy and just add an extra s_journal_destroying = false
>>>>>> before schedule_work(s_sb_upd_work), because it makes sense.
>>>>>>
>>>>>> Okay let me give it some thought but thanks for pointing this out!
>>>>>>
>>>>>> Regards,
>>>>>> ojaswin
>>>>>
>>>>> Okay so thinking about it a bit more, I see you also suggested to flush
>>>>> the work after marking sbi->s_journal_destroying. But will that solve
>>>>> it?
>>>>>
>>>>> ext4_put_super()
>>>>> flush_work(&sbi->s_sb_upd_work)
>>>>>
>>>>> **kjournald2**
>>>>> jbd2_journal_commit_transaction()
>>>>> ...
>>>>> ext4_inode_error()
>>>>> /* JBD2_UNMOUNT not set */
>>>>> schedule_work(s_sb_upd_work)
>>>>>
>>>>> **workqueue**
>>>>> update_super_work
>>>>> /* s_journal_destorying is not set */
>>>>> if (journal && !s_journal_destorying)
>>>>>
>>>>> ext4_journal_destroy()
>>>>> /* set s_journal_destorying */
>>>>> sbi->s_journal_destorying = true;
>>>>> flush_work(&sbi->s_sb_upd_work)
>>>>> schedule_work()
>>>> ^^^^^^^^^^^^^^^
>>>> where does this come from?
>>>>
>>>> After this flush_work, we can guarantee that the running s_sb_upd_work
>>>> finishes before we set JBD2_UNMOUNT. Additionally, the journal will
>>>> not commit transaction or call schedule_work() again because it has
>>>> been aborted due to the previous error. Am I missing something?
>>>>
>>>> Thanks,
>>>> Yi.
>>>
>>> Hmm, so I am thinking of a corner case in ext4_handle_error() where
>>>
>>> if(journal && !is_journal_destroying)
>>>
>>> is computed but schedule_work() not called yet, which is possible cause
>>> the cmp followed by jump is not atomic in nature. If the schedule_work
>>> is only called after we have done the flush then we end up with this:
>>>
>>> if (journal && !s_journal_destorying)
>>> ext4_journal_destroy()
>>> /* set s_journal_destorying */
>>> sbi->s_journal_destorying = true;
>>> flush_work(&sbi->s_sb_upd_work)
>>> schedule_work()
>>>
>>> Which is possible IMO, although the window is tiny.
>>
>> Yeah, right!
>> Sorry for misread the location where you add the "!s_journal_destorying"
>> check, the graph I provided was in update_super_work(), which was wrong.
>
> Oh right, I also misread your trace but yes as discussed, even
>
> sbi->s_journal_destorying = true;
> flush_work()
> jbd2_journal_destroy()
>
> doesn't work.
>
>> The right one should be:
>>
>> ext4_put_super()
>> flush_work(&sbi->s_sb_upd_work)
>>
>> **kjournald2**
>> jbd2_journal_commit_transaction()
>> ...
>> ext4_inode_error()
>> /* s_journal_destorying is not set */
>> if (journal && !s_journal_destorying)
>> (schedule_work(s_sb_upd_work)) //can be here
>>
>> ext4_journal_destroy()
>> /* set s_journal_destorying */
>> sbi->s_journal_destorying = true;
>> jbd2_journal_destroy()
>> journal->j_flags |= JBD2_UNMOUNT;
>>
>> (schedule_work(s_sb_upd_work)) //also can be here
>>
>> **workqueue**
>> update_super_work()
>> journal = sbi->s_journal //get journal
>> kfree(journal)
>> jbd2_journal_start(journal) //journal UAF
>> start_this_handle()
>> BUG_ON(JBD2_UNMOUNT) //bugon here
>>
>>
>> So there are two problems here, the first one is the 'journal' UAF,
>> the second one is triggering JBD2_UNMOUNT flag BUGON.
>
> Indeed, there's a possible UAF here as well.
>
>>
>>>>>
>>>>> As for the fix, how about we do something like this:
>>>>>
>>>>> ext4_put_super()
>>>>>
>>>>> flush_work(&sbi->s_sb_upd_work)
>>>>> destroy_workqueue(sbi->rsv_conversion_wq);
>>>>>
>>>>> ext4_journal_destroy()
>>>>> /* set s_journal_destorying */
>>>>> sbi->s_journal_destorying = true;
>>>>>
>>>>> /* trigger a commit and wait for it to complete */
>>>>>
>>>>> flush_work(&sbi->s_sb_upd_work)
>>>>>
>>>>> jbd2_journal_destroy()
>>>>> journal->j_flags |= JBD2_UNMOUNT;
>>>>>
>>>>> jbd2_journal_start()
>>>>> start_this_handle()
>>>>> BUG_ON(JBD2_UNMOUNT)
>>>>>
>>>>> Still giving this codepath some thought but seems like this might just
>>>>> be enough to fix the race. Thoughts on this?
>>>>>
>>
>> I think this solution should work, the forced commit and flush_work()
>> should ensure that the last transaction is committed and that the
>> potential work is done.
>>
>> Besides, the s_journal_destorying flag is set and check concurrently
>> now, so we need WRITE_ONCE() and READ_ONCE() for it. Besides, what
>> about adding a new flag into sbi->s_mount_state instead of adding
>> new s_journal_destorying?
>
> Right, that makes sence. I will incorporate these changes in the next
> revision.
>

Think about this again, it seems that we no longer need the destroying
flag. Because we force to commit and wait for the **last** transaction to
complete, and the flush work should also ensure that the last sb_update
work to complete. Regardless of whether it starts a new handle in the
last update_super_work(), it will not commit since the journal should
have aborted. What are your thoughts?

ext4_put_super()
flush_work(&sbi->s_sb_upd_work)
destroy_workqueue(sbi->rsv_conversion_wq)

ext4_journal_destroy()
/* trigger a commit (it will commit the last trnasaction) */

**kjournald2**
jbd2_journal_commit_transaction()
...
ext4_inode_error()
schedule_work(s_sb_upd_work))

**workqueue**
update_super_work()
jbd2_journal_start(journal)
start_this_handle()
//This new trans will
//not be committed.

jbd2_journal_abort()

/* wait for it to complete */

flush_work(&sbi->s_sb_upd_work)
jbd2_journal_destroy()
journal->j_flags |= JBD2_UNMOUNT;
jbd2_journal_commit_transaction() //it will commit nothing

Thanks,
Yi.