Re: [PATCH v2 2/3] ext4: avoid journaling sb update on error if journal is destroying
From: IBM
Date: Sat Mar 08 2025 - 05:10:27 EST
Zhang Yi <yi.zhang@xxxxxxxxxxxxxxx> writes:
> On 2025/3/8 1:26, Ojaswin Mujoo wrote:
>> On Fri, Mar 07, 2025 at 08:36:08PM +0800, Zhang Yi wrote:
>>> On 2025/3/7 18:27, Ojaswin Mujoo wrote:
>>>> On Fri, Mar 07, 2025 at 04:43:24PM +0800, Zhang Yi wrote:
>>>>> On 2025/3/7 16:13, Ojaswin Mujoo wrote:
>>>>>> On Fri, Mar 07, 2025 at 12:04:26PM +0530, Ojaswin Mujoo wrote:
>>>>>>> On Fri, Mar 07, 2025 at 10:49:28AM +0800, Zhang Yi wrote:
>>>>>>>> On 2025/3/6 22:28, Ojaswin Mujoo wrote:
>>>>>>>>> Presently we always BUG_ON when trying to start a transaction on a journal
>>>>>>>>> marked with JBD2_UNMOUNT, since this should never happen. However, while
>>>>>>>>> running LTP stress tests, it was observed that in some error handling
>>>>>>>>> paths it is possible for update_super_work to start a transaction after
>>>>>>>>> the journal is destroyed, eg:
>>>>>>>>>
>>>>>>>>> (umount)
>>>>>>>>> ext4_kill_sb
>>>>>>>>> kill_block_super
>>>>>>>>> generic_shutdown_super
>>>>>>>>> sync_filesystem /* commits all txns */
>>>>>>>>> evict_inodes
>>>>>>>>> /* might start a new txn */
>>>>>>>>> ext4_put_super
>>>>>>>>> flush_work(&sbi->s_sb_upd_work) /* flush the workqueue */
>>>>>>>>> jbd2_journal_destroy
>>>>>>>>> journal_kill_thread
>>>>>>>>> journal->j_flags |= JBD2_UNMOUNT;
>>>>>>>>> jbd2_journal_commit_transaction
>>>>>>>>> jbd2_journal_get_descriptor_buffer
>>>>>>>>> jbd2_journal_bmap
>>>>>>>>> ext4_journal_bmap
>>>>>>>>> ext4_map_blocks
>>>>>>>>> ...
>>>>>>>>> ext4_inode_error
>>>>>>>>> ext4_handle_error
>>>>>>>>> schedule_work(&sbi->s_sb_upd_work)
>>>>>>>>>
>>>>>>>>> /* work queue kicks in */
>>>>>>>>> update_super_work
>>>>>>>>> jbd2_journal_start
>>>>>>>>> start_this_handle
>>>>>>>>> BUG_ON(journal->j_flags &
>>>>>>>>> JBD2_UNMOUNT)
>>>>>>>>>
>>>>>>>>> Hence, introduce a new sbi flag, s_journal_destroying, to indicate that the
>>>>>>>>> journal is being destroyed, and only do a journaled (and deferred) update of
>>>>>>>>> the sb if this flag is not set. Otherwise, just fall back to an un-journaled
>>>>>>>>> commit.
>>>>>>>>>
>>>>>>>>> We set sbi->s_journal_destroying = true only after all the FS updates are done
>>>>>>>>> during ext4_put_super() (except a running transaction that will get committed
>>>>>>>>> during jbd2_journal_destroy()). After this point, it is safe to commit the sb
>>>>>>>>> outside the journal as it won't race with a journaled update (refer
>>>>>>>>> 2d01ddc86606).
>>>>>>>>>
>>>>>>>>> Also, we don't need a similar check in ext4_grp_locked_error since it is only
>>>>>>>>> called from mballoc and AFAICT it would always be valid to schedule work here.
>>>>>>>>>
>>>>>>>>> Fixes: 2d01ddc86606 ("ext4: save error info to sb through journal if available")
>>>>>>>>> Reported-by: Mahesh Kumar <maheshkumar657g@xxxxxxxxx>
>>>>>>>>> Suggested-by: Jan Kara <jack@xxxxxxx>
>>>>>>>>> Signed-off-by: Ojaswin Mujoo <ojaswin@xxxxxxxxxxxxx>
>>>>>>>>> ---
>>>>>>>>> fs/ext4/ext4.h | 2 ++
>>>>>>>>> fs/ext4/ext4_jbd2.h | 8 ++++++++
>>>>>>>>> fs/ext4/super.c | 4 +++-
>>>>>>>>> 3 files changed, 13 insertions(+), 1 deletion(-)
>>>>>>>>>
>>>>>>>>> diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
>>>>>>>>> index 2b7d781bfcad..d48e93bd5690 100644
>>>>>>>>> --- a/fs/ext4/ext4.h
>>>>>>>>> +++ b/fs/ext4/ext4.h
>>>>>>>>> @@ -1728,6 +1728,8 @@ struct ext4_sb_info {
>>>>>>>>> */
>>>>>>>>> struct work_struct s_sb_upd_work;
>>>>>>>>>
>>>>>>>>> + bool s_journal_destroying;
>>>>>>>>> +
>>>>>>>>> /* Atomic write unit values in bytes */
>>>>>>>>> unsigned int s_awu_min;
>>>>>>>>> unsigned int s_awu_max;
>>>>>>>>> diff --git a/fs/ext4/ext4_jbd2.h b/fs/ext4/ext4_jbd2.h
>>>>>>>>> index 9b3c9df02a39..6bd3ca84410d 100644
>>>>>>>>> --- a/fs/ext4/ext4_jbd2.h
>>>>>>>>> +++ b/fs/ext4/ext4_jbd2.h
>>>>>>>>> @@ -437,6 +437,14 @@ static inline int ext4_journal_destroy(struct ext4_sb_info *sbi, journal_t *jour
>>>>>>>>> {
>>>>>>>>> int err = 0;
>>>>>>>>>
>>>>>>>>> + /*
>>>>>>>>> + * At this point all pending FS updates should be done except a possible
>>>>>>>>> + * running transaction (which will commit in jbd2_journal_destroy). It
>>>>>>>>> + * is now safe for any new errors to directly commit superblock rather
>>>>>>>>> + * than going via journal.
>>>>>>>>> + */
>>>>>>>>> + sbi->s_journal_destroying = true;
>>>>>>>>> +
>>>>>>>>
>>>>>>>> Hi, Ojaswin!
>>>>>>>>
>>>>>>>> I'm afraid you still need to flush the superblock update work here,
>>>>>>>> otherwise I guess the race condition you mentioned in v1 could still
>>>>>>>> occur.
>>>>>>>>
>>>>>>>> ext4_put_super()
>>>>>>>> flush_work(&sbi->s_sb_upd_work)
>>>>>>>>
>>>>>>>> **kjournald2**
>>>>>>>> jbd2_journal_commit_transaction()
>>>>>>>> ...
>>>>>>>> ext4_inode_error()
>>>>>>>> /* JBD2_UNMOUNT not set */
>>>>>>>> schedule_work(s_sb_upd_work)
>>>>>>>>
>>>>>>>> **workqueue**
>>>>>>>> update_super_work
>>>>>>>> /* s_journal_destroying is not set */
>>>>>>>> if (journal && !s_journal_destroying)
>>>>>>>>
>>>>>>>> ext4_journal_destroy()
>>>>>>>> /* set s_journal_destroying */
>>>>>>>> sbi->s_journal_destroying = true;
>>>>>>>> jbd2_journal_destroy()
>>>>>>>> journal->j_flags |= JBD2_UNMOUNT;
>>>>>>>>
>>>>>>>> jbd2_journal_start()
>>>>>>>> start_this_handle()
>>>>>>>> BUG_ON(JBD2_UNMOUNT)
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Yi.
>>>>>>> Hi Yi,
>>>>>>>
>>>>>>> Yes you are right, somehow missed this edge case :(
>>>>>>>
>>>>>>> Alright then, we have to move setting sbi->s_journal_destroying outside
>>>>>>> the helper. I'm just wondering if I should still leave it in
>>>>>>> ext4_journal_destroy and just add an extra s_journal_destroying = false
>>>>>>> before schedule_work(s_sb_upd_work), because it makes sense.
>>>>>>>
>>>>>>> Okay let me give it some thought but thanks for pointing this out!
>>>>>>>
>>>>>>> Regards,
>>>>>>> ojaswin
>>>>>>
>>>>>> Okay, so thinking about it a bit more, I see you also suggested flushing
>>>>>> the work after marking sbi->s_journal_destroying. But will that solve
>>>>>> it?
>>>>>>
>>>>>> ext4_put_super()
>>>>>> flush_work(&sbi->s_sb_upd_work)
>>>>>>
>>>>>> **kjournald2**
>>>>>> jbd2_journal_commit_transaction()
>>>>>> ...
>>>>>> ext4_inode_error()
>>>>>> /* JBD2_UNMOUNT not set */
>>>>>> schedule_work(s_sb_upd_work)
>>>>>>
>>>>>> **workqueue**
>>>>>> update_super_work
>>>>>> /* s_journal_destroying is not set */
>>>>>> if (journal && !s_journal_destroying)
>>>>>>
>>>>>> ext4_journal_destroy()
>>>>>> /* set s_journal_destroying */
>>>>>> sbi->s_journal_destroying = true;
>>>>>> flush_work(&sbi->s_sb_upd_work)
>>>>>> schedule_work()
>>>>> ^^^^^^^^^^^^^^^
>>>>> where does this come from?
>>>>>
>>>>> After this flush_work, we can guarantee that the running s_sb_upd_work
>>>>> finishes before we set JBD2_UNMOUNT. Additionally, the journal will
>>>>> not commit transaction or call schedule_work() again because it has
>>>>> been aborted due to the previous error. Am I missing something?
>>>>>
>>>>> Thanks,
>>>>> Yi.
>>>>
>>>> Hmm, so I am thinking of a corner case in ext4_handle_error() where
>>>>
>>>> if(journal && !is_journal_destroying)
>>>>
>>>> is computed but schedule_work() is not called yet, which is possible because
>>>> the compare followed by the jump is not atomic. If schedule_work() is
>>>> only called after we have done the flush, then we end up with this:
>>>>
>>>> if (journal && !s_journal_destroying)
>>>> ext4_journal_destroy()
>>>> /* set s_journal_destroying */
>>>> sbi->s_journal_destroying = true;
>>>> flush_work(&sbi->s_sb_upd_work)
>>>> schedule_work()
>>>>
>>>> Which is possible IMO, although the window is tiny.
>>>
>>> Yeah, right!
>>> Sorry for misreading the location where you added the "!s_journal_destroying"
>>> check; the graph I provided was in update_super_work(), which was wrong.
>>
>> Oh right, I also misread your trace but yes as discussed, even
>>
>> sbi->s_journal_destroying = true;
>> flush_work()
>> jbd2_journal_destroy()
>>
>> doesn't work.
>>
>>> The right one should be:
>>>
>>> ext4_put_super()
>>> flush_work(&sbi->s_sb_upd_work)
>>>
>>> **kjournald2**
>>> jbd2_journal_commit_transaction()
>>> ...
>>> ext4_inode_error()
>>> /* s_journal_destroying is not set */
>>> if (journal && !s_journal_destroying)
>>> (schedule_work(s_sb_upd_work)) //can be here
>>>
>>> ext4_journal_destroy()
>>> /* set s_journal_destroying */
>>> sbi->s_journal_destroying = true;
>>> jbd2_journal_destroy()
>>> journal->j_flags |= JBD2_UNMOUNT;
>>>
>>> (schedule_work(s_sb_upd_work)) //also can be here
>>>
>>> **workqueue**
>>> update_super_work()
>>> journal = sbi->s_journal //get journal
>>> kfree(journal)
>>> jbd2_journal_start(journal) //journal UAF
>>> start_this_handle()
>>> BUG_ON(JBD2_UNMOUNT) //bugon here
>>>
>>>
>>> So there are two problems here, the first one is the 'journal' UAF,
>>> the second one is triggering JBD2_UNMOUNT flag BUGON.
>>
>> Indeed, there's a possible UAF here as well.
>>
>>>
>>>>>>
>>>>>> As for the fix, how about we do something like this:
>>>>>>
>>>>>> ext4_put_super()
>>>>>>
>>>>>> flush_work(&sbi->s_sb_upd_work)
>>>>>> destroy_workqueue(sbi->rsv_conversion_wq);
>>>>>>
>>>>>> ext4_journal_destroy()
>>>>>> /* set s_journal_destroying */
>>>>>> sbi->s_journal_destroying = true;
>>>>>>
>>>>>> /* trigger a commit and wait for it to complete */
>>>>>>
>>>>>> flush_work(&sbi->s_sb_upd_work)
>>>>>>
>>>>>> jbd2_journal_destroy()
>>>>>> journal->j_flags |= JBD2_UNMOUNT;
>>>>>>
>>>>>> jbd2_journal_start()
>>>>>> start_this_handle()
>>>>>> BUG_ON(JBD2_UNMOUNT)
>>>>>>
>>>>>> Still giving this codepath some thought but seems like this might just
>>>>>> be enough to fix the race. Thoughts on this?
>>>>>>
>>>
>>> I think this solution should work; the forced commit and flush_work()
>>> should ensure that the last transaction is committed and that the
>>> potential work is done.
>>>
>>> Besides, the s_journal_destroying flag is now set and checked
>>> concurrently, so we need WRITE_ONCE() and READ_ONCE() for it. Also,
>>> what about adding a new flag to sbi->s_mount_state instead of adding
>>> a new s_journal_destroying field?
>>
>> Right, that makes sense. I will incorporate these changes in the next
>> revision.
>>
>
> Thinking about this again, it seems that we no longer need the destroying
> flag, because we force a commit and wait for the **last** transaction to
> complete, and the flush work should also ensure that the last sb_update
> work completes. Regardless of whether it starts a new handle in the
> last update_super_work(), it will not commit since the journal should
> have aborted. What are your thoughts?
>
I think the confusion may be coming from the fact that the v2 patch doesn't
set s_journal_destroying to true where we discussed in this thread [1]:
[1]: https://lore.kernel.org/linux-ext4/jnxpphuradrsf73cxfmohfu7wwwckihtulw6ovsitddgt5pqkg@2uoejkr66qnl/
> ext4_put_super()
+ sbi->s_journal_destroying = true;
We should set s_journal_destroying to true before calling
flush_work.
> flush_work(&sbi->s_sb_upd_work)
After the above flush work is complete, we will always check whether
s_journal_destroying is set. If it is, then we should never schedule the
sb update work.
> destroy_workqueue(sbi->rsv_conversion_wq)
>
> ext4_journal_destroy()
> /* trigger a commit (it will commit the last transaction) */
>
> **kjournald2**
> jbd2_journal_commit_transaction()
> ...
> ext4_inode_error()
> schedule_work(s_sb_upd_work))
Then this schedule_work() will never happen, since the error path will
check whether the s_journal_destroying flag is set.
>
> **workqueue**
> update_super_work()
> jbd2_journal_start(journal)
> start_this_handle()
> //This new trans will
> //not be committed.
>
> jbd2_journal_abort()
>
> /* wait for it to complete */
>
> flush_work(&sbi->s_sb_upd_work)
No need to call flush_work again here, since there is no new work
that will be scheduled, right?
Am I missing something?
-ritesh
> jbd2_journal_destroy()
> journal->j_flags |= JBD2_UNMOUNT;
> jbd2_journal_commit_transaction() //it will commit nothing
>
> Thanks,
> Yi.