Re: [PATCH v2 2/3] ext4: avoid journaling sb update on error if journal is destroying
From: Ojaswin Mujoo
Date: Sat Mar 08 2025 - 03:19:23 EST
On Sat, Mar 08, 2025 at 10:57:16AM +0800, Zhang Yi wrote:
> On 2025/3/8 1:26, Ojaswin Mujoo wrote:
> > On Fri, Mar 07, 2025 at 08:36:08PM +0800, Zhang Yi wrote:
> >> On 2025/3/7 18:27, Ojaswin Mujoo wrote:
> >>> On Fri, Mar 07, 2025 at 04:43:24PM +0800, Zhang Yi wrote:
> >>>> On 2025/3/7 16:13, Ojaswin Mujoo wrote:
> >>>>> On Fri, Mar 07, 2025 at 12:04:26PM +0530, Ojaswin Mujoo wrote:
> >>>>>> On Fri, Mar 07, 2025 at 10:49:28AM +0800, Zhang Yi wrote:
> >>>>>>> On 2025/3/6 22:28, Ojaswin Mujoo wrote:
> >>>>>>>> Presently we always BUG_ON if trying to start a transaction on a journal marked
> >>>>>>>> with JBD2_UNMOUNT, since this should never happen. However, while running LTP
> >>>>>>>> stress tests, it was observed that in some error handling paths it is possible
> >>>>>>>> for update_super_work to start a transaction after the journal is destroyed,
> >>>>>>>> e.g.:
> >>>>>>>>
> >>>>>>>> (umount)
> >>>>>>>> ext4_kill_sb
> >>>>>>>>   kill_block_super
> >>>>>>>>     generic_shutdown_super
> >>>>>>>>       sync_filesystem /* commits all txns */
> >>>>>>>>       evict_inodes
> >>>>>>>>         /* might start a new txn */
> >>>>>>>>       ext4_put_super
> >>>>>>>>         flush_work(&sbi->s_sb_upd_work) /* flush the workqueue */
> >>>>>>>>         jbd2_journal_destroy
> >>>>>>>>           journal_kill_thread
> >>>>>>>>             journal->j_flags |= JBD2_UNMOUNT;
> >>>>>>>>           jbd2_journal_commit_transaction
> >>>>>>>>             jbd2_journal_get_descriptor_buffer
> >>>>>>>>               jbd2_journal_bmap
> >>>>>>>>                 ext4_journal_bmap
> >>>>>>>>                   ext4_map_blocks
> >>>>>>>>                   ...
> >>>>>>>>                   ext4_inode_error
> >>>>>>>>                     ext4_handle_error
> >>>>>>>>                       schedule_work(&sbi->s_sb_upd_work)
> >>>>>>>>
> >>>>>>>>                         /* work queue kicks in */
> >>>>>>>>                         update_super_work
> >>>>>>>>                           jbd2_journal_start
> >>>>>>>>                             start_this_handle
> >>>>>>>>                               BUG_ON(journal->j_flags & JBD2_UNMOUNT)
> >>>>>>>>
> >>>>>>>> Hence, introduce a new sbi flag s_journal_destroying to indicate the journal is
> >>>>>>>> being destroyed, and only do a journaled (and deferred) update of the sb if this
> >>>>>>>> flag is not set. Otherwise, just fall back to an un-journaled commit.
> >>>>>>>>
> >>>>>>>> We set sbi->s_journal_destroying = true only after all the FS updates are done
> >>>>>>>> during ext4_put_super() (except a running transaction that will get committed
> >>>>>>>> during jbd2_journal_destroy()). After this point, it is safe to commit the sb
> >>>>>>>> outside the journal as it won't race with a journaled update (refer to commit
> >>>>>>>> 2d01ddc86606).
> >>>>>>>>
> >>>>>>>> Also, we don't need a similar check in ext4_grp_locked_error since it is only
> >>>>>>>> called from mballoc and AFAICT it would always be valid to schedule the work there.
> >>>>>>>>
> >>>>>>>> Fixes: 2d01ddc86606 ("ext4: save error info to sb through journal if available")
> >>>>>>>> Reported-by: Mahesh Kumar <maheshkumar657g@xxxxxxxxx>
> >>>>>>>> Suggested-by: Jan Kara <jack@xxxxxxx>
> >>>>>>>> Signed-off-by: Ojaswin Mujoo <ojaswin@xxxxxxxxxxxxx>
> >>>>>>>> ---
> >>>>>>>> fs/ext4/ext4.h | 2 ++
> >>>>>>>> fs/ext4/ext4_jbd2.h | 8 ++++++++
> >>>>>>>> fs/ext4/super.c | 4 +++-
> >>>>>>>> 3 files changed, 13 insertions(+), 1 deletion(-)
> >>>>>>>>
> >>>>>>>> diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
> >>>>>>>> index 2b7d781bfcad..d48e93bd5690 100644
> >>>>>>>> --- a/fs/ext4/ext4.h
> >>>>>>>> +++ b/fs/ext4/ext4.h
> >>>>>>>> @@ -1728,6 +1728,8 @@ struct ext4_sb_info {
> >>>>>>>> */
> >>>>>>>> struct work_struct s_sb_upd_work;
> >>>>>>>>
> >>>>>>>> + bool s_journal_destorying;
> >>>>>>>> +
> >>>>>>>> /* Atomic write unit values in bytes */
> >>>>>>>> unsigned int s_awu_min;
> >>>>>>>> unsigned int s_awu_max;
> >>>>>>>> diff --git a/fs/ext4/ext4_jbd2.h b/fs/ext4/ext4_jbd2.h
> >>>>>>>> index 9b3c9df02a39..6bd3ca84410d 100644
> >>>>>>>> --- a/fs/ext4/ext4_jbd2.h
> >>>>>>>> +++ b/fs/ext4/ext4_jbd2.h
> >>>>>>>> @@ -437,6 +437,14 @@ static inline int ext4_journal_destroy(struct ext4_sb_info *sbi, journal_t *jour
> >>>>>>>> {
> >>>>>>>> int err = 0;
> >>>>>>>>
> >>>>>>>> + /*
> >>>>>>>> + * At this point all pending FS updates should be done except a possible
> >>>>>>>> + * running transaction (which will commit in jbd2_journal_destroy). It
> >>>>>>>> + * is now safe for any new errors to directly commit superblock rather
> >>>>>>>> + * than going via journal.
> >>>>>>>> + */
> >>>>>>>> + sbi->s_journal_destorying = true;
> >>>>>>>> +
> >>>>>>>
> >>>>>>> Hi, Ojaswin!
> >>>>>>>
> >>>>>>> I'm afraid you still need to flush the superblock update work here,
> >>>>>>> otherwise I guess the race condition you mentioned in v1 could still
> >>>>>>> occur.
> >>>>>>>
> >>>>>>> ext4_put_super()
> >>>>>>> flush_work(&sbi->s_sb_upd_work)
> >>>>>>>
> >>>>>>> **kjournald2**
> >>>>>>> jbd2_journal_commit_transaction()
> >>>>>>> ...
> >>>>>>> ext4_inode_error()
> >>>>>>> /* JBD2_UNMOUNT not set */
> >>>>>>> schedule_work(s_sb_upd_work)
> >>>>>>>
> >>>>>>> **workqueue**
> >>>>>>> update_super_work
> >>>>>>> /* s_journal_destorying is not set */
> >>>>>>> if (journal && !s_journal_destorying)
> >>>>>>>
> >>>>>>> ext4_journal_destroy()
> >>>>>>> /* set s_journal_destorying */
> >>>>>>> sbi->s_journal_destorying = true;
> >>>>>>> jbd2_journal_destroy()
> >>>>>>> journal->j_flags |= JBD2_UNMOUNT;
> >>>>>>>
> >>>>>>> jbd2_journal_start()
> >>>>>>> start_this_handle()
> >>>>>>> BUG_ON(JBD2_UNMOUNT)
> >>>>>>>
> >>>>>>> Thanks,
> >>>>>>> Yi.
> >>>>>> Hi Yi,
> >>>>>>
> >>>>>> Yes, you are right, I somehow missed this edge case :(
> >>>>>>
> >>>>>> Alright then, we have to move the setting of sbi->s_journal_destroying out of
> >>>>>> the helper. Just wondering if I should still keep it in
> >>>>>> ext4_journal_destroy() and just add an extra s_journal_destroying = false
> >>>>>> before schedule_work(s_sb_upd_work), because it makes sense.
> >>>>>>
> >>>>>> Okay let me give it some thought but thanks for pointing this out!
> >>>>>>
> >>>>>> Regards,
> >>>>>> ojaswin
> >>>>>
> >>>>> Okay, so thinking about it a bit more, I see you also suggested flushing
> >>>>> the work after marking sbi->s_journal_destroying. But will that solve
> >>>>> it?
> >>>>>
> >>>>> ext4_put_super()
> >>>>> flush_work(&sbi->s_sb_upd_work)
> >>>>>
> >>>>> **kjournald2**
> >>>>> jbd2_journal_commit_transaction()
> >>>>> ...
> >>>>> ext4_inode_error()
> >>>>> /* JBD2_UNMOUNT not set */
> >>>>> schedule_work(s_sb_upd_work)
> >>>>>
> >>>>> **workqueue**
> >>>>> update_super_work
> >>>>> /* s_journal_destorying is not set */
> >>>>> if (journal && !s_journal_destorying)
> >>>>>
> >>>>> ext4_journal_destroy()
> >>>>> /* set s_journal_destorying */
> >>>>> sbi->s_journal_destorying = true;
> >>>>> flush_work(&sbi->s_sb_upd_work)
> >>>>> schedule_work()
> >>>> ^^^^^^^^^^^^^^^
> >>>> where does this come from?
> >>>>
> >>>> After this flush_work, we can guarantee that the running s_sb_upd_work
> >>>> finishes before we set JBD2_UNMOUNT. Additionally, the journal will
> >>>> not commit transaction or call schedule_work() again because it has
> >>>> been aborted due to the previous error. Am I missing something?
> >>>>
> >>>> Thanks,
> >>>> Yi.
> >>>
> >>> Hmm, so I am thinking of a corner case in ext4_handle_error() where
> >>>
> >>> if (journal && !is_journal_destroying)
> >>>
> >>> is computed but schedule_work() has not been called yet, which is possible
> >>> because the check and the call are not atomic. If the schedule_work() is
> >>> only called after we have done the flush, then we end up with this:
> >>>
> >>> if (journal && !s_journal_destorying)
> >>> ext4_journal_destroy()
> >>> /* set s_journal_destorying */
> >>> sbi->s_journal_destorying = true;
> >>> flush_work(&sbi->s_sb_upd_work)
> >>> schedule_work()
> >>>
> >>> Which is possible IMO, although the window is tiny.
> >>
> >> Yeah, right!
> >> Sorry for misreading the location where you add the "!s_journal_destorying"
> >> check; the graph I provided was in update_super_work(), which was wrong.
> >
> > Oh right, I also misread your trace but yes as discussed, even
> >
> > sbi->s_journal_destorying = true;
> > flush_work()
> > jbd2_journal_destroy()
> >
> > doesn't work.
> >
> >> The right one should be:
> >>
> >> ext4_put_super()
> >> flush_work(&sbi->s_sb_upd_work)
> >>
> >> **kjournald2**
> >> jbd2_journal_commit_transaction()
> >> ...
> >> ext4_inode_error()
> >> /* s_journal_destorying is not set */
> >> if (journal && !s_journal_destorying)
> >> (schedule_work(s_sb_upd_work)) //can be here
> >>
> >> ext4_journal_destroy()
> >> /* set s_journal_destorying */
> >> sbi->s_journal_destorying = true;
> >> jbd2_journal_destroy()
> >> journal->j_flags |= JBD2_UNMOUNT;
> >>
> >> (schedule_work(s_sb_upd_work)) //also can be here
> >>
> >> **workqueue**
> >> update_super_work()
> >> journal = sbi->s_journal //get journal
> >> kfree(journal)
> >> jbd2_journal_start(journal) //journal UAF
> >> start_this_handle()
> >> BUG_ON(JBD2_UNMOUNT) //bugon here
> >>
> >>
> >> So there are two problems here: the first one is the 'journal' UAF, and
> >> the second one is triggering the JBD2_UNMOUNT BUG_ON.
> >
> > Indeed, there's a possible UAF here as well.
> >
> >>
> >>>>>
> >>>>> As for the fix, how about we do something like this:
> >>>>>
> >>>>> ext4_put_super()
> >>>>>
> >>>>> flush_work(&sbi->s_sb_upd_work)
> >>>>> destroy_workqueue(sbi->rsv_conversion_wq);
> >>>>>
> >>>>> ext4_journal_destroy()
> >>>>> /* set s_journal_destorying */
> >>>>> sbi->s_journal_destorying = true;
> >>>>>
> >>>>> /* trigger a commit and wait for it to complete */
> >>>>>
> >>>>> flush_work(&sbi->s_sb_upd_work)
> >>>>>
> >>>>> jbd2_journal_destroy()
> >>>>> journal->j_flags |= JBD2_UNMOUNT;
> >>>>>
> >>>>> jbd2_journal_start()
> >>>>> start_this_handle()
> >>>>> BUG_ON(JBD2_UNMOUNT)
> >>>>>
> >>>>> Still giving this code path some thought, but it seems like this might
> >>>>> just be enough to fix the race. Thoughts on this?
> >>>>>
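A note for clarity on the ordering I'm proposing above: roughly, it would
make ext4_journal_destroy() look something like the sketch below. This is
untested and only for discussion; I'm assuming ext4_force_commit() is an
acceptable way to do the "trigger a commit and wait for it" step, keeping
the field name as spelled in the v2 diff, and eliding the rest of the
existing helper (return value handling elided too):

static inline int ext4_journal_destroy(struct ext4_sb_info *sbi,
                                       journal_t *journal)
{
        /*
         * Any new error from here on commits the sb directly instead
         * of scheduling s_sb_upd_work.
         */
        WRITE_ONCE(sbi->s_journal_destorying, true);

        /*
         * Commit the running transaction and wait for it, so that any
         * error hit while committing has already either scheduled the
         * work or taken the unjournaled fallback...
         */
        ext4_force_commit(sbi->s_sb);

        /* ...and wait for the last scheduled work, if any, to finish. */
        flush_work(&sbi->s_sb_upd_work);

        /* Only now set JBD2_UNMOUNT and tear down the journal. */
        return jbd2_journal_destroy(journal);
}

The WRITE_ONCE() here is meant to pair with a READ_ONCE() at the check
site, as you suggest below.
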
> >>
> >> I think this solution should work; the forced commit and flush_work()
> >> should ensure that the last transaction is committed and that the
> >> potential work is done.
> >>
> >> Besides, the s_journal_destorying flag is now set and checked concurrently,
> >> so we need WRITE_ONCE() and READ_ONCE() for it. Also, what about adding a
> >> new flag to sbi->s_mount_state instead of adding the new
> >> s_journal_destorying?
> >
> > Right, that makes sense. I will incorporate these changes in the next
> > revision.
> >
>
> Thinking about this again, it seems that we no longer need the destroying
> flag, because we force a commit and wait for the **last** transaction to
> complete, and the flush_work() should also ensure that the last sb update
> work completes. Regardless of whether a new handle is started in the
> last update_super_work(), it will not be committed since the journal should
> have been aborted. What are your thoughts?
>
> ext4_put_super()
> flush_work(&sbi->s_sb_upd_work)
> destroy_workqueue(sbi->rsv_conversion_wq)
>
> ext4_journal_destroy()
> /* trigger a commit (it will commit the last transaction) */
>
> **kjournald2**
> jbd2_journal_commit_transaction()
> ...
> ext4_inode_error()
> schedule_work(s_sb_upd_work)
>
> **workqueue**
> update_super_work()
> jbd2_journal_start(journal)
> start_this_handle()
> //This new trans will
> //not be committed.
>
> jbd2_journal_abort()
>
> /* wait for it to complete */
>
> flush_work(&sbi->s_sb_upd_work)
> jbd2_journal_destroy()
> journal->j_flags |= JBD2_UNMOUNT;
> jbd2_journal_commit_transaction() //it will commit nothing
>
> Thanks,
> Yi.
Hi Yi,
There's one more path for which we need the flag:
ext4_journal_destroy()
  /* trigger a commit (it will commit the last transaction) */

        **kjournald2**
        jbd2_journal_commit_transaction()
          journal->j_commit_callback()
            ext4_journal_commit_callback()
              ext4_maybe_update_superblock()
                schedule_work()
                /* the scheduled work starts a transaction here */

  flush_work()
  jbd2_journal_destroy()
    journal_kill_thread
      flags |= JBD2_UNMOUNT
    jbd2_journal_commit_transaction()
      ...
      ext4_inode_error()
        schedule_work(s_sb_upd_work)

        /* update_super_work tries to start the txn */
        BUG_ON()
So I think, to protect against this path, we do need the flag.
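
Roughly, the check I have in mind in the error path would look like the
sketch below. Again, only a sketch for discussion, not the actual hunk:
the real ext4_handle_error() has more conditions around this spot which
I've elided, the flag could just as well become an s_mount_state bit as
you suggested, and the READ_ONCE() pairs with the WRITE_ONCE() in
ext4_journal_destroy() (field name as spelled in the v2 diff):

        /* fragment inside ext4_handle_error(), surrounding logic elided */
        journal_t *journal = EXT4_SB(sb)->s_journal;

        if (journal && !READ_ONCE(EXT4_SB(sb)->s_journal_destorying)) {
                /* normal case: journaled and deferred sb update */
                schedule_work(&EXT4_SB(sb)->s_sb_upd_work);
        } else {
                /*
                 * Journal is absent or being destroyed: commit the sb
                 * directly, it can no longer race with a journaled
                 * update (see 2d01ddc86606).
                 */
                ext4_commit_super(sb);
        }

With this, the last commit inside jbd2_journal_destroy() can still hit
ext4_inode_error(), but it will take the direct-commit branch instead of
scheduling work that would later trip the BUG_ON.
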
Regards,
ojaswin
>