Re: [PATCH v2 2/3] ext4: avoid journaling sb update on error if journal is destroying

From: Ojaswin Mujoo
Date: Sat Mar 08 2025 - 04:59:24 EST


On Sat, Mar 08, 2025 at 01:48:44PM +0530, Ojaswin Mujoo wrote:
> On Sat, Mar 08, 2025 at 10:57:16AM +0800, Zhang Yi wrote:
> > On 2025/3/8 1:26, Ojaswin Mujoo wrote:
> > > On Fri, Mar 07, 2025 at 08:36:08PM +0800, Zhang Yi wrote:
> > >> On 2025/3/7 18:27, Ojaswin Mujoo wrote:
> > >>> On Fri, Mar 07, 2025 at 04:43:24PM +0800, Zhang Yi wrote:
> > >>>> On 2025/3/7 16:13, Ojaswin Mujoo wrote:
> > >>>>> On Fri, Mar 07, 2025 at 12:04:26PM +0530, Ojaswin Mujoo wrote:
> > >>>>>> On Fri, Mar 07, 2025 at 10:49:28AM +0800, Zhang Yi wrote:
> > >>>>>>> On 2025/3/6 22:28, Ojaswin Mujoo wrote:
> > >>>>>>>> Presently we always BUG_ON if trying to start a transaction on a journal marked
> > >>>>>>>> with JBD2_UNMOUNT, since this should never happen. However, while ltp running
> > >>>>>>>> stress tests, it was observed that in case of some error handling paths, it is
> > >>>>>>>> possible for update_super_work to start a transaction after the journal is
> > >>>>>>>> destroyed eg:
> > >>>>>>>>
> > >>>>>>>> (umount)
> > >>>>>>>> ext4_kill_sb
> > >>>>>>>> kill_block_super
> > >>>>>>>> generic_shutdown_super
> > >>>>>>>> sync_filesystem /* commits all txns */
> > >>>>>>>> evict_inodes
> > >>>>>>>> /* might start a new txn */
> > >>>>>>>> ext4_put_super
> > >>>>>>>> flush_work(&sbi->s_sb_upd_work) /* flush the workqueue */
> > >>>>>>>> jbd2_journal_destroy
> > >>>>>>>> journal_kill_thread
> > >>>>>>>> journal->j_flags |= JBD2_UNMOUNT;
> > >>>>>>>> jbd2_journal_commit_transaction
> > >>>>>>>> jbd2_journal_get_descriptor_buffer
> > >>>>>>>> jbd2_journal_bmap
> > >>>>>>>> ext4_journal_bmap
> > >>>>>>>> ext4_map_blocks
> > >>>>>>>> ...
> > >>>>>>>> ext4_inode_error
> > >>>>>>>> ext4_handle_error
> > >>>>>>>> schedule_work(&sbi->s_sb_upd_work)
> > >>>>>>>>
> > >>>>>>>> /* work queue kicks in */
> > >>>>>>>> update_super_work
> > >>>>>>>> jbd2_journal_start
> > >>>>>>>> start_this_handle
> > >>>>>>>> BUG_ON(journal->j_flags &
> > >>>>>>>> JBD2_UNMOUNT)
> > >>>>>>>>
> > >>>>>>>> Hence, introduce a new sbi flag s_journal_destroying to indicate journal is
> > >>>>>>>> destroying only do a journaled (and deferred) update of sb if this flag is not
> > >>>>>>>> set. Otherwise, just fallback to an un-journaled commit.
> > >>>>>>>>
> > >>>>>>>> We set sbi->s_journal_destroying = true only after all the FS updates are done
> > >>>>>>>> during ext4_put_super() (except a running transaction that will get commited
> > >>>>>>>> during jbd2_journal_destroy()). After this point, it is safe to commit the sb
> > >>>>>>>> outside the journal as it won't race with a journaled update (refer
> > >>>>>>>> 2d01ddc86606).
> > >>>>>>>>
> > >>>>>>>> Also, we don't need a similar check in ext4_grp_locked_error since it is only
> > >>>>>>>> called from mballoc and AFAICT it would be always valid to schedule work here.
> > >>>>>>>>
> > >>>>>>>> Fixes: 2d01ddc86606 ("ext4: save error info to sb through journal if available")
> > >>>>>>>> Reported-by: Mahesh Kumar <maheshkumar657g@xxxxxxxxx>
> > >>>>>>>> Suggested-by: Jan Kara <jack@xxxxxxx>
> > >>>>>>>> Signed-off-by: Ojaswin Mujoo <ojaswin@xxxxxxxxxxxxx>
> > >>>>>>>> ---
> > >>>>>>>> fs/ext4/ext4.h | 2 ++
> > >>>>>>>> fs/ext4/ext4_jbd2.h | 8 ++++++++
> > >>>>>>>> fs/ext4/super.c | 4 +++-
> > >>>>>>>> 3 files changed, 13 insertions(+), 1 deletion(-)
> > >>>>>>>>
> > >>>>>>>> diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
> > >>>>>>>> index 2b7d781bfcad..d48e93bd5690 100644
> > >>>>>>>> --- a/fs/ext4/ext4.h
> > >>>>>>>> +++ b/fs/ext4/ext4.h
> > >>>>>>>> @@ -1728,6 +1728,8 @@ struct ext4_sb_info {
> > >>>>>>>> */
> > >>>>>>>> struct work_struct s_sb_upd_work;
> > >>>>>>>>
> > >>>>>>>> + bool s_journal_destorying;
> > >>>>>>>> +
> > >>>>>>>> /* Atomic write unit values in bytes */
> > >>>>>>>> unsigned int s_awu_min;
> > >>>>>>>> unsigned int s_awu_max;
> > >>>>>>>> diff --git a/fs/ext4/ext4_jbd2.h b/fs/ext4/ext4_jbd2.h
> > >>>>>>>> index 9b3c9df02a39..6bd3ca84410d 100644
> > >>>>>>>> --- a/fs/ext4/ext4_jbd2.h
> > >>>>>>>> +++ b/fs/ext4/ext4_jbd2.h
> > >>>>>>>> @@ -437,6 +437,14 @@ static inline int ext4_journal_destroy(struct ext4_sb_info *sbi, journal_t *jour
> > >>>>>>>> {
> > >>>>>>>> int err = 0;
> > >>>>>>>>
> > >>>>>>>> + /*
> > >>>>>>>> + * At this point all pending FS updates should be done except a possible
> > >>>>>>>> + * running transaction (which will commit in jbd2_journal_destroy). It
> > >>>>>>>> + * is now safe for any new errors to directly commit superblock rather
> > >>>>>>>> + * than going via journal.
> > >>>>>>>> + */
> > >>>>>>>> + sbi->s_journal_destorying = true;
> > >>>>>>>> +
> > >>>>>>>
> > >>>>>>> Hi, Ojaswin!
> > >>>>>>>
> > >>>>>>> I'm afraid you still need to flush the superblock update work here,
> > >>>>>>> otherwise I guess the race condition you mentioned in v1 could still
> > >>>>>>> occur.
> > >>>>>>>
> > >>>>>>> ext4_put_super()
> > >>>>>>> flush_work(&sbi->s_sb_upd_work)
> > >>>>>>>
> > >>>>>>> **kjournald2**
> > >>>>>>> jbd2_journal_commit_transaction()
> > >>>>>>> ...
> > >>>>>>> ext4_inode_error()
> > >>>>>>> /* JBD2_UNMOUNT not set */
> > >>>>>>> schedule_work(s_sb_upd_work)
> > >>>>>>>
> > >>>>>>> **workqueue**
> > >>>>>>> update_super_work
> > >>>>>>> /* s_journal_destorying is not set */
> > >>>>>>> if (journal && !s_journal_destorying)
> > >>>>>>>
> > >>>>>>> ext4_journal_destroy()
> > >>>>>>> /* set s_journal_destorying */
> > >>>>>>> sbi->s_journal_destorying = true;
> > >>>>>>> jbd2_journal_destroy()
> > >>>>>>> journal->j_flags |= JBD2_UNMOUNT;
> > >>>>>>>
> > >>>>>>> jbd2_journal_start()
> > >>>>>>> start_this_handle()
> > >>>>>>> BUG_ON(JBD2_UNMOUNT)
> > >>>>>>>
> > >>>>>>> Thanks,
> > >>>>>>> Yi.
> > >>>>>> Hi Yi,
> > >>>>>>
> > >>>>>> Yes you are right, somehow missed this edge case :(
> > >>>>>>
> > >>>>>> Alright then, we have to move out sbi->s_journal_destroying outside the
> > >>>>>> helper. Just wondering if I should still let it be in
> > >>>>>> ext4_journal_destroy and just add an extra s_journal_destroying = false
> > >>>>>> before schedule_work(s_sb_upd_work), because it makes sense.
> > >>>>>>
> > >>>>>> Okay let me give it some thought but thanks for pointing this out!
> > >>>>>>
> > >>>>>> Regards,
> > >>>>>> ojaswin
> > >>>>>
> > >>>>> Okay so thinking about it a bit more, I see you also suggested to flush
> > >>>>> the work after marking sbi->s_journal_destroying. But will that solve
> > >>>>> it?
> > >>>>>
> > >>>>> ext4_put_super()
> > >>>>> flush_work(&sbi->s_sb_upd_work)
> > >>>>>
> > >>>>> **kjournald2**
> > >>>>> jbd2_journal_commit_transaction()
> > >>>>> ...
> > >>>>> ext4_inode_error()
> > >>>>> /* JBD2_UNMOUNT not set */
> > >>>>> schedule_work(s_sb_upd_work)
> > >>>>>
> > >>>>> **workqueue**
> > >>>>> update_super_work
> > >>>>> /* s_journal_destorying is not set */
> > >>>>> if (journal && !s_journal_destorying)
> > >>>>>
> > >>>>> ext4_journal_destroy()
> > >>>>> /* set s_journal_destorying */
> > >>>>> sbi->s_journal_destorying = true;
> > >>>>> flush_work(&sbi->s_sb_upd_work)
> > >>>>> schedule_work()
> > >>>> ^^^^^^^^^^^^^^^
> > >>>> where does this come from?
> > >>>>
> > >>>> After this flush_work, we can guarantee that the running s_sb_upd_work
> > >>>> finishes before we set JBD2_UNMOUNT. Additionally, the journal will
> > >>>> not commit transaction or call schedule_work() again because it has
> > >>>> been aborted due to the previous error. Am I missing something?
> > >>>>
> > >>>> Thanks,
> > >>>> Yi.
> > >>>
> > >>> Hmm, so I am thinking of a corner case in ext4_handle_error() where
> > >>>
> > >>> if(journal && !is_journal_destroying)
> > >>>
> > >>> is computed but schedule_work() not called yet, which is possible cause
> > >>> the cmp followed by jump is not atomic in nature. If the schedule_work
> > >>> is only called after we have done the flush then we end up with this:
> > >>>
> > >>> if (journal && !s_journal_destorying)
> > >>> ext4_journal_destroy()
> > >>> /* set s_journal_destorying */
> > >>> sbi->s_journal_destorying = true;
> > >>> flush_work(&sbi->s_sb_upd_work)
> > >>> schedule_work()
> > >>>
> > >>> Which is possible IMO, although the window is tiny.
> > >>
> > >> Yeah, right!
> > >> Sorry for misread the location where you add the "!s_journal_destorying"
> > >> check, the graph I provided was in update_super_work(), which was wrong.
> > >
> > > Oh right, I also misread your trace but yes as discussed, even
> > >
> > > sbi->s_journal_destorying = true;
> > > flush_work()
> > > jbd2_journal_destroy()
> > >
> > > doesn't work.
> > >
> > >> The right one should be:
> > >>
> > >> ext4_put_super()
> > >> flush_work(&sbi->s_sb_upd_work)
> > >>
> > >> **kjournald2**
> > >> jbd2_journal_commit_transaction()
> > >> ...
> > >> ext4_inode_error()
> > >> /* s_journal_destorying is not set */
> > >> if (journal && !s_journal_destorying)
> > >> (schedule_work(s_sb_upd_work)) //can be here
> > >>
> > >> ext4_journal_destroy()
> > >> /* set s_journal_destorying */
> > >> sbi->s_journal_destorying = true;
> > >> jbd2_journal_destroy()
> > >> journal->j_flags |= JBD2_UNMOUNT;
> > >>
> > >> (schedule_work(s_sb_upd_work)) //also can be here
> > >>
> > >> **workqueue**
> > >> update_super_work()
> > >> journal = sbi->s_journal //get journal
> > >> kfree(journal)
> > >> jbd2_journal_start(journal) //journal UAF
> > >> start_this_handle()
> > >> BUG_ON(JBD2_UNMOUNT) //bugon here
> > >>
> > >>
> > >> So there are two problems here, the first one is the 'journal' UAF,
> > >> the second one is triggering JBD2_UNMOUNT flag BUGON.
> > >
> > > Indeed, there's a possible UAF here as well.
> > >
> > >>
> > >>>>>
> > >>>>> As for the fix, how about we do something like this:
> > >>>>>
> > >>>>> ext4_put_super()
> > >>>>>
> > >>>>> flush_work(&sbi->s_sb_upd_work)
> > >>>>> destroy_workqueue(sbi->rsv_conversion_wq);
> > >>>>>
> > >>>>> ext4_journal_destroy()
> > >>>>> /* set s_journal_destorying */
> > >>>>> sbi->s_journal_destorying = true;
> > >>>>>
> > >>>>> /* trigger a commit and wait for it to complete */
> > >>>>>
> > >>>>> flush_work(&sbi->s_sb_upd_work)
> > >>>>>
> > >>>>> jbd2_journal_destroy()
> > >>>>> journal->j_flags |= JBD2_UNMOUNT;
> > >>>>>
> > >>>>> jbd2_journal_start()
> > >>>>> start_this_handle()
> > >>>>> BUG_ON(JBD2_UNMOUNT)
> > >>>>>
> > >>>>> Still giving this codepath some thought but seems like this might just
> > >>>>> be enough to fix the race. Thoughts on this?
> > >>>>>
> > >>
> > >> I think this solution should work, the forced commit and flush_work()
> > >> should ensure that the last transaction is committed and that the
> > >> potential work is done.
> > >>
> > >> Besides, the s_journal_destorying flag is set and check concurrently
> > >> now, so we need WRITE_ONCE() and READ_ONCE() for it. Besides, what
> > >> about adding a new flag into sbi->s_mount_state instead of adding
> > >> new s_journal_destorying?
> > >
> > > Right, that makes sence. I will incorporate these changes in the next
> > > revision.
> > >
> >
> > Think about this again, it seems that we no longer need the destroying
> > flag. Because we force to commit and wait for the **last** transaction to
> > complete, and the flush work should also ensure that the last sb_update
> > work to complete. Regardless of whether it starts a new handle in the
> > last update_super_work(), it will not commit since the journal should
> > have aborted. What are your thoughts?
> >
> > ext4_put_super()
> > flush_work(&sbi->s_sb_upd_work)
> > destroy_workqueue(sbi->rsv_conversion_wq)
> >
> > ext4_journal_destroy()
> > /* trigger a commit (it will commit the last trnasaction) */
> >
> > **kjournald2**
> > jbd2_journal_commit_transaction()
> > ...
> > ext4_inode_error()
> > schedule_work(s_sb_upd_work))
> >
> > **workqueue**
> > update_super_work()
> > jbd2_journal_start(journal)
> > start_this_handle()
> > //This new trans will
> > //not be committed.
> >
> > jbd2_journal_abort()
> >
> > /* wait for it to complete */
> >
> > flush_work(&sbi->s_sb_upd_work)
> > jbd2_journal_destroy()
> > journal->j_flags |= JBD2_UNMOUNT;
> > jbd2_journal_commit_transaction() //it will commit nothing
> >
> > Thanks,
> > Yi.
>
> Hi Yi,
>
> There's one more path for which we need the flag:
>
> ext4_journal_destroy()
> /* trigger a commit (it will commit the last trnasaction) */
>
> **kjournald2**
> jbd2_journal_commit_transaction()
> journal->j_commit_callback()
> ext4_journal_commit_callback()
> ext4_maybe_update_superblock()
> schedule_work()
> /* start a transaction here */
> flush_work()
> jbd2_journal_destroy()
> journal_kill_thread
> flags |= JBD2_UNMOUNT
> jbd2_journal_commit_transaction()
> ...
> ext4_inode_error()
> schedule_work(s_sb_upd_work))
> /* update_super_work_tries to start the txn */
> BUG_ON()

Oops the formatting is wrong, here's the trace:

ext4_journal_destroy()
/* trigger a commit (it will commit the last trnasaction) */

**kjournald2**
jbd2_journal_commit_transaction()
journal->j_commit_callback()
ext4_journal_commit_callback()
ext4_maybe_update_superblock()
schedule_work()

/* update_super_work starts a new txn here */
flush_work()
jbd2_journal_destroy()
journal_kill_thread
flags |= JBD2_UNMOUNT
jbd2_journal_commit_transaction()
...
ext4_inode_error()
schedule_work(s_sb_upd_work))
/* update_super_work_tries to start the txn */
BUG_ON()

>
> I think this to protect against this path we do need a flag.
>
> Regards,
> ojaswin
> >