Re: [PATCH v2 2/3] ext4: avoid journaling sb update on error if journal is destroying
From: Ojaswin Mujoo
Date: Sat Mar 08 2025 - 07:58:09 EST
On Sat, Mar 08, 2025 at 06:10:14PM +0800, Baokun Li wrote:
> On 2025/3/8 17:58, Ojaswin Mujoo wrote:
> > On Sat, Mar 08, 2025 at 01:48:44PM +0530, Ojaswin Mujoo wrote:
> > > On Sat, Mar 08, 2025 at 10:57:16AM +0800, Zhang Yi wrote:
> > > > On 2025/3/8 1:26, Ojaswin Mujoo wrote:
> > > > > On Fri, Mar 07, 2025 at 08:36:08PM +0800, Zhang Yi wrote:
> > > > > > On 2025/3/7 18:27, Ojaswin Mujoo wrote:
> > > > > > > On Fri, Mar 07, 2025 at 04:43:24PM +0800, Zhang Yi wrote:
> > > > > > > > On 2025/3/7 16:13, Ojaswin Mujoo wrote:
> > > > > > > > > On Fri, Mar 07, 2025 at 12:04:26PM +0530, Ojaswin Mujoo wrote:
> > > > > > > > > > On Fri, Mar 07, 2025 at 10:49:28AM +0800, Zhang Yi wrote:
> > > > > > > > > > > On 2025/3/6 22:28, Ojaswin Mujoo wrote:
> > > > > > > > > > > > Presently we always BUG_ON if trying to start a transaction on a journal marked
> > > > > > > > > > > > with JBD2_UNMOUNT, since this should never happen. However, while ltp running
> > > > > > > > > > > > stress tests, it was observed that in case of some error handling paths, it is
> > > > > > > > > > > > possible for update_super_work to start a transaction after the journal is
> > > > > > > > > > > > destroyed eg:
> > > > > > > > > > > >
> > > > > > > > > > > > (umount)
> > > > > > > > > > > > ext4_kill_sb
> > > > > > > > > > > > kill_block_super
> > > > > > > > > > > > generic_shutdown_super
> > > > > > > > > > > > sync_filesystem /* commits all txns */
> > > > > > > > > > > > evict_inodes
> > > > > > > > > > > > /* might start a new txn */
> > > > > > > > > > > > ext4_put_super
> > > > > > > > > > > > flush_work(&sbi->s_sb_upd_work) /* flush the workqueue */
> > > > > > > > > > > > jbd2_journal_destroy
> > > > > > > > > > > > journal_kill_thread
> > > > > > > > > > > > journal->j_flags |= JBD2_UNMOUNT;
> > > > > > > > > > > > jbd2_journal_commit_transaction
> > > > > > > > > > > > jbd2_journal_get_descriptor_buffer
> > > > > > > > > > > > jbd2_journal_bmap
> > > > > > > > > > > > ext4_journal_bmap
> > > > > > > > > > > > ext4_map_blocks
> > > > > > > > > > > > ...
> > > > > > > > > > > > ext4_inode_error
> > > > > > > > > > > > ext4_handle_error
> > > > > > > > > > > > schedule_work(&sbi->s_sb_upd_work)
> > > > > > > > > > > >
> > > > > > > > > > > > /* work queue kicks in */
> > > > > > > > > > > > update_super_work
> > > > > > > > > > > > jbd2_journal_start
> > > > > > > > > > > > start_this_handle
> > > > > > > > > > > > BUG_ON(journal->j_flags &
> > > > > > > > > > > > JBD2_UNMOUNT)
> > > > > > > > > > > >
> > > > > > > > > > > > Hence, introduce a new sbi flag s_journal_destroying to indicate journal is
> > > > > > > > > > > > destroying only do a journaled (and deferred) update of sb if this flag is not
> > > > > > > > > > > > set. Otherwise, just fallback to an un-journaled commit.
> > > > > > > > > > > >
> > > > > > > > > > > > We set sbi->s_journal_destroying = true only after all the FS updates are done
> > > > > > > > > > > > during ext4_put_super() (except a running transaction that will get commited
> > > > > > > > > > > > during jbd2_journal_destroy()). After this point, it is safe to commit the sb
> > > > > > > > > > > > outside the journal as it won't race with a journaled update (refer
> > > > > > > > > > > > 2d01ddc86606).
> > > > > > > > > > > >
> > > > > > > > > > > > Also, we don't need a similar check in ext4_grp_locked_error since it is only
> > > > > > > > > > > > called from mballoc and AFAICT it would be always valid to schedule work here.
> > > > > > > > > > > >
> > > > > > > > > > > > Fixes: 2d01ddc86606 ("ext4: save error info to sb through journal if available")
> > > > > > > > > > > > Reported-by: Mahesh Kumar <maheshkumar657g@xxxxxxxxx>
> > > > > > > > > > > > Suggested-by: Jan Kara <jack@xxxxxxx>
> > > > > > > > > > > > Signed-off-by: Ojaswin Mujoo <ojaswin@xxxxxxxxxxxxx>
> > > > > > > > > > > > ---
> > > > > > > > > > > > fs/ext4/ext4.h | 2 ++
> > > > > > > > > > > > fs/ext4/ext4_jbd2.h | 8 ++++++++
> > > > > > > > > > > > fs/ext4/super.c | 4 +++-
> > > > > > > > > > > > 3 files changed, 13 insertions(+), 1 deletion(-)
> > > > > > > > > > > >
> > > > > > > > > > > > diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
> > > > > > > > > > > > index 2b7d781bfcad..d48e93bd5690 100644
> > > > > > > > > > > > --- a/fs/ext4/ext4.h
> > > > > > > > > > > > +++ b/fs/ext4/ext4.h
> > > > > > > > > > > > @@ -1728,6 +1728,8 @@ struct ext4_sb_info {
> > > > > > > > > > > > */
> > > > > > > > > > > > struct work_struct s_sb_upd_work;
> > > > > > > > > > > > + bool s_journal_destorying;
> > > > > > > > > > > > +
> > > > > > > > > > > > /* Atomic write unit values in bytes */
> > > > > > > > > > > > unsigned int s_awu_min;
> > > > > > > > > > > > unsigned int s_awu_max;
> > > > > > > > > > > > diff --git a/fs/ext4/ext4_jbd2.h b/fs/ext4/ext4_jbd2.h
> > > > > > > > > > > > index 9b3c9df02a39..6bd3ca84410d 100644
> > > > > > > > > > > > --- a/fs/ext4/ext4_jbd2.h
> > > > > > > > > > > > +++ b/fs/ext4/ext4_jbd2.h
> > > > > > > > > > > > @@ -437,6 +437,14 @@ static inline int ext4_journal_destroy(struct ext4_sb_info *sbi, journal_t *jour
> > > > > > > > > > > > {
> > > > > > > > > > > > int err = 0;
> > > > > > > > > > > > + /*
> > > > > > > > > > > > + * At this point all pending FS updates should be done except a possible
> > > > > > > > > > > > + * running transaction (which will commit in jbd2_journal_destroy). It
> > > > > > > > > > > > + * is now safe for any new errors to directly commit superblock rather
> > > > > > > > > > > > + * than going via journal.
> > > > > > > > > > > > + */
> > > > > > > > > > > > + sbi->s_journal_destorying = true;
> > > > > > > > > > > > +
> > > > > > > > > > > Hi, Ojaswin!
> > > > > > > > > > >
> > > > > > > > > > > I'm afraid you still need to flush the superblock update work here,
> > > > > > > > > > > otherwise I guess the race condition you mentioned in v1 could still
> > > > > > > > > > > occur.
> > > > > > > > > > >
> > > > > > > > > > > ext4_put_super()
> > > > > > > > > > > flush_work(&sbi->s_sb_upd_work)
> > > > > > > > > > >
> > > > > > > > > > > **kjournald2**
> > > > > > > > > > > jbd2_journal_commit_transaction()
> > > > > > > > > > > ...
> > > > > > > > > > > ext4_inode_error()
> > > > > > > > > > > /* JBD2_UNMOUNT not set */
> > > > > > > > > > > schedule_work(s_sb_upd_work)
> > > > > > > > > > >
> > > > > > > > > > > **workqueue**
> > > > > > > > > > > update_super_work
> > > > > > > > > > > /* s_journal_destorying is not set */
> > > > > > > > > > > if (journal && !s_journal_destorying)
> > > > > > > > > > >
> > > > > > > > > > > ext4_journal_destroy()
> > > > > > > > > > > /* set s_journal_destorying */
> > > > > > > > > > > sbi->s_journal_destorying = true;
> > > > > > > > > > > jbd2_journal_destroy()
> > > > > > > > > > > journal->j_flags |= JBD2_UNMOUNT;
> > > > > > > > > > >
> > > > > > > > > > > jbd2_journal_start()
> > > > > > > > > > > start_this_handle()
> > > > > > > > > > > BUG_ON(JBD2_UNMOUNT)
> > > > > > > > > > >
> > > > > > > > > > > Thanks,
> > > > > > > > > > > Yi.
> > > > > > > > > > Hi Yi,
> > > > > > > > > >
> > > > > > > > > > Yes you are right, somehow missed this edge case :(
> > > > > > > > > >
> > > > > > > > > > Alright then, we have to move out sbi->s_journal_destroying outside the
> > > > > > > > > > helper. Just wondering if I should still let it be in
> > > > > > > > > > ext4_journal_destroy and just add an extra s_journal_destroying = false
> > > > > > > > > > before schedule_work(s_sb_upd_work), because it makes sense.
> > > > > > > > > >
> > > > > > > > > > Okay let me give it some thought but thanks for pointing this out!
> > > > > > > > > >
> > > > > > > > > > Regards,
> > > > > > > > > > ojaswin
> > > > > > > > > Okay so thinking about it a bit more, I see you also suggested to flush
> > > > > > > > > the work after marking sbi->s_journal_destroying. But will that solve
> > > > > > > > > it?
> > > > > > > > >
> > > > > > > > > ext4_put_super()
> > > > > > > > > flush_work(&sbi->s_sb_upd_work)
> > > > > > > > > **kjournald2**
> > > > > > > > > jbd2_journal_commit_transaction()
> > > > > > > > > ...
> > > > > > > > > ext4_inode_error()
> > > > > > > > > /* JBD2_UNMOUNT not set */
> > > > > > > > > schedule_work(s_sb_upd_work)
> > > > > > > > > **workqueue**
> > > > > > > > > update_super_work
> > > > > > > > > /* s_journal_destorying is not set */
> > > > > > > > > if (journal && !s_journal_destorying)
> > > > > > > > > ext4_journal_destroy()
> > > > > > > > > /* set s_journal_destorying */
> > > > > > > > > sbi->s_journal_destorying = true;
> > > > > > > > > flush_work(&sbi->s_sb_upd_work)
> > > > > > > > > schedule_work()
> > > > > > > > ^^^^^^^^^^^^^^^
> > > > > > > > where does this come from?
> > > > > > > >
> > > > > > > > After this flush_work, we can guarantee that the running s_sb_upd_work
> > > > > > > > finishes before we set JBD2_UNMOUNT. Additionally, the journal will
> > > > > > > > not commit transaction or call schedule_work() again because it has
> > > > > > > > been aborted due to the previous error. Am I missing something?
> > > > > > > >
> > > > > > > > Thanks,
> > > > > > > > Yi.
> > > > > > > Hmm, so I am thinking of a corner case in ext4_handle_error() where
> > > > > > >
> > > > > > > if(journal && !is_journal_destroying)
> > > > > > >
> > > > > > > is computed but schedule_work() not called yet, which is possible cause
> > > > > > > the cmp followed by jump is not atomic in nature. If the schedule_work
> > > > > > > is only called after we have done the flush then we end up with this:
> > > > > > >
> > > > > > > if (journal && !s_journal_destorying)
> > > > > > > ext4_journal_destroy()
> > > > > > > /* set s_journal_destorying */
> > > > > > > sbi->s_journal_destorying = true;
> > > > > > > flush_work(&sbi->s_sb_upd_work)
> > > > > > > schedule_work()
> > > > > > >
> > > > > > > Which is possible IMO, although the window is tiny.
> > > > > > Yeah, right!
> > > > > > Sorry for misread the location where you add the "!s_journal_destorying"
> > > > > > check, the graph I provided was in update_super_work(), which was wrong.
> > > > > Oh right, I also misread your trace but yes as discussed, even
> > > > >
> > > > > sbi->s_journal_destorying = true;
> > > > > flush_work()
> > > > > jbd2_journal_destroy()
> > > > >
> > > > > doesn't work.
> > > > >
> > > > > > The right one should be:
> > > > > >
> > > > > > ext4_put_super()
> > > > > > flush_work(&sbi->s_sb_upd_work)
> > > > > >
> > > > > > **kjournald2**
> > > > > > jbd2_journal_commit_transaction()
> > > > > > ...
> > > > > > ext4_inode_error()
> > > > > > /* s_journal_destorying is not set */
> > > > > > if (journal && !s_journal_destorying)
> > > > > > (schedule_work(s_sb_upd_work)) //can be here
> > > > > >
> > > > > > ext4_journal_destroy()
> > > > > > /* set s_journal_destorying */
> > > > > > sbi->s_journal_destorying = true;
> > > > > > jbd2_journal_destroy()
> > > > > > journal->j_flags |= JBD2_UNMOUNT;
> > > > > >
> > > > > > (schedule_work(s_sb_upd_work)) //also can be here
> > > > > >
> > > > > > **workqueue**
> > > > > > update_super_work()
> > > > > > journal = sbi->s_journal //get journal
> > > > > > kfree(journal)
> > > > > > jbd2_journal_start(journal) //journal UAF
> > > > > > start_this_handle()
> > > > > > BUG_ON(JBD2_UNMOUNT) //bugon here
> > > > > >
> > > > > >
> > > > > > So there are two problems here, the first one is the 'journal' UAF,
> > > > > > the second one is triggering JBD2_UNMOUNT flag BUGON.
> > > > > Indeed, there's a possible UAF here as well.
> > > > >
> > > > > > > > > As for the fix, how about we do something like this:
> > > > > > > > >
> > > > > > > > > ext4_put_super()
> > > > > > > > >
> > > > > > > > > flush_work(&sbi->s_sb_upd_work)
> > > > > > > > > destroy_workqueue(sbi->rsv_conversion_wq);
> > > > > > > > >
> > > > > > > > > ext4_journal_destroy()
> > > > > > > > > /* set s_journal_destorying */
> > > > > > > > > sbi->s_journal_destorying = true;
> > > > > > > > >
> > > > > > > > > /* trigger a commit and wait for it to complete */
> > > > > > > > >
> > > > > > > > > flush_work(&sbi->s_sb_upd_work)
> > > > > > > > >
> > > > > > > > > jbd2_journal_destroy()
> > > > > > > > > journal->j_flags |= JBD2_UNMOUNT;
> > > > > > > > > jbd2_journal_start()
> > > > > > > > > start_this_handle()
> > > > > > > > > BUG_ON(JBD2_UNMOUNT)
> > > > > > > > >
> > > > > > > > > Still giving this codepath some thought but seems like this might just
> > > > > > > > > be enough to fix the race. Thoughts on this?
> > > > > > > > >
> > > > > > I think this solution should work, the forced commit and flush_work()
> > > > > > should ensure that the last transaction is committed and that the
> > > > > > potential work is done.
> > > > > >
> > > > > > Besides, the s_journal_destorying flag is set and check concurrently
> > > > > > now, so we need WRITE_ONCE() and READ_ONCE() for it. Besides, what
> > > > > > about adding a new flag into sbi->s_mount_state instead of adding
> > > > > > new s_journal_destorying?
> > > > > Right, that makes sence. I will incorporate these changes in the next
> > > > > revision.
> > > > >
> > > > Think about this again, it seems that we no longer need the destroying
> > > > flag. Because we force to commit and wait for the **last** transaction to
> > > > complete, and the flush work should also ensure that the last sb_update
> > > > work to complete. Regardless of whether it starts a new handle in the
> > > > last update_super_work(), it will not commit since the journal should
> > > > have aborted. What are your thoughts?
> > > >
> > > > ext4_put_super()
> > > > flush_work(&sbi->s_sb_upd_work)
> > > > destroy_workqueue(sbi->rsv_conversion_wq)
> > > >
> > > > ext4_journal_destroy()
> > > > /* trigger a commit (it will commit the last trnasaction) */
> > > >
> > > > **kjournald2**
> > > > jbd2_journal_commit_transaction()
> > > > ...
> > > > ext4_inode_error()
> > > > schedule_work(s_sb_upd_work))
> > > >
> > > > **workqueue**
> > > > update_super_work()
> > > > jbd2_journal_start(journal)
> > > > start_this_handle()
> > > > //This new trans will
> > > > //not be committed.
> > > >
> > > > jbd2_journal_abort()
> > > >
> > > > /* wait for it to complete */
> > > >
> > > > flush_work(&sbi->s_sb_upd_work)
> > > > jbd2_journal_destroy()
> > > > journal->j_flags |= JBD2_UNMOUNT;
> > > > jbd2_journal_commit_transaction() //it will commit nothing
> > > >
> > > > Thanks,
> > > > Yi.
> > > Hi Yi,
> > >
> > > There's one more path for which we need the flag:
> > >
> > > ext4_journal_destroy()
> > > /* trigger a commit (it will commit the last trnasaction) */
> > > **kjournald2**
> > > jbd2_journal_commit_transaction()
> > > journal->j_commit_callback()
> > > ext4_journal_commit_callback()
> > > ext4_maybe_update_superblock()
> > > schedule_work()
> > > /* start a transaction here */
> > > flush_work()
> > > jbd2_journal_destroy()
> > > journal_kill_thread
> > > flags |= JBD2_UNMOUNT
> > > jbd2_journal_commit_transaction()
> > > ...
> > > ext4_inode_error()
> > > schedule_work(s_sb_upd_work))
> > > /* update_super_work_tries to start the txn */
> > > BUG_ON()
> > Oops the formatting is wrong, here's the trace:
> >
> > ext4_journal_destroy()
> > /* trigger a commit (it will commit the last trnasaction) */
> >
> > **kjournald2**
> > jbd2_journal_commit_transaction()
> > journal->j_commit_callback()
> > ext4_journal_commit_callback()
> > ext4_maybe_update_superblock()
> > schedule_work()
> At this point, SB_ACTIVE should have been cleared,
> so ext4_maybe_update_superblock() should do nothing.
>
> With this in mind, it could be the case that an
> additional flag is no longer needed.
Yes got it now, all the backtraces are confusing me haha
Alright then, I feel we can take 2 approaches now.
- Without the flag, (flushing sb update wq 2 times):
This approach looks like how Yi has described here [1] however
this relies on the fact that the last update_super_work is only
called in journal is aborted and hence will never start a txn.
This is possible because we flush the sb 2 times:
ext4_put_super()
flush_work(s_sb_upd_work) // 1st flush, clears all pending updates
ext4_journal_destroy
/* commit & wait for txn */
flush_work(s_sb_upd_work) // 2nd flush, only has work if journal is aborted */
The first flush flushes all pending updates and 2nd one is only invoked
when the journal commit faces an error and hence never starts a new
txn. Thus everything works.
[1] https://lore.kernel.org/all/cover.1741270780.git.ojaswin@xxxxxxxxxxxxx/T/#m3de5dcae6afa979d01281f64b0c088131e72fc92
- With the flag. (single sb flush).
This is an alternate approach where we remove the flush_work() from
ext4_put_super() and only keep 1 in ext4_journal_destroy(). With this
we need to explicitly rely on a sbi/mount flag because the flush can
actually start a txn. WITHOUT the flag we can run into this:
**kjournald2**
jbd2_journal_commit_transaction()
journal->j_commit_callback()
ext4_journal_commit_callback()
ext4_maybe_update_superblock()
schedule_work()
ext4_put_super()
ext4_journal_destroy
/* commit & wait for txn */
flush_work(s_sb_upd_work) // This will have sb updates even if journal not aborted
/* update_super_work starts a txn */
jbd2_destroy_journal
flags |= JBD2_UNMOUNT
jbd2_journal_commit_transaction
/* error */
ext4_handle_error
scheduled_work()
/*update super work hits BUG_ON */
Hence the flag is needed in this approach.
I actually prefer the 2nd approach with the flag because it seems easier
to maintain in the long run than relying on the fragile
flush-commit-flush sequence, which might get silently broken as new code
is added.
I hope I didn't miss something this time. Thoughts? :)
Regards,
ojaswin
>
>
> Regards,
> Baokun
> >
> > /* update_super_work starts a new txn here */
> > flush_work()
> > jbd2_journal_destroy()
> > journal_kill_thread
> > flags |= JBD2_UNMOUNT
> > jbd2_journal_commit_transaction()
> > ...
> > ext4_inode_error()
> > schedule_work(s_sb_upd_work))
> > /* update_super_work_tries to start the txn */
> > BUG_ON()
> >
> > > I think this to protect against this path we do need a flag.
> > >
> > > Regards,
> > > ojaswin
>
>