Re: [PATCH v2 0/3] Fix a BUG_ON crashing the kernel in start_this_handle
From: Ojaswin Mujoo
Date: Sat Mar 08 2025 - 08:08:14 EST
On Sat, Mar 08, 2025 at 09:09:28AM +0800, Baokun Li wrote:
> On 2025/3/6 22:28, Ojaswin Mujoo wrote:
> > ** Changes since v1 [1] **
> >
> > * Picked up RVBs from Jan and Ritesh
> > * In patch 2/3, we now use a flag in sbi instead of SB_ACITVE
> > to determine when to journal sb vs when to commit directly.
> > * Added a prep patch 1/3
> >
> > [1] https://lore.kernel.org/linux-ext4/cover.1740212945.git.ojaswin@xxxxxxxxxxxxx/T/#m5e659425b8c8fe2ac01e7242b77fed315ff89db4
> >
> > @Baokun, I didn't get a chance to look into the journal_inode
> > modifications we were discussing in [2]. I'll try to spend some time and
> > send that as a separate patch. Hope that's okay
> >
> > [2] https://lore.kernel.org/linux-ext4/cover.1740212945.git.ojaswin@xxxxxxxxxxxxx/T/#mad8feb44d9b6ddadf87830b92caa7b78d902dc05
> That's fine, it's not a priority. And if this patch set makes sure we
> don't crash when things go wrong, I'm okay with leaving it as is.
>
> It's possible that jbd2_journal_commit_transaction() could call
> ext4_handle_error() in other places as the code evolves. Fixing known
> problems and protecting against potential ones is always a good thing.
Yep thats true, I did spend some time on this since the codepath was a
bit unfamiliar to me. Seems like a straighforward enough change. I'll
add it to the next patch.
thanks,
ojaswin
>
>
> Cheers,
> Baokun
> > ** Original Cover **
> >
> > When running LTP stress tests on ext4, after a multiday run we seemed to
> > have hit the following BUG_ON:
> >
> > [NIP : start_this_handle+268]
> > #3 [c000001067c27a40] start_this_handle at c008000004d40f74 [jbd2] (unreliable)
> > #4 [c000001067c27b60] jbd2__journal_start at c008000004d415cc [jbd2]
> > #5 [c000001067c27be0] update_super_work at c0080000053f9758 [ext4]
> > #6 [c000001067c27c70] process_one_work at c000000000188790
> > #7 [c000001067c27d20] worker_thread at c00000000018973c
> > #8 [c000001067c27dc0] kthread at c000000000196c84
> > #9 [c000001067c27e10] ret_from_kernel_thread at c00000000000cd64
> >
> > Which comes out to
> >
> > 382 repeat:
> > 383 read_lock(&journal->j_state_lock);
> > * 384 BUG_ON(journal->j_flags & JBD2_UNMOUNT);
> > 385 if (is_journal_aborted(journal) ||
> > 386 (journal->j_errno != 0 && !(journal->j_flags & JBD2_ACK_ERR))) {
> > 387 read_unlock(&journal->j_state_lock);
> >
> >
> > Initially this seemed like it should never happen but upon crash
> > analysis it seems like it could indeed be hit as described in patch 1/2.
> >
> > I would like to add that through the logs we only knew that:
> >
> > - ext4_journal_bmap -> ext4_map_blocks is failing with EFSCORRUPTED.
> > - update_super_work had hit the BUG_ON
> >
> > I was not able to hit this bug again (without modifying the kernel to
> > inject errors) but the above backtrace seems to be one possible paths
> > where this BUG_ON can be hit. Rest of the analysis and fix is in patch
> > 2/3. Patch 3 is just a small tweak that i found helpful while debugging.
> >
> > That being said, journalling is something I'm not very familiar with and
> > there might be gaps in my understanding so thoughts and suggestions are
> > welcome.
> >
> > Ojaswin Mujoo (3):
> > ext4: define ext4_journal_destroy wrapper
> > ext4: avoid journaling sb update on error if journal is destroying
> > ext4: Make sb update interval tunable
> >
> > fs/ext4/ext4.h | 11 +++++++++++
> > fs/ext4/ext4_jbd2.h | 22 ++++++++++++++++++++++
> > fs/ext4/super.c | 35 +++++++++++++++++------------------
> > fs/ext4/sysfs.c | 4 ++++
> > 4 files changed, 54 insertions(+), 18 deletions(-)
> >
>