RE: Tasks stuck jbd2 for a long time

From: Lu, Davina
Date: Thu Aug 17 2023 - 21:33:04 EST


Hi Bhatnagar,

Looks like this is a similar issue I saw before with fio test (buffered IO with 100 threads), it is also shows "ext4-rsv-conversion" work queue takes lots CPU and make journal update every stuck. It stuck at
[<0>] do_get_write_access+0x291/0x350 [jbd2]
[<0>] jbd2_journal_get_write_access+0x67/0x90 [jbd2]
[<0>] __ext4_journal_get_write_access+0x44/0x90 [ext4]
[<0>] ext4_reserve_inode_write+0x83/0xc0 [ext4]
[<0>] __ext4_mark_inode_dirty+0x50/0x120 [ext4]
[<0>] ext4_convert_unwritten_extents+0x179/0x220 [ext4]
[<0>] ext4_convert_unwritten_io_end_vec+0x64/0xe0 [ext4]
[<0>] ext4_do_flush_completed_IO.isra.0+0xf5/0x190 [ext4]
[<0>] process_one_work+0x1b0/0x350
[<0>] worker_thread+0x49/0x310
[<0>] kthread+0x11b/0x140
[<0>] ret_from_fork+0x22/0x30

And show the lock related to
&journal→j_state_lock-W: waittime-avg 178.75us, holdtime-avg 4.84 us.
&journal→j_state_lock-R: waittime-avg 1269.72us, holdtime-avg 11.07 us.

There is a patch and see if this is the same issue? this is not the finial patch since there may have some issue from Ted. I will forward that email to you in a different loop. I didn't continue on this patch that time since we thought is might not be the real case in RDS.

diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 091db733834e..b3c7544798b8 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -5212,7 +5213,9 @@ static int __ext4_fill_super(struct fs_context *fc, struct super_block *sb)
* concurrency isn't really necessary. Limit it to 1.
*/
EXT4_SB(sb)->rsv_conversion_wq =
- alloc_workqueue("ext4-rsv-conversion", WQ_MEM_RECLAIM | WQ_UNBOUND, 1);
+ alloc_workqueue("ext4-rsv-conversion",
+ WQ_MEM_RECLAIM | WQ_UNBOUND | __WQ_ORDERED,
+ num_active_cpus() > 1 ? num_active_cpus() : 1);

Thanks
Davina
-----Original Message-----
From: Bhatnagar, Rishabh <risbhat@xxxxxxxxxx>
Sent: Friday, August 18, 2023 4:59 AM
To: Jan Kara <jack@xxxxxxx>
Cc: Theodore Ts'o <tytso@xxxxxxx>; jack@xxxxxxxx; linux-ext4@xxxxxxxxxxxxxxx; linux-kernel@xxxxxxxxxxxxxxx; gregkh@xxxxxxxxxxxxxxxxxxx; Park, SeongJae <sjpark@xxxxxxxxxx>
Subject: Re: Tasks stuck jbd2 for a long time


On 8/17/23 3:49 AM, Jan Kara wrote:
> CAUTION: This email originated from outside of the organization. Do not click links or open attachments unless you can confirm the sender and know the content is safe.
>
>
>
> On Wed 16-08-23 15:53:05, Bhatnagar, Rishabh wrote:
>> I collected dump and looked at some processes that were stuck in
>> uninterruptible sleep.These are from upstream stable tree:
>> https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree
>> /?h=linux-5.10.y
>> (5.10.191)
>>
>> One of them is the journal thread that is waiting for some other
>> thread to close transaction handle.
>>
>> PID: 10642 TASK: ffff9768823f4000 CPU: 37 COMMAND: "jbd2/md0-8"
>> #0 [ffffbd6c40c17c60] __schedule+617 at ffffffffbb912df9
>> #1 [ffffbd6c40c17cf8] schedule+60 at ffffffffbb91330c
>> #2 [ffffbd6c40c17d08] jbd2_journal_commit_transaction+877 at
>> ffffffffc016b90d [jbd2] (/home/ec2-user/linux/fs/jbd2/commit.c:497)
>> #3 [ffffbd6c40c17ea0] kjournald2+282 at ffffffffc01723ba [jbd2]
>> (/home/ec2-user/linux/fs/jbd2/journal.c:214)
>> #4 [ffffbd6c40c17f10] kthread+279 at ffffffffbb0b9167
>> #5 [ffffbd6c40c17f50] ret_from_fork+34 at ffffffffbb003802
> Yes, correct. This is waiting for transaction->t_updates to drop to 0.
>
>> One of threads that have started the handle and waiting for journal
>> to commit and unlock the current transaction. This stack only shows
>> ext4lazyinit but with lazyinit disabled we have seen other threads
>> stuck in same place.
>>
>> PID: 10644 TASK: ffff976901010000 CPU: 37 COMMAND: "ext4lazyinit"
>> #0 [ffffbd6c40c1fbe0] __schedule+617 at ffffffffbb912df9
>> #1 [ffffbd6c40c1fc78] schedule+60 at ffffffffbb91330c
>> #2 [ffffbd6c40c1fc88] wait_transaction_locked+137 at
>> ffffffffc0168089 [jbd2] (/home/ec2-user/linux/fs/jbd2/transaction.c:184)
>> #3 [ffffbd6c40c1fcd8] add_transaction_credits+62 at
>> ffffffffc016813e [jbd2]
>> (/home/ec2-user/linux/fs/jbd2/transaction.c:241)
>> #4 [ffffbd6c40c1fd30] start_this_handle+533 at ffffffffc0168615
>> [jbd2]
>> (/home/ec2-user/linux/fs/jbd2/transaction.c:416)
>> #5 [ffffbd6c40c1fdc0] jbd2__journal_start+244 at ffffffffc0168dc4 [jbd2]
>> #6 [ffffbd6c40c1fe00] __ext4_journal_start_sb+250 at
>> ffffffffc02ef65a [ext4]
>> #7 [ffffbd6c40c1fe40] ext4_init_inode_table+190 at ffffffffc0302ace [ext4]
>> #8 [ffffbd6c40c1feb0] ext4_lazyinit_thread+906 at ffffffffc033ec9a [ext4]
>> #9 [ffffbd6c40c1ff10] kthread+279 at ffffffffbb0b9167
>> #10 [ffffbd6c40c1ff50] ret_from_fork+34 at ffffffffbb003802
> This thread actually didn't start a transaction. It is *trying* to
> start a transaction but it has failed and we are now waiting for
> transaction commit to proceed (i.e., for jbd2/md0-8 process). So this
> isn't the process jbd2 is waiting for.
>
>> To replicate the download scenario i'm just using dd to copy random
>> data to disk. I launch a bunch of threads and try to stress the
>> system. Many of those threads seem to be stuck in
>> balance_dirty_pages_ratelimited as can be seen below.
>>
>> PID: 10709 TASK: ffff9769016f8000 CPU: 25 COMMAND: "dd"
>> #0 [ffffbd6c40dafa48] __schedule+617 at ffffffffbb912df9
>> #1 [ffffbd6c40dafae0] schedule+60 at ffffffffbb91330c
>> #2 [ffffbd6c40dafaf0] schedule_timeout+570 at ffffffffbb916a7a
>> #3 [ffffbd6c40dafb68] io_schedule_timeout+25 at ffffffffbb913619
>> ((inlined
>> by) io_schedule_finish at /home/ec2-user/linux/kernel/sched/core.c:6274)
>> #4 [ffffbd6c40dafb80] balance_dirty_pages+654 at ffffffffbb2367ce
>> (/home/ec2-user/linux/mm/page-writeback.c:1799)
>> #5 [ffffbd6c40dafcf0] balance_dirty_pages_ratelimited+763 at
>> ffffffffbb23752b (/home/ec2-user/linux/mm/page-writeback.c:1926)
>> #6 [ffffbd6c40dafd18] generic_perform_write+308 at ffffffffbb22af44
>> (/home/ec2-user/linux/mm/filemap.c:3370)
>> #7 [ffffbd6c40dafd88] ext4_buffered_write_iter+161 at
>> ffffffffc02fcba1 [ext4] (/home/ec2-user/linux/fs/ext4/file.c:273)
>> #8 [ffffbd6c40dafdb8] ext4_file_write_iter+96 at ffffffffc02fccf0 [ext4]
>> #9 [ffffbd6c40dafe40] new_sync_write+287 at ffffffffbb2e0c0f
>> #10 [ffffbd6c40dafec8] vfs_write+481 at ffffffffbb2e3161
>> #11 [ffffbd6c40daff00] ksys_write+165 at ffffffffbb2e3385
>> #12 [ffffbd6c40daff40] do_syscall_64+51 at ffffffffbb906213
>> #13 [ffffbd6c40daff50] entry_SYSCALL_64_after_hwframe+103 at
>> ffffffffbba000df
> Yes, this is waiting for page writeback to reduce amount of dirty
> pages in the pagecache. We are not holding transaction handle during
> this wait so this is also not the task jbd2 is waiting for.
>
>> There are other dd threads that are trying to read and are handling
>> page fault. These are in runnable state and not uninterruptible sleep.
>>
>> PID: 14581 TASK: ffff97c3cfdbc000 CPU: 29 COMMAND: "dd"
>> #0 [ffffbd6c4a1d3598] __schedule+617 at ffffffffbb912df9
>> #1 [ffffbd6c4a1d3630] _cond_resched+38 at ffffffffbb9133e6
>> #2 [ffffbd6c4a1d3638] shrink_page_list+126 at ffffffffbb2412fe
>> #3 [ffffbd6c4a1d36c8] shrink_inactive_list+478 at ffffffffbb24441e
>> #4 [ffffbd6c4a1d3768] shrink_lruvec+957 at ffffffffbb244e3d
>> #5 [ffffbd6c4a1d3870] shrink_node+552 at ffffffffbb2452a8
>> #6 [ffffbd6c4a1d38f0] do_try_to_free_pages+201 at ffffffffbb245829
>> #7 [ffffbd6c4a1d3940] try_to_free_pages+239 at ffffffffbb246c0f
>> #8 [ffffbd6c4a1d39d8] __alloc_pages_slowpath.constprop.114+913 at
>> ffffffffbb28d741
>> #9 [ffffbd6c4a1d3ab8] __alloc_pages_nodemask+679 at
>> ffffffffbb28e2e7
>> #10 [ffffbd6c4a1d3b28] alloc_pages_vma+124 at ffffffffbb2a734c
>> #11 [ffffbd6c4a1d3b68] handle_mm_fault+3999 at ffffffffbb26de2f
>> #12 [ffffbd6c4a1d3c28] exc_page_fault+708 at ffffffffbb909c84
>> #13 [ffffbd6c4a1d3c80] asm_exc_page_fault+30 at ffffffffbba00b4e
>> #14 [ffffbd6c4a1d3d30] copyout+28 at ffffffffbb5160bc
>> #15 [ffffbd6c4a1d3d38] _copy_to_iter+158 at ffffffffbb5188de
>> #16 [ffffbd6c4a1d3d98] get_random_bytes_user+136 at ffffffffbb644608
>> #17 [ffffbd6c4a1d3e48] new_sync_read+284 at ffffffffbb2e0a5c
>> #18 [ffffbd6c4a1d3ed0] vfs_read+353 at ffffffffbb2e2f51
>> #19 [ffffbd6c4a1d3f00] ksys_read+165 at ffffffffbb2e3265
>> #20 [ffffbd6c4a1d3f40] do_syscall_64+51 at ffffffffbb906213
>> #21 [ffffbd6c4a1d3f50] entry_SYSCALL_64_after_hwframe+103 at
>> ffffffffbba000df
> This process is in direct reclaim trying to free more memory. It
> doesn't have transaction handle started so jbd2 also isn't waiting for
> this process.
>
>>>> Can low available memory be a reason for a thread to not be able to
>>>> close the transaction handle for a long time?
>>>> Maybe some writeback thread starts the handle but is not able to
>>>> complete writeback?
>>> Well, even that would be a bug but low memory conditions are
>>> certainly some of less tested paths so it is possible there's a bug lurking there.
>> Amongst the things we have tested 2 things seem to give good improvements.
>>
>> One is disabling journalling. We don't see any stuck tasks. System
>> becomes slow but eventually recovers. But its not something we want to disable.
>>
>> Other is enabling swap memory. Adding some swap memory also avoids
>> system going into low memory state and system doesn't freeze.
> OK, these are just workarounds. The question really is which process
> holds the transaction handle jbd2 thread is waiting for. It is none of
> the processes you have shown above. Since you have the crashdump, you
> can also search all the processes and find those which have non-zero
> task->journal_info. And from these processes you can select those
> task->where journal_info points to an object from jbd2_handle_cache
> task->and then you
> can verify whether the handles indeed point (through
> handle->h_transaction) to the transaction jbd2 thread is trying to
> commit. After you've identified such task it is interesting to see what is it doing...
>
Hi Jan

I think I found the thread that is holding the transaction handle. It seems to be in runnable state though.
It has the journal_info set to the journal handle that has the matching transaction as the journal's running transaction.
Here is the associated stack trace. It is converting unwritten extents to extents.

PID: 287    TASK: ffff976801890000  CPU: 20  COMMAND: "kworker/u96:35"
 #0 [ffffbd6c40b3f498] __schedule+617 at ffffffffbb912df9
 #1 [ffffbd6c40b3f530] _cond_resched+38 at ffffffffbb9133e6
 #2 [ffffbd6c40b3f538] shrink_lruvec+670 at ffffffffbb244d1e
 #3 [ffffbd6c40b3f640] _cond_resched+21 at ffffffffbb9133d5
 #4 [ffffbd6c40b3f648] shrink_node+552 at ffffffffbb2452a8
 #5 [ffffbd6c40b3f6c8] do_try_to_free_pages+201 at ffffffffbb245829
 #6 [ffffbd6c40b3f718] try_to_free_pages+239 at ffffffffbb246c0f
 #7 [ffffbd6c40b3f7b0] __alloc_pages_slowpath.constprop.114+913 at
ffffffffbb28d741
 #8 [ffffbd6c40b3f890] __alloc_pages_nodemask+679 at ffffffffbb28e2e7
 #9 [ffffbd6c40b3f900] allocate_slab+726 at ffffffffbb2b0886
#10 [ffffbd6c40b3f958] ___slab_alloc+1173 at ffffffffbb2b3ff5
#11 [ffffbd6c40b3f988] insert_revoke_hash+37 at ffffffffc016f435 [jbd2]
(/home/ec2-user/linux/fs/jbd2/revoke.c:146)
#12 [ffffbd6c40b3f9b8] kmem_cache_free+924 at ffffffffbb2b712c )(inlined
by) slab_alloc at /home/ec2-user/linux/mm/slub.c:2904)
#13 [ffffbd6c40b3fa18] insert_revoke_hash+37 at ffffffffc016f435 [jbd2]
(/home/ec2-user/linux/fs/jbd2/revoke.c:146)
#14 [ffffbd6c40b3fa40] kmem_cache_alloc+928 at ffffffffbb2b4590
(/home/ec2-user/linux/mm/slub.c:290)
#15 [ffffbd6c40b3fa78] insert_revoke_hash+37 at ffffffffc016f435 [jbd2]
(/home/ec2-user/linux/fs/jbd2/revoke.c:146)
#16 [ffffbd6c40b3faa0] __ext4_forget+338 at ffffffffc02efb32 [ext4]
(/home/ec2-user/linux/fs/ext4/ext4_jbd2.c:298)
#17 [ffffbd6c40b3fae0] ext4_free_blocks+2437 at ffffffffc031fd55 [ext4]
(/home/ec2-user/linux/fs/ext4/mballoc.c:5709 (discriminator 2))
#18 [ffffbd6c40b3fbb0] ext4_ext_handle_unwritten_extents+596 at
ffffffffc02f56a4 [ext4] ((inlined by) ext4_ext_handle_unwritten_extents at /home/ec2-user/linux/fs/ext4/extents.c:3892)
#19 [ffffbd6c40b3fc98] ext4_ext_map_blocks+1325 at ffffffffc02f710d [ext4] (/home/ec2-user/linux/fs/ext4/extents.c:4165)
#20 [ffffbd6c40b3fd60] ext4_map_blocks+813 at ffffffffc030bd0d [ext4]
(/home/ec2-user/linux/fs/ext4/inode.c:659)
#21 [ffffbd6c40b3fdd0] ext4_convert_unwritten_extents+303 at ffffffffc02f8adf [ext4] (/home/ec2-user/linux/fs/ext4/extents.c:4810)
#22 [ffffbd6c40b3fe28] ext4_convert_unwritten_io_end_vec+95 at ffffffffc02f8c5f [ext4] (/home/ec2-user/linux/fs/ext4/extents.c:4850)
#23 [ffffbd6c40b3fe58] ext4_end_io_rsv_work+269 at ffffffffc032c3fd [ext4] ((inlined by) ext4_do_flush_completed_IO at
/home/ec2-user/linux/fs/ext4/page-io.c:262)
#24 [ffffbd6c40b3fe98] process_one_work+405 at ffffffffbb0b2725
#25 [ffffbd6c40b3fed8] worker_thread+48 at ffffffffbb0b2920
#26 [ffffbd6c40b3ff10] kthread+279 at ffffffffbb0b9167
#27 [ffffbd6c40b3ff50] ret_from_fork+34 at ffffffffbb003802

Thanks
Rishabh