Re: md_raid: mdX_raid6 looping after sync_action "check" to "idle" transition

From: Yaohui Hu
Date: Thu May 30 2024 - 04:12:28 EST


Hi,

After applying the 7 suggested commits to kernel v6.1, we still see deadlocks in raid456, such as the following. Are there any other follow-up patches that fix this issue?

[2378440.310330] INFO: task systemd-journal:1415 blocked for more than 120 seconds.
[2378440.318649] Tainted: G OE K 6.1.51-0digitalocean2-generic #0digitalocean2
[2378440.328327] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[2378440.337306] task:systemd-journal state:D stack:0 pid:1415 ppid:1 flags:0x00000006
[2378440.337310] Call Trace:
[2378440.337312] <TASK>
[2378440.337315] __schedule+0x1f5/0x5c0
[2378440.337324] schedule+0x53/0xb0
[2378440.337327] io_schedule+0x42/0x70
[2378440.337329] folio_wait_bit_common+0x11f/0x310
[2378440.337335] ? filemap_invalidate_unlock_two+0x40/0x40
[2378440.337338] ? ext4_da_map_blocks.constprop.0+0x370/0x370
[2378440.337342] block_page_mkwrite+0x113/0x160
[2378440.337346] ext4_page_mkwrite+0x233/0x6d0
[2378440.337349] do_page_mkwrite+0x4d/0xe0
[2378440.337352] do_wp_page+0x1d9/0x440
[2378440.337355] __handle_mm_fault+0x50a/0x5d0
[2378440.337358] handle_mm_fault+0xe3/0x2e0
[2378440.337359] do_user_addr_fault+0x187/0x560
[2378440.337363] exc_page_fault+0x6c/0x150
[2378440.337366] asm_exc_page_fault+0x22/0x30
[2378440.337371] RIP: 0033:0x7f2b4b7c9e7a
[2378440.337377] RSP: 002b:00007ffcd6387340 EFLAGS: 00010202
[2378440.337379] RAX: 0000000004725c68 RBX: 000056420cb077f0 RCX: 00007f2b3a7983b0
[2378440.337381] RDX: 00007f2b3fb25c68 RSI: 000056420cb077f0 RDI: 00007f2b3a7983d8
[2378440.337382] RBP: 000056420caca8f0 R08: 00000000003983b0 R09: 00007ffcd6387370
[2378440.337383] R10: 86c0275e98fea605 R11: 0000000000244a4c R12: 000000000000000a
[2378440.337384] R13: e4bdb9e57ca6925c R14: 0000000000000000 R15: 00007ffcd6387378
[2378440.337386] </TASK>
[2378440.337587] INFO: task md8_resync:3952093 blocked for more than 120 seconds.
[2378440.345699] Tainted: G OE K 6.1.51-0digitalocean2-generic #0digitalocean2
[2378440.355371] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[2378440.364359] task:md8_resync state:D stack:0 pid:3952093 ppid:2 flags:0x00004000
[2378440.364362] Call Trace:
[2378440.364363] <TASK>
[2378440.364364] __schedule+0x1f5/0x5c0
[2378440.364368] schedule+0x53/0xb0
[2378440.364372] raid5_get_active_stripe+0x1f6/0x290 [raid456]
[2378440.364381] ? destroy_sched_domains_rcu+0x30/0x30
[2378440.364385] raid5_sync_request+0x289/0x2c0 [raid456]
[2378440.364392] ? is_mddev_idle+0xb5/0x115
[2378440.364395] md_do_sync.cold+0x3eb/0x96b
[2378440.364398] ? destroy_sched_domains_rcu+0x30/0x30
[2378440.364400] md_thread+0xa9/0x160
[2378440.364406] ? super_90_load.part.0+0x300/0x300
[2378440.364408] kthread+0xb9/0xe0
[2378440.364412] ? kthread_complete_and_exit+0x20/0x20
[2378440.364414] ret_from_fork+0x1f/0x30
[2378440.364418] </TASK>

Thanks