Re: Lockup of (raid5 or raid6) + vdo after taking out a disk under load

From: Konstantin Kharlamov
Date: Fri Jul 12 2024 - 08:12:16 EST


On Fri, 2024-07-12 at 16:35 +0800, Yu Kuai wrote:
> Hi,
>
> 在 2024/07/12 14:53, Konstantin Kharlamov 写道:
>
> > # Steps to reproduce
> >
> > 1. Create raid5 LV + VDO by executing a `./mk_lvm_raid5.sh
> > /dev/sdX1
> > /dev/sdY1 /dev/sdZ1` where `mk_lvm_raid5.sh` has the following
> > content:
> >
> >      #!/bin/bash
> >
> >      set -exu
> >
> >      if [ "$#" -ne 3 ]; then
> >          echo "Wrong number of parameters.
> >      Usage: $(basename $0) disk1 disk2 disk3"
> >          exit 1
> >      fi
> >
> >      # create the VG
> >      pvcreate -f "$1" "$2" "$3"
> >      vgcreate p_r5 "$1" "$2" "$3"
> >
> >      # create the LV
> >      lvcreate --type raid5 -i 2 -L 21474836480b -I 64K -n
> > vdo_internal_deco_vol p_r5 -y
> >      lvconvert -y --type vdo-pool --virtualsize 107374182400B -n
> > deco_vol p_r5/vdo_internal_deco_vol
> >
> > 2. Start load by executing `fio ./fio-30%write.fio` with the `fio-
> > 30%write.fio` having the following content:
> >
> >      [test IOPS]
> >      blocksize=8k
> >      filename=/dev/p_r5/deco_vol
> >      filesize=100G
> >      direct=1
> >      buffered=0
> >      ioengine=libaio
> >      iodepth=32
> >      rw=randrw
> >      rwmixwrite=30
> >      numjobs=4
> >      group_reporting
> >      time_based
> >      runtime=99h
> >      clat_percentiles=0
> >      unlink=1
> >
> > 3. Wait for about a minute
> > 4. Remove a disk of the volume group, either physically, or by
> > turning
> > off jbod slot's power (DO NOT use /…/device/delete).
>
> Looks like this is because IO is failed from raid level, and then dm
> level keep retry this IO(This will be related to the step 4), hence
> raid5d stuck in the loop to hanlde new IO.
>
> Can you give the following patch a test to confirm this?
>
> Thanks!
> Kuai
>
> diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
> index c14cf2410365..a0f784cd664c 100644
> --- a/drivers/md/raid5.c
> +++ b/drivers/md/raid5.c
> @@ -6776,7 +6776,9 @@ static void raid5d(struct md_thread *thread)
>
>                  while ((bio = remove_bio_from_retry(conf, &offset)))
> {
>                          int ok;
> +
>                          spin_unlock_irq(&conf->device_lock);
> +                       cond_resched();
>                          ok = retry_aligned_read(conf, bio, offset);
>                          spin_lock_irq(&conf->device_lock);
>                          if (!ok)
> @@ -6790,11 +6792,11 @@ static void raid5d(struct md_thread *thread)
>                          break;
>                  handled += batch_size;
>
> -               if (mddev->sb_flags & ~(1 << MD_SB_CHANGE_PENDING)) {
> -                       spin_unlock_irq(&conf->device_lock);
> +               spin_unlock_irq(&conf->device_lock);
> +               if (mddev->sb_flags & ~(1 << MD_SB_CHANGE_PENDING))
>                          md_check_recovery(mddev);
> -                       spin_lock_irq(&conf->device_lock);
> -               }
> +               cond_resched();
> +               spin_lock_irq(&conf->device_lock);
>          }
>          pr_debug("%d stripes handled\n", handled);

Thank you for the quick answer! FTR, I had applied the diff by manually
editing the code because tabs got converted to spaces in the text, so
it wasn't applicable. Hopefully I got everything correct. (in
retrospection, I could've used `--ignore-whitespace`. Oh, well…)

So, there are both good and bad news.

Good news: you diff seems to have fixed the problem! I would have to
test more extensively in another environment to be completely sure, but
by following the minimal steps-to-reproduce I can no longer reproduce
the problem, so it seems to have fixed the problem.

Bad news: there's a new lockup now 😄 This one seems to happen after
the disk is returned back; unless the action of returning back matches
accidentally the appearing stacktraces, which still might be possible
even though I re-tested multiple times. It's because the traces
(below) seems not to always appear. However, even when traces do not
appear, IO load on the fio that's running in the background drops to
zero, so something seems definitely wrong.

[ 475.353727] sd 0:0:25:0: [sdbu] Attached SCSI disk
[ 615.173084] INFO: task dm-vdo0:bioQ0:11500 blocked for more than 122 seconds.
[ 615.173326] Not tainted 6.9.8-bstrg #4
[ 615.173525] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 615.173735] task:dm-vdo0:bioQ0 state:D stack:0 pid:11500 tgid:11500 ppid:2 flags:0x00004000
[ 615.173961] Call Trace:
[ 615.174180] <TASK>
[ 615.174402] __schedule+0x376/0xb30
[ 615.174636] schedule+0x2f/0x110
[ 615.174862] schedule_timeout+0x15a/0x170
[ 615.175092] ? _raw_spin_unlock_irqrestore+0x1e/0x40
[ 615.175326] ? __wake_up+0x40/0x60
[ 615.175563] wait_woken+0x73/0x80
[ 615.175793] raid5_make_request+0x5dc/0x12f0 [raid456]
[ 615.176033] ? kmem_cache_alloc+0x4e/0x2f0
[ 615.176257] ? __pfx_woken_wake_function+0x10/0x10
[ 615.176496] md_handle_request+0x15e/0x2a0
[ 615.176730] raid_map+0x31/0x60 [dm_raid]
[ 615.176966] __map_bio+0x181/0x1b0
[ 615.177203] dm_submit_bio+0x194/0x550
[ 615.177440] __submit_bio+0x97/0x130
[ 615.177680] submit_bio_noacct_nocheck+0x18c/0x3c0
[ 615.177925] submit_data_vio+0xba/0x100
[ 615.178171] work_queue_runner+0x215/0x290
[ 615.178416] ? __pfx_autoremove_wake_function+0x10/0x10
[ 615.178664] ? __pfx_work_queue_runner+0x10/0x10
[ 615.178914] kthread+0xff/0x130
[ 615.179168] ? __pfx_kthread+0x10/0x10
[ 615.179422] ret_from_fork+0x30/0x50
[ 615.179687] ? __pfx_kthread+0x10/0x10
[ 615.179947] ret_from_fork_asm+0x1a/0x30
[ 615.180215] </TASK>
[ 615.180475] INFO: task dm-vdo0:bioQ1:11501 blocked for more than 122 seconds.
[ 615.180750] Not tainted 6.9.8-bstrg #4
[ 615.181028] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 615.181319] task:dm-vdo0:bioQ1 state:D stack:0 pid:11501 tgid:11501 ppid:2 flags:0x00004000
[ 615.181624] Call Trace:
[ 615.181926] <TASK>
[ 615.182228] __schedule+0x376/0xb30
[ 615.182529] schedule+0x2f/0x110
[ 615.182821] schedule_timeout+0x15a/0x170
[ 615.183117] ? _raw_spin_unlock_irqrestore+0x1e/0x40
[ 615.183410] ? __wake_up+0x40/0x60
[ 615.183701] wait_woken+0x73/0x80
[ 615.183987] raid5_make_request+0x5dc/0x12f0 [raid456]
[ 615.184280] ? mempool_alloc+0x61/0x1b0
[ 615.184565] ? kmem_cache_alloc+0x4e/0x2f0
[ 615.184845] ? __pfx_woken_wake_function+0x10/0x10
[ 615.185126] md_handle_request+0x15e/0x2a0
[ 615.185402] raid_map+0x31/0x60 [dm_raid]
[ 615.185695] __map_bio+0x181/0x1b0
[ 615.185979] dm_submit_bio+0x194/0x550
[ 615.186255] __submit_bio+0x97/0x130
[ 615.186529] submit_bio_noacct_nocheck+0x18c/0x3c0
[ 615.186804] submit_data_vio+0xba/0x100
[ 615.187076] work_queue_runner+0x215/0x290
[ 615.187350] ? __pfx_autoremove_wake_function+0x10/0x10
[ 615.187628] ? __pfx_work_queue_runner+0x10/0x10
[ 615.187905] kthread+0xff/0x130
[ 615.188180] ? __pfx_kthread+0x10/0x10
[ 615.188456] ret_from_fork+0x30/0x50
[ 615.188729] ? __pfx_kthread+0x10/0x10
[ 615.189002] ret_from_fork_asm+0x1a/0x30
[ 615.189276] </TASK>
[ 615.189541] INFO: task dm-vdo0:bioQ3:11504 blocked for more than 122 seconds.
[ 615.189823] Not tainted 6.9.8-bstrg #4
[ 615.190102] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 615.190392] task:dm-vdo0:bioQ3 state:D stack:0 pid:11504 tgid:11504 ppid:2 flags:0x00004000
[ 615.190694] Call Trace:
[ 615.190989] <TASK>
[ 615.191283] __schedule+0x376/0xb30
[ 615.191580] schedule+0x2f/0x110
[ 615.191875] schedule_timeout+0x15a/0x170
[ 615.192168] ? _raw_spin_unlock_irqrestore+0x1e/0x40
[ 615.192463] ? __wake_up+0x40/0x60
[ 615.192757] wait_woken+0x73/0x80
[ 615.193057] raid5_make_request+0x5dc/0x12f0 [raid456]
[ 615.193365] ? mempool_alloc+0x61/0x1b0
[ 615.193663] ? kmem_cache_alloc+0x4e/0x2f0
[ 615.193961] ? __pfx_woken_wake_function+0x10/0x10
[ 615.194257] md_handle_request+0x15e/0x2a0
[ 615.194548] raid_map+0x31/0x60 [dm_raid]
[ 615.194836] __map_bio+0x181/0x1b0
[ 615.195114] dm_submit_bio+0x194/0x550
[ 615.195386] __submit_bio+0x97/0x130
[ 615.195657] submit_bio_noacct_nocheck+0x18c/0x3c0
[ 615.195932] submit_data_vio+0xba/0x100
[ 615.196206] work_queue_runner+0x215/0x290
[ 615.196488] ? __pfx_autoremove_wake_function+0x10/0x10
[ 615.196767] ? __pfx_work_queue_runner+0x10/0x10
[ 615.197044] kthread+0xff/0x130
[ 615.197321] ? __pfx_kthread+0x10/0x10
[ 615.197596] ret_from_fork+0x30/0x50
[ 615.197868] ? __pfx_kthread+0x10/0x10
[ 615.198141] ret_from_fork_asm+0x1a/0x30
[ 615.198415] </TASK>