Re: Infinite systemd loop when powering off a machine with multiple MD RAIDs

From: Guoqing Jiang
Date: Tue Aug 22 2023 - 08:47:44 EST


Hi Acelan,

On 8/22/23 16:13, AceLan Kao wrote:
Hello,
The issue is reproducible with IMSM metadata too; around 20% of reboots
hang. I will try to raise the priority in the bug because it is a valid
high - the base functionality of the system is affected.
Since it is reproducible on your side, is it possible to turn the
reproduction steps into a test case,
given the importance?
I didn't try to reproduce it locally yet because the customer was able to
bisect the regression and it pointed them to the same patch, so I connected
the dots and asked the author to take a look first. At first glance, I wanted
to get the community's voice to see if it is not something obvious.

As far as I know, the customer creates 3 IMSM RAID arrays, one of which is the
system volume, then does a reboot, and it sporadically fails (around 20% of the time). That is all.

I guess if all arrays have the MD_DELETED flag set, then reboot might
hang. Not sure whether the change
below (maybe we need to flush the wq as well before list_del) helps or not,
just FYI.

@@ -9566,8 +9566,10 @@ static int md_notify_reboot(struct notifier_block *this,

spin_lock(&all_mddevs_lock);
list_for_each_entry_safe(mddev, n, &all_mddevs, all_mddevs) {
- if (!mddev_get(mddev))
+ if (!mddev_get(mddev)) {
+ list_del(&mddev->all_mddevs);
continue;
+ }

My suggestion is to delete the list node in this scenario; did you try the above?

I am still not able to reproduce this, probably due to differences in
timing. Maybe we only need something like:

diff --git i/drivers/md/md.c w/drivers/md/md.c
index 5c3c19b8d509..ebb529b0faf8 100644
--- i/drivers/md/md.c
+++ w/drivers/md/md.c
@@ -9619,8 +9619,10 @@ static int md_notify_reboot(struct notifier_block *this,

spin_lock(&all_mddevs_lock);
list_for_each_entry_safe(mddev, n, &all_mddevs, all_mddevs) {
- if (!mddev_get(mddev))
+ if (!mddev_get(mddev)) {
+ need_delay = 1;
continue;
+ }
spin_unlock(&all_mddevs_lock);
if (mddev_trylock(mddev)) {
if (mddev->pers)


Thanks,
Song
I will try to reproduce issue at Intel lab to check this.

Thanks,
Mariusz
Hi Guoqing,

Here is the command I use to trigger the issue; I have to run it around 10
times to make sure the issue is reproducible:

echo "repair" | sudo tee /sys/class/block/md12?/md/sync_action && sudo
grub-reboot "Advanced options for Ubuntu>Ubuntu, with Linux 6.5.0-rc77
06a74159504-dirty" && head -c 1G < /dev/urandom > myfile1 && sleep 180
&& head -c 1G < /dev/urandom > myfile2 && sleep 1 && cat /proc/mdstat
&& sleep 1 && rm myfile1 &&
sudo reboot

Is the issue still reproducible if you remove the below from the cmd?

echo "repair" | sudo tee /sys/class/block/md12?/md/sync_action

Just want to know whether the resync thread is related to the issue or not.

And the patch adding need_delay doesn't work.

My assumption is that mddev_get always returns NULL, so setting need_delay
wouldn't help.

Thanks,
Guoqing