Re: [PATCH -next 0/9] dm-raid, md/raid: fix v6.7 regressions part2

From: Yu Kuai
Date: Mon Mar 04 2024 - 06:53:18 EST

Next message: Dhruva Gole: "Re: [PATCH 2/2] OPP: debugfs: Fix warning around icc_get_name()"
Previous message: Andy Shevchenko: "Re: [PATCH 4/4] iio: pressure: Add triggered buffer support for BMP280 driver"
In reply to: Xiao Ni: "Re: [PATCH -next 0/9] dm-raid, md/raid: fix v6.7 regressions part2"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

Hi,

在 2024/03/04 19:06, Xiao Ni 写道:

On Mon, Mar 4, 2024 at 4:27 PM Xiao Ni <xni@xxxxxxxxxx> wrote:

On Mon, Mar 4, 2024 at 9:25 AM Xiao Ni <xni@xxxxxxxxxx> wrote:

On Mon, Mar 4, 2024 at 9:24 AM Yu Kuai <yukuai1@xxxxxxxxxxxxxxx> wrote:

Hi,

在 2024/03/04 9:07, Yu Kuai 写道:

Hi,

在 2024/03/03 21:16, Xiao Ni 写道:

Hi all

There is a error report from lvm regression tests. The case is
lvconvert-raid-reshape-stripes-load-reload.sh. I saw this error when I
tried to fix dmraid regression problems too. In my patch set, after
reverting ad39c08186f8a0f221337985036ba86731d6aafe (md: Don't register
sync_thread for reshape directly), this problem doesn't appear.

Hi Kuai

How often did you see this tes failed? I'm running the tests for over
two days now, for 30+ rounds, and this test never fail in my VM.

I ran 5 times and it failed 2 times just now.

Take a quick look, there is still a path from raid10 that
MD_RECOVERY_FROZEN can be cleared, and in theroy this problem can be
triggered. Can you test the following patch on the top of this set?
I'll keep running the test myself.

Sure, I'll give the result later.

Hi all

It's not stable to reproduce this. After applying this raid10 patch it
failed once 28 times. Without the raid10 patch, it failed once 30
times, but it failed frequently this morning.

Hi all

After running 152 times with kernel 6.6, the problem can appear too.
So it can return the state of 6.6. This patch set can make this
problem appear quickly.

I verified in my VM that after testing 100+ times, this problem can both
triggered with v6.6 and v6.8-rc5 + this set.

I think we can merge this patchset, and figure out why the test can fail
later.

Thanks,
Kuai

Best Regards
Xiao

Regards
Xiao

Regards
Xiao

diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index a5f8419e2df1..7ca29469123a 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -4575,7 +4575,8 @@ static int raid10_start_reshape(struct mddev *mddev)
return 0;

abort:
- mddev->recovery = 0;
+ if (mddev->gendisk)
+ mddev->recovery = 0;
spin_lock_irq(&conf->device_lock);
conf->geo = conf->prev;
mddev->raid_disks = conf->geo.raid_disks;

Thanks,
Kuai

Thanks,
Kuai

I put the log in the attachment.

On Fri, Mar 1, 2024 at 6:03 PM Yu Kuai <yukuai1@xxxxxxxxxxxxxxx> wrote:

From: Yu Kuai <yukuai3@xxxxxxxxxx>

link to part1:
https://lore.kernel.org/all/CAPhsuW7u1UKHCDOBDhD7DzOVtkGemDz_QnJ4DUq_kSN-Q3G66Q@xxxxxxxxxxxxxx/

part1 contains fixes for deadlocks for stopping sync_thread

This set contains fixes:
- reshape can start unexpected, cause data corruption, patch 1,5,6;
- deadlocks that reshape concurrent with IO, patch 8;
- a lockdep warning, patch 9;

I'm runing lvm2 tests with following scripts with a few rounds now,

for t in `ls test/shell`; do
if cat test/shell/$t | grep raid &> /dev/null; then
make check T=shell/$t
fi
done

There are no deadlock and no fs corrupt now, however, there are still
four
failed tests:

### failed: [ndev-vanilla] shell/lvchange-raid1-writemostly.sh
### failed: [ndev-vanilla] shell/lvconvert-repair-raid.sh
### failed: [ndev-vanilla] shell/lvcreate-large-raid.sh
### failed: [ndev-vanilla] shell/lvextend-raid.sh

And failed reasons are the same:

## ERROR: The test started dmeventd (147856) unexpectedly

I have no clue yet, and it seems other folks doesn't have this issue.

Yu Kuai (9):
md: don't clear MD_RECOVERY_FROZEN for new dm-raid until resume
md: export helpers to stop sync_thread
md: export helper md_is_rdwr()
md: add a new helper reshape_interrupted()
dm-raid: really frozen sync_thread during suspend
md/dm-raid: don't call md_reap_sync_thread() directly
dm-raid: add a new helper prepare_suspend() in md_personality
dm-raid456, md/raid456: fix a deadlock for dm-raid456 while io
concurrent with reshape
dm-raid: fix lockdep waring in "pers->hot_add_disk"

drivers/md/dm-raid.c | 93 ++++++++++++++++++++++++++++++++++----------
drivers/md/md.c | 73 ++++++++++++++++++++++++++--------
drivers/md/md.h | 38 +++++++++++++++++-
drivers/md/raid5.c | 32 ++++++++++++++-
4 files changed, 196 insertions(+), 40 deletions(-)

--
2.39.2

.

.

Next message: Dhruva Gole: "Re: [PATCH 2/2] OPP: debugfs: Fix warning around icc_get_name()"
Previous message: Andy Shevchenko: "Re: [PATCH 4/4] iio: pressure: Add triggered buffer support for BMP280 driver"
In reply to: Xiao Ni: "Re: [PATCH -next 0/9] dm-raid, md/raid: fix v6.7 regressions part2"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]