Re: [PATCH -next 0/9] dm-raid, md/raid: fix v6.7 regressions part2
From: Xiao Ni
Date: Mon Mar 04 2024 - 06:06:49 EST
On Mon, Mar 4, 2024 at 4:27 PM Xiao Ni <xni@xxxxxxxxxx> wrote:
>
> On Mon, Mar 4, 2024 at 9:25 AM Xiao Ni <xni@xxxxxxxxxx> wrote:
> >
> > On Mon, Mar 4, 2024 at 9:24 AM Yu Kuai <yukuai1@xxxxxxxxxxxxxxx> wrote:
> > >
> > > Hi,
> > >
> > > On 2024/03/04 9:07, Yu Kuai wrote:
> > > > Hi,
> > > >
> > > > On 2024/03/03 21:16, Xiao Ni wrote:
> > > >> Hi all
> > > >>
> > > >> There is an error report from the lvm regression tests. The case is
> > > >> lvconvert-raid-reshape-stripes-load-reload.sh. I saw this error when I
> > > >> tried to fix dmraid regression problems too. In my patch set, after
> > > >> reverting ad39c08186f8a0f221337985036ba86731d6aafe (md: Don't register
> > > >> sync_thread for reshape directly), this problem doesn't appear.
> > > >
> >
> > Hi Kuai
> > > > How often did you see this test fail? I've been running the tests for over
> > > > two days now, for 30+ rounds, and this test has never failed in my VM.
> >
> > I ran it 5 times just now and it failed twice.
> >
> > >
> > > After a quick look, there is still a path in raid10 where
> > > MD_RECOVERY_FROZEN can be cleared, so in theory this problem can be
> > > triggered. Can you test the following patch on top of this set?
> > > I'll keep running the test myself.
> >
> > Sure, I'll give the result later.
>
> Hi all
>
> This does not reproduce reliably. After applying this raid10 patch it
> failed once in 28 runs. Without the raid10 patch, it failed once in 30
> runs, although it failed frequently this morning.
Hi all
After 152 runs on kernel 6.6, the problem appears there too.
So this patch set does return us to the 6.6 state; it just seems to
make this problem show up sooner.
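
For anyone who wants to compare failure rates the same way, here is a
minimal sketch of a repeat-run helper. count_failures is a hypothetical
name (it is not part of lvm2), and the sketch assumes that
"make check T=...", as used in the loop from the cover letter, exits
non-zero when the test fails.

```shell
# Hypothetical helper: run a command ROUNDS times and print "failed/rounds".
# Usage: count_failures ROUNDS COMMAND [ARGS...]
count_failures() {
    rounds=$1
    shift
    failed=0
    i=0
    while [ "$i" -lt "$rounds" ]; do
        # Only the exit status matters here; discard the command's output.
        if ! "$@" >/dev/null 2>&1; then
            failed=$((failed + 1))
        fi
        i=$((i + 1))
    done
    echo "$failed/$rounds"
}

# Example, run from the top of an lvm2 source tree (assumption: a non-zero
# exit status from "make check" means the test failed):
# count_failures 30 make check T=shell/lvconvert-raid-reshape-stripes-load-reload.sh
```

With a failure this rare, a few dozen rounds only give a rough estimate,
so rates from short runs on different kernels are hard to compare.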
Best Regards
Xiao
>
> Regards
> Xiao
> >
> > Regards
> > Xiao
> > >
> > > diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
> > > index a5f8419e2df1..7ca29469123a 100644
> > > --- a/drivers/md/raid10.c
> > > +++ b/drivers/md/raid10.c
> > > @@ -4575,7 +4575,8 @@ static int raid10_start_reshape(struct mddev *mddev)
> > >  	return 0;
> > >
> > >  abort:
> > > -	mddev->recovery = 0;
> > > +	if (mddev->gendisk)
> > > +		mddev->recovery = 0;
> > >  	spin_lock_irq(&conf->device_lock);
> > >  	conf->geo = conf->prev;
> > >  	mddev->raid_disks = conf->geo.raid_disks;
> > >
> > > Thanks,
> > > Kuai
> > > >
> > > > Thanks,
> > > > Kuai
> > > >
> > > >>
> > > >> I put the log in the attachment.
> > > >>
> > > >> On Fri, Mar 1, 2024 at 6:03 PM Yu Kuai <yukuai1@xxxxxxxxxxxxxxx> wrote:
> > > >>>
> > > >>> From: Yu Kuai <yukuai3@xxxxxxxxxx>
> > > >>>
> > > >>> link to part1:
> > > >>> https://lore.kernel.org/all/CAPhsuW7u1UKHCDOBDhD7DzOVtkGemDz_QnJ4DUq_kSN-Q3G66Q@xxxxxxxxxxxxxx/
> > > >>>
> > > >>>
> > > >>> part1 contains fixes for deadlocks for stopping sync_thread
> > > >>>
> > > >>> This set contains fixes:
> > > >>> - reshape can start unexpectedly, causing data corruption, patch 1,5,6;
> > > >>> - deadlocks when reshape runs concurrently with IO, patch 8;
> > > >>> - a lockdep warning, patch 9;
> > > >>>
> > > >>> I've been running the lvm2 tests with the following script for a few rounds now:
> > > >>>
> > > >>> for t in `ls test/shell`; do
> > > >>>         if cat test/shell/$t | grep raid &> /dev/null; then
> > > >>>                 make check T=shell/$t
> > > >>>         fi
> > > >>> done
> > > >>>
> > > >>> There are no deadlocks and no fs corruption now; however, there are still
> > > >>> four failed tests:
> > > >>>
> > > >>> ### failed: [ndev-vanilla] shell/lvchange-raid1-writemostly.sh
> > > >>> ### failed: [ndev-vanilla] shell/lvconvert-repair-raid.sh
> > > >>> ### failed: [ndev-vanilla] shell/lvcreate-large-raid.sh
> > > >>> ### failed: [ndev-vanilla] shell/lvextend-raid.sh
> > > >>>
> > > >>> And the failure reason is the same for all of them:
> > > >>>
> > > >>> ## ERROR: The test started dmeventd (147856) unexpectedly
> > > >>>
> > > >>> I have no clue yet, and it seems other folks don't have this issue.
> > > >>>
> > > >>> Yu Kuai (9):
> > > >>> md: don't clear MD_RECOVERY_FROZEN for new dm-raid until resume
> > > >>> md: export helpers to stop sync_thread
> > > >>> md: export helper md_is_rdwr()
> > > >>> md: add a new helper reshape_interrupted()
> > > >>> dm-raid: really frozen sync_thread during suspend
> > > >>> md/dm-raid: don't call md_reap_sync_thread() directly
> > > >>> dm-raid: add a new helper prepare_suspend() in md_personality
> > > >>> dm-raid456, md/raid456: fix a deadlock for dm-raid456 while io
> > > >>> concurrent with reshape
> > > >>> dm-raid: fix lockdep waring in "pers->hot_add_disk"
> > > >>>
> > > >>> drivers/md/dm-raid.c | 93 ++++++++++++++++++++++++++++++++++----------
> > > >>> drivers/md/md.c | 73 ++++++++++++++++++++++++++--------
> > > >>> drivers/md/md.h | 38 +++++++++++++++++-
> > > >>> drivers/md/raid5.c | 32 ++++++++++++++-
> > > >>> 4 files changed, 196 insertions(+), 40 deletions(-)
> > > >>>
> > > >>> --
> > > >>> 2.39.2
> > > >>>
> > > >
> > > >
> > > > .
> > > >
> > >