Re: 3.10.1: echo repair > sync_action causes hang on RAID-1 (2 x SSD)

From: NeilBrown
Date: Mon Jul 29 2013 - 01:57:17 EST


On Fri, 26 Jul 2013 05:56:51 -0400 "Justin Piszcz" <jpiszcz@xxxxxxxxxxxxxxx>
wrote:

>
>
> -----Original Message-----
> From: NeilBrown [mailto:neilb@xxxxxxx]
> Sent: Thursday, July 25, 2013 8:36 PM
> To: Justin Piszcz
> Cc: linux-kernel@xxxxxxxxxxxxxxx; linux-raid@xxxxxxxxxxxxxxx
> Subject: Re: 3.10.1: echo repair > sync_action causes hang on RAID-1 (2 x SSD)
>
> On Thu, 25 Jul 2013 19:10:50 -0400 "Justin Piszcz" <jpiszcz@xxxxxxxxxxxxxxx>
> wrote:
>
> > Did the fix by chance make it into 3.10.3?
>
> No, it looks like it missed again. I gather there was a large inflow of
> patches for -stable in the 3.11-rc1 merge window and Greg has been processing
> them in batches. Hopefully in 3.10.4.
>
> The relevant patch is commit 30bc9b53878a9921b02e3 in mainline.
>
> NeilBrown
>
> --
>
> Method to get patch via git and patch kernel:
>
> $ git clone git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git
> $ git log |grep 30bc9b53878a9921b02e3
> commit 30bc9b53878a9921b02e3b5bc4283ac1c6de102a
> $ git show 30bc9b53878a9921b02e3b5bc4283ac1c6de102a > /tmp/a
> # patch -p1 < /tmp/a
> patching file drivers/md/raid1.c
> Hunk #1 succeeded at 1848 (offset -1 lines).
> Hunk #2 succeeded at 1886 (offset -1 lines).
> Hunk #3 succeeded at 1915 (offset -1 lines).
>
> Rebooted and tested: success, thanks!
>
> One follow-up question:
> $ cat /sys/block/md1/md/mismatch_cnt
> 314112
> -> On a live RAID-1 (root filesystem) without swap, is it normal to have
> such a high mismatch_cnt even after a repair?
>
> First repair:
> Fri Jul 26 05:30:47 EDT 2013: The meta-device /dev/md1 has mismatch_cnt 314112 sectors.
> Second repair:
> Fri Jul 26 05:30:47 EDT 2013: The meta-device /dev/md1 has mismatch_cnt 313600 sectors.

Those two lines have exactly the same timestamp and array name but different
mismatch counts. That is very strange.

Did you run two consecutive 'repair's on the same array, both with the patched
kernel? If so, and the second mismatch_cnt wasn't zero (or close to it,
maybe), then something is definitely wrong.
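
For anyone wanting to repeat that test, here is a minimal sketch and not a
definitive procedure. It assumes the array is md1, as in the path quoted
above, and that md reports "idle" in sync_action once a pass has finished.
The helper names (mismatch_cnt, wait_idle, run_pass), the POLL_SECS default
and the one-second pause after starting a pass are all illustrative
assumptions, not anything from the patch.

```shell
#!/bin/sh
# Sketch only: helper names and defaults are hypothetical.
# MD_DIR defaults to the array quoted above; override it for another array.
MD_DIR="${MD_DIR:-/sys/block/md1/md}"
POLL_SECS="${POLL_SECS:-10}"

# Print the current mismatch count for the array.
mismatch_cnt() {
    cat "$MD_DIR/mismatch_cnt"
}

# Block until md reports no check/repair/resync in progress.
wait_idle() {
    while [ "$(cat "$MD_DIR/sync_action")" != "idle" ]; do
        sleep "$POLL_SECS"
    done
}

# Run one pass ("check" or "repair") and log the count it ends with.
run_pass() {
    wait_idle
    echo "$1" > "$MD_DIR/sync_action"
    sleep 1    # assumption: give md a moment to leave "idle"
    wait_idle
    echo "$(date): $1 on $MD_DIR finished, mismatch_cnt $(mismatch_cnt)"
}
```

Sourced in a root shell, calling "run_pass repair" twice in a row logs the
count after each pass, each with its own timestamp, which makes the two
results easy to tell apart.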

NeilBrown


>
> Should I be concerned?
>
>
> Testing the patch:
>
> Personalities : [raid1]
> md1 : active raid1 sdc2[0] sdb2[1]
> 233381376 blocks [2/2] [UU]
> [>....................] check = 0.3% (838976/233381376) finish=9.2min speed=419488K/sec
>
> md0 : active raid1 sdc1[0] sdb1[1]
> 1048512 blocks [2/2] [UU]
>
> Personalities : [raid1]
> md1 : active raid1 sdc2[0] sdb2[1]
> 233381376 blocks [2/2] [UU]
> [===============>.....] check = 77.5% (180889856/233381376) finish=2.5min speed=342654K/sec
>
> md0 : active raid1 sdc1[0] sdb1[1]
> 1048512 blocks [2/2] [UU]
>
> Personalities : [raid1]
> md1 : active raid1 sdc2[0] sdb2[1]
> 233381376 blocks [2/2] [UU]
>
> md0 : active raid1 sdc1[0] sdb1[1]
> 1048512 blocks [2/2] [UU]
>
>
> Justin.
>
