Re: Filesystem corruption when adding a new RAID device (delayed-resync, write-mostly)

From: Paul E Luse
Date: Wed Jul 24 2024 - 17:19:19 EST


On Wed, 24 Jul 2024 22:35:49 +0200
Mateusz Jończyk <mat.jonczyk@xxxxx> wrote:

> > On 22.07.2024 at 07:39, Mateusz Jończyk wrote:
> >> On 20.07.2024 at 16:47, Mateusz Jończyk wrote:
> >> Hello,
> >>
> >> In my laptop, I used to have two RAID1 arrays on top of NVMe and
> >> SATA SSD drives: /dev/md0 for /boot (not partitioned), /dev/md1
> >> for remaining data (LUKS
> >> + LVM + ext4). For performance, I have marked the RAID component
> >> device of /dev/md1 that resides on the SATA SSD as write-mostly, which
> >> "means that the 'md' driver will avoid reading from these devices
> >> if at all possible" (man mdadm).
> >>
> >> Recently, the NVMe drive started having problems (PCI AER errors
> >> and the controller disappearing), so I removed it from the arrays
> >> and wiped it. However, I have reseated the drive in the M.2 socket
> >> and this apparently fixed it (verified with tests).
> >>
> >>     $ cat /proc/mdstat
> >>     Personalities : [raid1] [linear] [multipath] [raid0] [raid6] [raid5] [raid4] [raid10]
> >>     md1 : active raid1 sdb5[1](W)
> >>           471727104 blocks super 1.2 [2/1] [_U]
> >>           bitmap: 4/4 pages [16KB], 65536KB chunk
> >>
> >>     md2 : active (auto-read-only) raid1 sdb6[3](W) sda1[2]
> >>           3142656 blocks super 1.2 [2/2] [UU]
> >>           bitmap: 0/1 pages [0KB], 65536KB chunk
> >>
> >>     md0 : active raid1 sdb4[3]
> >>           2094080 blocks super 1.2 [2/1] [_U]
> >>          
> >>     unused devices: <none>
> >>
> >> (md2 was used just for testing, ignore it).
> >>
> >> Today, I tried to add the drive back to the arrays using a script
> >> that executed the following commands in quick succession:
> >>
> >>     mdadm /dev/md0 --add --readwrite /dev/nvme0n1p2
> >>     mdadm /dev/md1 --add --readwrite /dev/nvme0n1p3
> >>
> >> This was on Linux 6.10.0, patched with my previous patch:
> >>
> >>     https://lore.kernel.org/linux-raid/20240711202316.10775-1-mat.jonczyk@xxxxx/
> >>
> >> (which fixes a regression in the kernel and allows it to start
> >> /dev/md1 with a single drive in write-mostly mode).
> >> In the background, I was running "rdiff-backup --compare", which was
> >> comparing my array contents against a backup attached via USB.
> >>
> >> This, however, resulted in mayhem: I was unable to start any
> >> program (each attempt failed with an input/output error), and so on.
> >> I used SysRq + C to save a kernel log:
> >>
> > Hello,
> >
> > It is possible that my second SSD has some problems and that the
> > high read activity during the RAID resync triggered them. Reads from
> > that drive are now very slow (between 10 and 30 MB/s), which
> > suggests that something is not OK.
>
> Hello,
>
> Unfortunately, hardware failure does not seem to be the cause.
>
> I did test it again on 6.10, twice, and in both cases I got
> filesystem corruption (but not as severe).
>
> On Linux 6.1.96 it seems to work well (I also did two tries).
>
> Please note: in my tests, I was using a RAID component device with
> the write-mostly bit set. This setup does not work on 6.9+ out of the
> box and requires the following patch:
>
> commit 36a5c03f23271 ("md/raid1: set max_sectors during early return
> from choose_slow_rdev()")
>
> that is in master now.
>
> It is also heading into stable, which I'm going to interrupt.

Hi Mateusz,

I'm pretty interested in what is happening here, especially as it
relates to write-mostly. A couple of questions for you:

1) Are you able to find a simpler reproduction for this, for example
without mixing SATA and NVMe? Maybe just use two known-good NVMe SSDs
and follow your steps to repro? (A rough sketch of what I have in mind
is below, after question 2.)

2) I don't fully understand your last two statements, maybe you can
clarify? With your max_sectors patch applied, does it pass or fail? If
it passes, what do you mean by "I'm going to interrupt"? It sounds like
you mean the patch doesn't work and you are trying to stop it??
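
For (1), something along these lines might make a self-contained repro
on loop devices, without needing real SSDs. This is only an untested
sketch: the device names, sizes, array name and mount point are
placeholders, and the background read load is just a stand-in for your
rdiff-backup run:

    # two small backing files as stand-ins for the two drives
    truncate -s 4G disk0.img disk1.img
    losetup /dev/loop0 disk0.img
    losetup /dev/loop1 disk1.img

    # RAID1 with a bitmap and the second member marked write-mostly,
    # roughly matching your md1 layout
    mdadm --create /dev/md10 --level=1 --raid-devices=2 \
        --bitmap=internal /dev/loop0 --write-mostly /dev/loop1
    mkfs.ext4 /dev/md10
    mount /dev/md10 /mnt/test
    cp -a /usr/share /mnt/test/    # some data to read back later

    # degrade the array so only the write-mostly member remains,
    # mirroring the state after you removed the NVMe drive
    mdadm /dev/md10 --fail /dev/loop0
    mdadm /dev/md10 --remove /dev/loop0
    mdadm --zero-superblock /dev/loop0   # "wipe" it, as you did

    # heavy reads in the background while re-adding the device
    ( find /mnt/test -type f -exec cat {} + > /dev/null ) &
    mdadm /dev/md10 --add --readwrite /dev/loop0

    # once the recovery finishes, check for corruption
    mdadm --wait /dev/md10
    wait
    umount /mnt/test && fsck.ext4 -f /dev/md10

If something like that reproduces the corruption on 6.10 but not on
6.1, it would narrow things down a lot and give us something to bisect
with.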

thanks
Paul
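
P.S. In case it is useful while experimenting: if I remember the sysfs
interface right, the write-mostly bit can also be flipped on a member
of a running array (the md1/sdb5 path below is just assumed from your
mdstat output):

    # set / clear write-mostly at runtime via sysfs
    echo writemostly > /sys/block/md1/md/dev-sdb5/state
    echo -writemostly > /sys/block/md1/md/dev-sdb5/state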

>
> Greetings,
> Mateusz
>
>