Re: raid0 + original layout + discard + different disk sizes => can corrupt data

From: Song Liu
Date: Tue Jun 20 2023 - 17:40:16 EST

Next message: Andres Freund: "Re: [PATCH v6 4/4] mm/khugepaged: maintain page cache uptodate flag"
Previous message: Jessica Zhang: "Re: [PATCH 2/3] drm/msm/dpu: Set DATABUS_WIDEN on command mode encoders"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

Hi Jason,

On Tue, Jun 20, 2023 at 2:00 PM Jason Baron <jbaron@xxxxxxxxxx> wrote:
>
> Hi,
>
> SUMMARY
> =======
>
> We've found that using raid0 with the 'original' layout and discard enabled
> with different disk sizes (such that at least two zones are created) can
> result in data corruption. This is due to the fact that the discard
> handling in 'raid0_handle_discard()' assumes the 'alternate' layout.
>
> More specifically, while multiple zones are necessary to create the
> corruption, the corruption may not occur with multiple zones if they
> layout in such a way the layout matches what the 'alternate' layout
> would have produced. Thus, not all raid0 devices with the 'original'
> layout, different size disks and discard enabled will encounter this
> corruption.
>
> BACKGROUND
> ==========
>
> The 3.14 kernel inadvertleny changed the raid0 disk layout for different
> size disks. Thus, running a pre-3.14 kernel and post-3.14 kernel on the
> same raid0 array could corrupt data. This lead to the creation of the
> 'original' layout (to match the pre-3.14 layout) and the 'alternate' layout
> (to match the post 3.14 layout) in the 5.4 kernel time frame and an option
> to tell the kernel which layout to use (since it couldn't be autodetected)
> [1]. The mdadm tool also added support for writing the layout to the raid0
> superblock which the kernel honors. The mdadm tool defaults to the original
> layout unless --layout is given (--layout=alternate or --layout=original).
>
> REPRODUCER (pseudocode)
> =======================
>
> The below steps were the simplest we found to consistently reproduced the
> corruption:
>
> 1. create raid0 array with different size disks using original layout
> 2. mkfs
> 3. mount -o discard
> 4. create lots of files
> 5. remove 1/2 the files
> 6. fstrim -a (or just the mount point for the raid0 array)
> 7. umount
> 8. fsck -fn /dev/md1
>
> Here is an example corruption, though we've seen many others:
>
> fsck from util-linux 2.34
> e2fsck 1.45.5 (07-Jan-2020)
> Pass 1: Checking inodes, blocks, and sizes
> Pass 2: Checking directory structure
> Directory inode 24557, block #0, offset 0: directory has no checksum.
> Fix? no
>
> Directory inode 24557, block #0, offset 0: directory corrupted
> Salvage? no
>
> e2fsck: aborted
>
> /dev/md1: ********** WARNING: Filesystem still has errors **********
>
>
> VULNERABLE VERSIONS
> ===================
>
> We've confirmed the corruption with the latest 6.4-rc7 kernel. Kernels that
> support the 'original' layout could have this issue. The 'original' layout
> was added to the 5.4+ kernels, although the 'original' layout may have been
> backported to earlier kernels.
>
> Kernels versions from 3.14 to 5.4 that only have support for the 'alternate'
> layout should be ok. I'm not sure about pre-3.14 kernels that only have the
> 'original' layout. One would have to go back and see if they support
> discard and if that support is correct.
>
> WORKAROUNDS
> ===========
>
> Until a proper code fix is implemented in the kernel and updated, valid
> workarounds we have found are any 1 of the following:
>
> 1. disable discard and fstrim from running on the raid0 array
> 2. create new arrays only with the 'alternate' layout, the 'mdadm'
> command understands mdadm --layout=alternate
> 3. ensure there is only 1 raid0 zone (if disks are all the same size)
>
> That said, you may already have 'silent corruption' if data is not covered
> by a checksum.
>
> FIXING
> =======
>
> It may make sense to have the kernel just disable discard for raid0 with
> the 'original' layout, and update mdadm to default to 'alternate' and treat
> the 'original' layout more as a deprecated format? I'm curious as to why
> mdadm is using 'original' layout as the default for new arrays?
>
> Alternatively, we could patch the 'original' layout and add proper discard
> support.

Thanks for the detailed report and analysis!

We will look into fixing it.

Song

>
> THANKS
> ======
>
> Thanks to Michael Zhivich, Debabrata Banerjee and Jeff Dike for helping uncover this issue.
>
> Thanks,
>
> -Jason
>
> [1] https://wiki.ubuntu.com/Kernel/Raid0LayoutMigration

Next message: Andres Freund: "Re: [PATCH v6 4/4] mm/khugepaged: maintain page cache uptodate flag"
Previous message: Jessica Zhang: "Re: [PATCH 2/3] drm/msm/dpu: Set DATABUS_WIDEN on command mode encoders"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]