RAID5->RAID6 reshape remains stuck at 0% (does nothing, not even start)
From: David Madore
Date: Tue Sep 29 2020 - 21:05:34 EST
Dear list,
I'm trying to reshape a 3-disk RAID5 array to a 4-disk RAID6 array (of
the same total size and per-device size) using Linux kernel 4.9.237 on
x86_64. I understand that this reshaping operation is supposed to be
supported. But it appears perpetually stuck at 0% with no operation
taking place whatsoever (the slices are unchanged apart from their
metadata, the backup file contains only zeroes, and nothing happens).
I wonder if this is a known kernel bug, or what else could explain it,
and I have no idea how to debug this sort of thing.
Here are some details on exactly what I've been doing. I'll be using
loopbacks to illustrate, but I've done this on real partitions and
there was no difference.
## Create some empty loop devices:
for i in 0 1 2 3 ; do dd if=/dev/zero of=test-${i} bs=1024k count=16 ; done
for i in 0 1 2 3 ; do losetup /dev/loop${i} test-${i} ; done
## Make a RAID array out of the first three:
mdadm --create /dev/md/test --level=raid5 --chunk=256 --name=test \
  --metadata=1.0 --raid-devices=3 /dev/loop{0,1,2}
## Populate it with some content, just to see what's going on:
for i in $(seq 0 63) ; do
  printf "This is chunk %d (0x%x).\n" $i $i \
    | dd of=/dev/md/test bs=256k seek=$i
done
## Now try to reshape the array from 3-way RAID5 to 4-way RAID6:
mdadm --manage /dev/md/test --add-spare /dev/loop3
mdadm --grow /dev/md/test --level=6 --raid-devices=4 \
  --backup-file=test-reshape.backup
...and then nothing happens. /proc/mdstat (where the array appears as
md112) reports no progress whatsoever:
md112 : active raid6 loop3[4] loop2[3] loop1[1] loop0[0]
32256 blocks super 1.0 level 6, 256k chunk, algorithm 18 [4/3] [UUU_]
[>....................] reshape = 0.0% (1/16128) finish=1.0min speed=244K/sec
The loop file contents are unchanged except for the metadata
superblock, the backup file is entirely empty, and no activity
whatsoever is happening.
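To tell a genuinely stalled reshape from a merely slow one, the position counters in the mdstat progress line can be sampled twice and compared. The following is a minimal sketch, not part of the original report: the helper name reshape_counters is hypothetical, and it assumes the progress line has the same shape as the one quoted above.

```shell
# Hypothetical helper: read mdstat-style text on stdin and print the
# "done total" block counters from any "reshape = N% (done/total)" line.
reshape_counters() {
    sed -n 's#.*reshape *= *[0-9.]*% *(\([0-9]*\)/\([0-9]*\)).*#\1 \2#p'
}

# Demonstration on the progress line quoted above:
printf '%s\n' \
    '[>....................] reshape = 0.0% (1/16128) finish=1.0min speed=244K/sec' \
    | reshape_counters
# prints: 1 16128
```

On a live system the same helper could be fed with "reshape_counters < /proc/mdstat" a minute apart; identical output both times would confirm that the position is not advancing.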
Actually, further investigation shows that the array is in fact
operational as a RAID6 array, but one where the Q-syndrome is stuck in
the last device: writing data to the md device (e.g., by repopulating
it with the same command as above) does cause loop3 to be updated as
expected for such a layout. It's just the reshaping which doesn't
take place (or indeed begin).
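One way to make the "unchanged" and "updated" observations reproducible is to copy each backing file before an operation and count the bytes that differ afterwards. This is only a sketch under assumptions: the helper name count_diff_bytes and the ".before" file names are hypothetical, and the backing files are assumed to keep the same length (as they do for loop devices).

```shell
# Hypothetical helper: print the number of byte positions at which two
# files of equal length differ (0 means identical content).
count_diff_bytes() {
    cmp -l "$1" "$2" | wc -l | tr -d ' '
}

# Possible usage around an operation on the array (file names assumed):
#   cp test-3 test-3.before
#   ... run the mdadm --grow command or rewrite the array contents ...
#   sync
#   count_diff_bytes test-3.before test-3
```

A small count confined to the metadata area would match the stalled-reshape observation, while a large count on test-3 after rewriting the array would match the Q-syndrome being written to the last device.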
For completeness, here's what mdadm --detail /dev/md/test looks like
before the reshape, in my example:
/dev/md/test:
Version : 1.0
Creation Time : Wed Sep 30 02:42:30 2020
Raid Level : raid5
Array Size : 32256 (31.50 MiB 33.03 MB)
Used Dev Size : 16128 (15.75 MiB 16.52 MB)
Raid Devices : 3
Total Devices : 4
Persistence : Superblock is persistent
Update Time : Wed Sep 30 02:44:21 2020
State : clean
Active Devices : 3
Working Devices : 4
Failed Devices : 0
Spare Devices : 1
Layout : left-symmetric
Chunk Size : 256K
Name : vega.stars:test (local to host vega.stars)
UUID : 30f40e34:b9a52ff0:75c8b063:77234832
Events : 20
    Number   Major   Minor   RaidDevice State
       0       7        0        0      active sync   /dev/loop0
       1       7        1        1      active sync   /dev/loop1
       3       7        2        2      active sync   /dev/loop2
       4       7        3        -      spare   /dev/loop3
...and here's what it looks like after the attempted reshape has
started (or rather, refused to start):
/dev/md/test:
Version : 1.0
Creation Time : Wed Sep 30 02:42:30 2020
Raid Level : raid6
Array Size : 32256 (31.50 MiB 33.03 MB)
Used Dev Size : 16128 (15.75 MiB 16.52 MB)
Raid Devices : 4
Total Devices : 4
Persistence : Superblock is persistent
Update Time : Wed Sep 30 02:44:54 2020
State : clean, degraded, reshaping
Active Devices : 3
Working Devices : 4
Failed Devices : 0
Spare Devices : 1
Layout : left-symmetric-6
Chunk Size : 256K
Reshape Status : 0% complete
New Layout : left-symmetric
Name : vega.stars:test (local to host vega.stars)
UUID : 30f40e34:b9a52ff0:75c8b063:77234832
Events : 22
    Number   Major   Minor   RaidDevice State
       0       7        0        0      active sync   /dev/loop0
       1       7        1        1      active sync   /dev/loop1
       3       7        2        2      active sync   /dev/loop2
       4       7        3        3      spare rebuilding   /dev/loop3
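The sizes in both outputs are consistent with the "same total size" premise: a RAID5 array of n devices stores n-1 devices' worth of data, a RAID6 array stores n-2. A quick arithmetic check, using the Used Dev Size reported above (in KiB); the variable names are chosen here for illustration only:

```shell
dev_size=16128                          # "Used Dev Size" from mdadm --detail
raid5_size=$(( (3 - 1) * dev_size ))    # 3-disk RAID5: one device of parity
raid6_size=$(( (4 - 2) * dev_size ))    # 4-disk RAID6: two devices of parity
echo "raid5: $raid5_size  raid6: $raid6_size"
# prints: raid5: 32256  raid6: 32256
```

Both values match the reported Array Size of 32256, so the reshape only has to relocate parity, not change the usable capacity.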
I also tried writing "frozen" and then "resync" to the
/sys/block/md112/md/sync_action file with no further results.
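For reference, a sketch of what those writes might look like, together with a dump of neighbouring md sysfs attributes that can help show why a reshape is not advancing. The helper name md_sync_state is hypothetical and the choice of attributes is an assumption, not something taken from the report; only the two sync_action writes are described above.

```shell
# Hypothetical helper: print selected md sysfs attributes from the
# directory given as $1 (e.g. /sys/block/md112/md). Attributes that are
# missing or unreadable are reported rather than treated as errors.
md_sync_state() {
    for attr in sync_action sync_completed sync_min sync_max reshape_position; do
        if [ -r "$1/$attr" ]; then
            printf '%s: %s\n' "$attr" "$(cat "$1/$attr")"
        else
            printf '%s: (not readable)\n' "$attr"
        fi
    done
}

# The writes described above, followed by a state dump (needs root):
#   echo frozen > /sys/block/md112/md/sync_action
#   echo resync > /sys/block/md112/md/sync_action
#   md_sync_state /sys/block/md112/md
```

Taking the sysfs directory as a parameter keeps the helper usable for any md device, not just md112.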
I welcome any suggestions on how to investigate, work around, or fix
this problem.
Happy hacking,
--
David A. Madore
( http://www.madore.org/~david/ )