XFS metadata CRC errors on zram block device on ppc64le architecture

From: Dusty Mabe
Date: Tue Aug 01 2023 - 23:31:48 EST


In Fedora CoreOS we found an issue with the interaction between an XFS filesystem and a zram block device on ppc64le:

- https://github.com/coreos/fedora-coreos-tracker/issues/1489
- https://bugzilla.redhat.com/show_bug.cgi?id=2221314

The dmesg output from the failing mount shows metadata CRC errors followed by a filesystem shutdown:

```
[ 3247.206007] XFS (zram0): Mounting V5 Filesystem 0b7d6149-614c-4f4c-9a1f-a80a9810f58f
[ 3247.210781] XFS (zram0): Metadata CRC error detected at xfs_agf_read_verify+0x108/0x150 [xfs], xfs_agf block 0x80008
[ 3247.211121] XFS (zram0): Unmount and run xfs_repair
[ 3247.211198] XFS (zram0): First 128 bytes of corrupted metadata buffer:
[ 3247.211293] 00000000: fe ed ba be 00 00 00 00 00 00 00 02 00 00 00 00 ................
[ 3247.211405] 00000010: 00 00 00 00 00 00 00 18 00 00 00 01 00 00 00 00 ................
[ 3247.211515] 00000020: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
[ 3247.211625] 00000030: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
[ 3247.211735] 00000040: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
[ 3247.211842] 00000050: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
[ 3247.211951] 00000060: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
[ 3247.212063] 00000070: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
[ 3247.212171] XFS (zram0): metadata I/O error in "xfs_read_agf+0xb4/0x180 [xfs]" at daddr 0x80008 len 8 error 74
[ 3247.212485] XFS (zram0): Error -117 reserving per-AG metadata reserve pool.
[ 3247.212497] XFS (zram0): Corruption of in-memory data (0x8) detected at xfs_fs_reserve_ag_blocks+0x1e0/0x220 [xfs] (fs/xfs/xfs_fsops.c:587). Shutting down filesystem.
[ 3247.212828] XFS (zram0): Please unmount the filesystem and rectify the problem(s)
[ 3247.212943] XFS (zram0): Ending clean mount
[ 3247.212970] XFS (zram0): Error -5 reserving per-AG metadata reserve pool.
```

The issue can be reproduced easily with a simple script:

```
[root@p8 ~]# cat test.sh
#!/bin/bash
set -eux -o pipefail
# load the zram module without creating any static devices
modprobe zram num_devices=0
# hot-add a device and capture its number (e.g. 0 -> /dev/zram0)
read dev < /sys/class/zram-control/hot_add
# size the device, create an XFS filesystem on it, and mount it
echo 10G > /sys/block/zram"${dev}"/disksize
mkfs.xfs /dev/zram"${dev}"
mkdir -p /tmp/foo
mount -t xfs /dev/zram"${dev}" /tmp/foo
```
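
For repeated runs the device can be torn down again through the same zram-control interface (a minimal sketch, reusing the $dev number from the script above):

```
umount /tmp/foo
# drop the device contents and disksize
echo 1 > /sys/block/zram"${dev}"/reset
# remove the hot-added device
echo "${dev}" > /sys/class/zram-control/hot_remove
```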

We ran a kernel bisect and narrowed it down to the offending commit, af8b04c63708:

```
[root@ibm-p8-kvm-03-guest-02 linux]# git bisect good
af8b04c63708fa730c0257084fab91fb2a9cecc4 is the first bad commit
commit af8b04c63708fa730c0257084fab91fb2a9cecc4
Author: Christoph Hellwig <hch@xxxxxx>
Date: Tue Apr 11 19:14:46 2023 +0200

zram: simplify bvec iteration in __zram_make_request

bio_for_each_segment synthetize bvecs that never cross page boundaries, so
don't duplicate that work in an inner loop.

Link: https://lkml.kernel.org/r/20230411171459.567614-5-hch@xxxxxx
Signed-off-by: Christoph Hellwig <hch@xxxxxx>
Reviewed-by: Sergey Senozhatsky <senozhatsky@xxxxxxxxxxxx>
Acked-by: Minchan Kim <minchan@xxxxxxxxxx>
Cc: Jens Axboe <axboe@xxxxxxxxx>
Signed-off-by: Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx>

drivers/block/zram/zram_drv.c | 42 +++++++++++-------------------------------
1 file changed, 11 insertions(+), 31 deletions(-)
```
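
For reference, the bisect was the standard workflow driven by the reproducer; a rough sketch follows (the good/bad endpoints here are illustrative, not the exact tags we used):

```
git bisect start
git bisect bad  v6.4   # illustrative: a kernel where test.sh trips the CRC errors
git bisect good v6.3   # illustrative: a kernel where test.sh mounts cleanly
# then for each step: build and boot the candidate kernel, run test.sh,
# and mark the result
git bisect good        # mount succeeded, no XFS CRC errors in dmesg
git bisect bad         # mount failed with the errors above
```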

Any ideas on how to fix the problem?

Thanks!
Dusty