Re: [REGRESSION] block: virtio-blk + LVM raid1 spurious sector-0 read failures on libaio/threads submit since 5ff3f74e145a ("block: simplify direct io validity check")

From: Vjaceslavs Klimovs

Date: Sun May 17 2026 - 18:35:08 EST

Yes, that was exactly it. The patch fixes raid1 logical volumes
(dm-raid on top of md/raid1). Since it only touches md/raid1,raid10,
legacy mirror logical volumes (dm-mirror, drivers/md/dm-raid1.c)
still oops:

[ 2.168054] device-mapper: raid1: Mirror read failed from 252:0. Trying alternative device.
[ 2.169241] BUG: unable to handle page fault for address: fffff580045f4bc8
[ 2.170256] #PF: supervisor read access in kernel mode
[ 2.170997] #PF: error_code(0x0000) - not-present page
[ 2.171706] PGD 7ff9d067 P4D 7ff9d067 PUD 7ff9c067 PMD 0
[ 2.172433] Oops: Oops: 0000 [#1] SMP PTI
[ 2.173003] CPU: 0 UID: 0 PID: 11 Comm: kworker/0:1 Not tainted 6.18.29+ #19 PREEMPT(lazy)
[ 2.174118] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-20250108_150619-localhost 04/01/2014
[ 2.175472] Workqueue: kmirrord do_mirror
[ 2.176040] RIP: 0010:bio_add_page+0x8c/0x340
[ 2.176676] Code: 07 4d 8b 48 08 41 f6 c1 01 0f 85 d6 01 00 00 0f 1f 44 00 00 4d 89 c1 49 8b 11 48 c1 ea 33 83 e2 07 83 fa 04 0f 84 bf 00 00 00 <48> 8b 56 08 4c 8d 4a ff f6 c2 01 75 08 0f 1f 44 00 00 49 89 f1 49
[ 2.179169] RSP: 0018:ffffcea500063bc8 EFLAGS: 00010293
[ 2.179933] RAX: 0000000000000001 RBX: ffff8d53149af400 RCX: 0000000000000580
[ 2.180947] RDX: 0000000000000001 RSI: fffff580045f4bc0 RDI: ffff8d53149af488
[ 2.181969] RBP: 0000000000000000 R08: fffff580005f4c00 R09: fffff580005f4c00
[ 2.182978] R10: ffffcea500063c14 R11: 0000000000000a80 R12: ffff8d5303192a80
[ 2.183997] R13: ffffcea500063c20 R14: 0000000000000001 R15: ffffcea500063cf8
[ 2.185022] FS: 0000000000000000(0000) GS:ffff8d53ed4d5000(0000) knlGS:0000000000000000
[ 2.186180] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 2.187035] CR2: fffff580045f4bc8 CR3: 0000000002c44002 CR4: 0000000000372ef0
[ 2.188047] Call Trace:
[ 2.188417] <TASK>
[ 2.188756] do_region+0x21d/0x270
[ 2.189313] dispatch_io+0xf1/0x150
[ 2.189832] ? __pfx_bio_get_page+0x10/0x10
[ 2.190424] ? __pfx_bio_next_page+0x10/0x10
[ 2.191046] dm_io+0x136/0x240
[ 2.191503] ? __pfx_read_callback+0x10/0x10
[ 2.192108] ? __pfx_bio_get_page+0x10/0x10
[ 2.192708] ? __pfx_bio_next_page+0x10/0x10
[ 2.193319] do_reads+0x13e/0x210
[ 2.193807] ? __pfx_read_callback+0x10/0x10
[ 2.194411] do_mirror+0x117/0x2a0
[ 2.194912] process_one_work+0x18d/0x340
[ 2.195508] worker_thread+0x196/0x300
[ 2.196022] ? __pfx_worker_thread+0x10/0x10
[ 2.196617] kthread+0xfc/0x240
[ 2.197073] ? __pfx_kthread+0x10/0x10
[ 2.197606] ? __pfx_kthread+0x10/0x10
[ 2.198116] ret_from_fork+0x158/0x170
[ 2.198645] ? __pfx_kthread+0x10/0x10
[ 2.199161] ret_from_fork_asm+0x1a/0x30
[ 2.199736] </TASK>
[ 2.200053] Modules linked in:
[ 2.200493] CR2: fffff580045f4bc8
[ 2.200951] ---[ end trace 0000000000000000 ]---
[ 2.201599] RIP: 0010:bio_add_page+0x8c/0x340
[ 2.202193] Code: 07 4d 8b 48 08 41 f6 c1 01 0f 85 d6 01 00 00 0f 1f 44 00 00 4d 89 c1 49 8b 11 48 c1 ea 33 83 e2 07 83 fa 04 0f 84 bf 00 00 00 <48> 8b 56 08 4c 8d 4a ff f6 c2 01 75 08 0f 1f 44 00 00 49 89 f1 49
[ 2.204690] RSP: 0018:ffffcea500063bc8 EFLAGS: 00010293
[ 2.205390] RAX: 0000000000000001 RBX: ffff8d53149af400 RCX: 0000000000000580
[ 2.206368] RDX: 0000000000000001 RSI: fffff580045f4bc0 RDI: ffff8d53149af488
[ 2.207333] RBP: 0000000000000000 R08: fffff580005f4c00 R09: fffff580005f4c00
[ 2.208297] R10: ffffcea500063c14 R11: 0000000000000a80 R12: ffff8d5303192a80
[ 2.209257] R13: ffffcea500063c20 R14: 0000000000000001 R15: ffffcea500063cf8
[ 2.210265] FS: 0000000000000000(0000) GS:ffff8d53ed4d5000(0000) knlGS:0000000000000000
[ 2.211391] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 2.212201] CR2: fffff580045f4bc8 CR3: 0000000002c44002 CR4: 0000000000372ef0
[ 2.213196] Kernel panic - not syncing: Fatal exception
[ 2.214313] Kernel Offset: 0xc200000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
[ 2.215981] Rebooting in 10 seconds..

On Fri, May 15, 2026 at 10:10 PM Thorsten Leemhuis
<regressions@xxxxxxxxxxxxx> wrote:
>
> On 5/15/26 18:52, Vjaceslavs Klimovs wrote:
> > Summary
> > -------
> > On v6.18, starting a libvirt/QEMU guest with virtio-blk backed by an
> > LVM "--type raid1" LV (drivers/md/dm-raid.c stacked on
> > drivers/md/raid1.c) makes md/raid1 register read failures at LV
> > sector 0 within seconds of "virsh start" and mark rimage_0 Faulty
> > once max_corrected_read_errors (default 20) is exceeded. Reads
> > succeed via the redirect path so guests boot, but every guest disk
> > ends up degraded on every VM start. Same workload on legacy
> > "--type mirror" (drivers/md/dm-raid1.c) crashes the host: a
> > zero-length READ reaches the NVMe controller, is rejected with
> > "Invalid Field in Command", and the dm-mirror recovery path oopses.
>
> That sounds somewhat like
> https://lore.kernel.org/all/2982107.4sosBPzcNG@electra/
>
> Have you tried latest 7.1-rc? It contains a fix for the problem
> mentioned in said thread: f7b24c7b41f23b ("md/raid1,raid10: don't fail
> devices for invalid IO errors") [v7.1-rc2]
>
> Ciao, Thorsten
>
> > Symptom on dm-raid raid1 (post --type raid1)
> > --------------------------------------------
> > Per LV, at virsh start, in host dmesg:
> >
> > kernel: raid1_end_read_request: 95 callbacks suppressed
> > kernel: raid1_read_request: 95 callbacks suppressed
> > kernel: md/raid1:mdX: dm-58: rescheduling sector 0
> > kernel: md/raid1:mdX: redirecting sector 0 to other mirror: dm-58
> > kernel: md/raid1:mdX: dm-58: rescheduling sector 0
> > kernel: md/raid1:mdX: redirecting sector 0 to other mirror: dm-58
> > [... 10 rescheduling/redirecting pairs ...]
> > kernel: md/raid1:mdX: dm-58: Raid device exceeded read_error threshold [cur 21:max 20]
> > kernel: md/raid1:mdX: dm-58: Failing raid device
> > kernel: md/raid1:mdX: Disk failure on dm-58, disabling device.
> > kernel: md/raid1:mdX: Operation continuing on 1 devices.
> >
> > dmeventd: WARNING: Device #0 of raid1 array, vg0-iris_boot, has failed.
> > dmeventd: WARNING: Waiting for resynchronization to finish before
> > initiating repair on RAID device vg0-iris_boot.
> > dmeventd: Use 'lvconvert --repair vg0/iris_boot' to replace failed device.
> >
> > Subsequent "lvs -a":
> >
> > WARNING: RaidLV vg0/iris_boot needs to be refreshed!
> > See character 'r' at position 9 in the RaidLV's attributes and its SubLV(s).
> >
> > dmesg | grep nvme is EMPTY on this path. The NVMe driver is not
> > involved in producing the error; the failure originates between the
> > virtio-blk bio submission and raid1_end_read_request().
> >
> > Symptom on legacy dm-mirror (pre-conversion --type mirror)
> > ----------------------------------------------------------
> > Same workload on drivers/md/dm-raid1.c reaches the NVMe controller
> > as a zero-length READ and panics the host through dm-mirror's
> > recovery path:
> >
> > kernel: operation not supported error, dev nvme1n1, sector 935446535 op 0x0:(READ) flags 0x0 phys_seg 0 prio class 2
> > kernel: nvme1n1: I/O Cmd(0x2) @ LBA 935446535, 0 blocks, I/O Error (sct 0x0 / sc 0x2)
> > [... 10+ identical bursts at same timestamp ...]
> > dmeventd: Primary mirror device 252:58 read failed.
> > dmeventd: vg0-iris_boot is now in-sync.
> > [kernel oops in dm_mirror recovery path, full trace lost to console flash]
> >
> > The "phys_seg 0", "0 blocks", "sct 0x0/sc 0x2" trio (NVMe Generic,
> > Invalid Field in Command, NVMe spec 4.1.1.2) is unambiguous: a bio
> > with bi_iter.bi_size == 0 and bi_vcnt == 0 left the block layer and
> > hit the controller. dm-raid raid1 hides this by retrying on the
> > surviving leg, but the upstream-of-md trigger is identical.
> >
> > Bisect
> > ------
> > git bisect, v6.12..v6.18, 16 deterministic GOOD/BAD steps, no skips,
> > ~104 minutes:
> >
> > 5ff3f74e145adc79b49668adb8de276446acf6be is the first bad commit
> > block: simplify direct io validity check
> >
> > --- a/block/fops.c
> > +++ b/block/fops.c
> > @@ -38,8 +38,8 @@ static blk_opf_t dio_bio_write_op(struct kiocb *iocb)
> > static bool blkdev_dio_invalid(struct block_device *bdev, struct kiocb *iocb,
> > struct iov_iter *iter)
> > {
> > - return iocb->ki_pos & (bdev_logical_block_size(bdev) - 1) ||
> > - !bdev_iter_is_aligned(bdev, iter);
> > + return (iocb->ki_pos | iov_iter_count(iter)) &
> > + (bdev_logical_block_size(bdev) - 1);
> > }
> >
> > The dropped bdev_iter_is_aligned() used to walk the iov_iter and
> > reject per-segment misaligned/degenerate vectors at the blkdev fops
> > entry point. The replacement only validates ki_pos and total length
> > against the logical block size. Cases that now pass that no longer
> > get rejected:
> >
> > - iter with iov_iter_count(iter) == 0 (degenerate; total length is
> > "sector-aligned" since 0 % 512 == 0)
> > - iter where total length is sector-aligned but a segment isn't
> >
> > The commit message justifies the removal with "The block layer
> > checks all the segments for validity later". This is true for the
> > io_uring submit path (which enters __blkdev_direct_IO directly and
> > does its own validation) but not for the libaio aio_read/write_iter
> > or the worker-pool sync read/write_iter paths that enter via
> > blkdev_{read,write}_iter() -> blkdev_dio_invalid(). For those paths,
> > the segment check has no replacement.
> >
> > Reproducing
> > -----------
> >
> > The trigger requires QEMU virtio-blk's specific submission shape AND
> > a non-io_uring submit. Userspace libaio alone, userspace
> > preadv-in-a-thread alone, and QEMU's raw-driver open probes (which
> > qemu-img info exercises identically) are all insufficient. The
> > combination that hits the bug is "guest-driven I/O through
> > virtio-blk-pci with cache.direct=on and aio in {native, threads}".
> >
> > #regzbot introduced: 5ff3f74e145adc79b49668adb8de276446acf6be
> >
> > Thanks,
> > Vjaceslavs Klimovs
> >
>
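
As a userspace illustration of the gap described in the report quoted
above, the sketch below puts a model of the post-commit check next to
a simplified per-segment check. struct seg, total_len(),
new_check_invalid() and per_segment_invalid() are hypothetical names
made up for this example. per_segment_invalid() is only a rough model
of a per-segment walk, not the kernel's bdev_iter_is_aligned(), and
addr_mask is an assumed address-alignment mask.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for one segment of an I/O vector. */
struct seg {
	uintptr_t addr;	/* buffer address */
	size_t len;	/* segment length in bytes */
};

/* Sum of all segment lengths, similar in spirit to iov_iter_count(). */
static size_t total_len(const struct seg *v, size_t n)
{
	size_t sum = 0;

	for (size_t i = 0; i < n; i++)
		sum += v[i].len;
	return sum;
}

/*
 * Model of the check after the commit: only the file position and the
 * total length are tested against the logical block size (lbs).
 */
static bool new_check_invalid(uint64_t pos, const struct seg *v, size_t n,
			      unsigned int lbs)
{
	return ((pos | total_len(v, n)) & (lbs - 1)) != 0;
}

/*
 * Simplified model of a per-segment check (an assumption, not kernel
 * code): the position and every segment length must be a multiple of
 * lbs, and every non-empty segment address must satisfy addr_mask.
 */
static bool per_segment_invalid(uint64_t pos, const struct seg *v, size_t n,
				unsigned int lbs, unsigned int addr_mask)
{
	if (pos & (lbs - 1))
		return true;
	for (size_t i = 0; i < n; i++) {
		if (v[i].len & (lbs - 1))
			return true;
		if (v[i].len && (v[i].addr & addr_mask))
			return true;
	}
	return false;
}
```

With a 512-byte logical block size, two 256-byte segments at offset 0
pass new_check_invalid() because the total is 512, while
per_segment_invalid() flags them. An empty vector also passes
new_check_invalid(), since 0 & 511 == 0.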