Re: FreeNAS VM disk access errors, bisected to commit 6f1a4891a592

From: Thomas Gleixner
Date: Fri Apr 17 2020 - 16:20:02 EST


Marc,

Marc Dionne <marc.c.dionne@xxxxxxxxx> writes:

> Commit 6f1a4891a592 ("x86/apic/msi: Plug non-maskable MSI affinity
> race") causes Linux VMs hosted on FreeNAS (bhyve hypervisor) to lose
> access to their disk devices shortly after boot. The disks are zfs
> zvols on the host, presented to each VM.
>
> Background: I recently updated some fedora 31 VMs running under the
> bhyve hypervisor (hosted on a FreeNAS mini), and they moved to a
> distro 5.5 kernel (5.5.15). Shortly after reboot, the disks became
> inaccessible with any operation getting EIO errors. Booting back into
> a 5.4 kernel, everything was fine. I built a 5.7-rc1 kernel, which
> showed the same symptoms, and was then able to bisect it down to
> commit 6f1a4891a592. Note that the symptoms do not occur on every
> boot, but often enough (roughly 80%) to make bisection possible.
>
> Applying a manual revert of 6f1a4891a592 on top of mainline from
> yesterday gives me a kernel that works fine.

We tested on real hardware and on various hypervisors to verify that the
fix actually works correctly.

That makes me assume that the staged approach of changing affinity for
this non-maskable MSI mess makes your particular hypervisor unhappy.

Are there any messages like this:

"do_IRQ: 0.83 No irq handler for vector"

in dmesg on the Linux side? If they occur, they should show up before the
disk timeout happens.
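
A minimal sketch for checking this, assuming the guest's kernel log is
available as text (for example from dmesg). The names
find_spurious_vectors and read_dmesg are made up for illustration, and
reading the two numbers in the message as CPU and vector is an
assumption about the message format. The reporting function name is
matched loosely because it may differ between kernel versions.

```python
import re
import subprocess

# Matches e.g. "do_IRQ: 0.83 No irq handler for vector", optionally
# preceded by a dmesg timestamp such as "[   12.345678]".
PATTERN = re.compile(
    r"(?:\[\s*(?P<ts>\d+\.\d+)\]\s*)?"
    r"(?P<func>\w+): (?P<cpu>\d+)\.(?P<vector>\d+) "
    r"No irq handler for vector"
)


def find_spurious_vectors(lines):
    """Return (timestamp, cpu, vector) for each matching log line.

    timestamp is None when the line carries no "[seconds]" prefix.
    """
    hits = []
    for line in lines:
        m = PATTERN.search(line)
        if m is None:
            continue
        ts = float(m.group("ts")) if m.group("ts") else None
        hits.append((ts, int(m.group("cpu")), int(m.group("vector"))))
    return hits


def read_dmesg():
    """Read the kernel log via the dmesg command (may require root)."""
    result = subprocess.run(
        ["dmesg"], capture_output=True, text=True, check=True
    )
    return result.stdout.splitlines()
```

Calling find_spurious_vectors(read_dmesg()) in the guest lists any such
messages. The timestamps can then be compared by hand against the first
disk error in the same log.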

I have absolutely zero knowledge about bhyve, so I suggest talking to the
bhyve experts about this.

Thanks,

tglx