Re: [REGRESSION] GPU passes into VM improperly after c376a3456d8b or a98db518dde2
From: 70sp
Date: Thu Apr 23 2026 - 05:33:07 EST
Hello,
Sending a friendly reminder about this ongoing regression.
Thank you for your attention.
On Tuesday, April 14th, 2026 at 11:22 AM, 70sp <70sp@xxxxxxxxxxxxxx> wrote:
> I can confirm that the "domain is not compatible with device" message never appears.
>
> I double-checked by adding an else branch with a different message, and that one showed up several times: for pci (iGPU) 0000:00:02.0, pcieport 0000:00:01.0, and vfio-pci (GTX 970) 0000:01:00.0 and 0000:01:00.1. In every case ret = 0.
>
> On Monday, April 13th, 2026 at 8:49 AM, Baolu Lu <baolu.lu@xxxxxxxxxxxxxxx> wrote:
>
> > On 4/12/26 19:17, 70sp wrote:
> > > Hello,
> > >
> > > I have been dealing with a regression launching a Windows QEMU/KVM
> > > virtual machine with a GPU passed through.
> > >
> > > When I launch the QEMU/KVM VM, it gets stuck on a white screen for
> > > about 2 minutes while booting, and Windows then reports NVIDIA’s
> > > code 43.
> > >
> > > I’m certain that the issue is not caused by anything in Windows or
> > > related software in Linux, because I tried reinstalling my whole PC,
> > > including the Windows VM. I also tried to reproduce the bug on an
> > > out-of-the-box Arch Linux install, and it is still present.
> > >
> > > The first bad commit is either a98db518dde246e01ead53617dc0a30d6aaa3752
> > > or c376a3456d8bef43ec556a98c0a04c35086c2737. I don’t know for sure which
> > > one introduced it, because during bisection I had to skip
> > > a98db518dde246e01ead53617dc0a30d6aaa3752: on that commit the VM failed
> > > to launch with a different error (it didn’t even start booting). On
> > > kernels before these commits, the VM works flawlessly.
> > >
> > > I have tested the latest mainline kernel and the issue is still
> > > present. I have been experiencing the issue since kernel 6.13, so I
> > > switched to the 6.12 LTS kernel, which doesn’t have this issue.
> > >
> > > Configuration of my Linux install and hardware: https://pastebin.com/rcsyyYiK
> > > .config: https://pastebin.com/RTQCBduD
> > > dmesg errors: https://pastebin.com/84jPP81E
> > > lspci: https://pastebin.com/qi29BSWi
> > >
> > > #regzbot introduced: a98db518dde246e01ead53617dc0a30d6aaa3752..c376a3456d8bef43ec556a98c0a04c35086c2737
> >
> > Before these commits, if a device was attached to a domain that didn't
> > perfectly match the hardware's capabilities (such as address width or
> > coherency), the kernel would dynamically adjust the domain to
> > accommodate the hardware.
> >
> > Following these two commits, the driver now applies a "match or fail"
> > policy. If the domain is incompatible with the device's hardware
> > capabilities, it returns -EINVAL. This expects the caller to allocate a
> > new domain dedicated to that specific device and attempt the attachment
> > again.
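
For readers following along, the "match or fail" contract described above can be sketched in plain userspace C. This is only a toy model under loud assumptions: every name here (toy_domain, toy_device, toy_attach, toy_attach_with_fallback) is hypothetical and is not the kernel's IOMMU API, and address width is the only capability modeled.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-ins for a paging domain and a device's IOMMU limits. */
struct toy_domain { int addr_width; };
struct toy_device { int max_addr_width; };

/*
 * "Match or fail": the domain is never adjusted to fit the hardware.
 * A domain wider than the device can handle is rejected with -EINVAL.
 */
static int toy_attach(const struct toy_domain *dom,
		      const struct toy_device *dev)
{
	if (dom->addr_width > dev->max_addr_width)
		return -EINVAL;
	return 0;
}

/*
 * Caller side: try the shared domain first. On -EINVAL, set up a
 * domain dedicated to this device and attempt the attachment again.
 */
static int toy_attach_with_fallback(const struct toy_domain *shared,
				    struct toy_domain *dedicated,
				    const struct toy_device *dev,
				    int *used_dedicated)
{
	int ret;

	assert(shared != NULL && dedicated != NULL);
	assert(dev != NULL && used_dedicated != NULL);

	*used_dedicated = 0;
	ret = toy_attach(shared, dev);
	if (ret != -EINVAL)
		return ret;

	/* Size the new domain to what this device supports, then retry. */
	dedicated->addr_width = dev->max_addr_width;
	*used_dedicated = 1;
	return toy_attach(dedicated, dev);
}
```

In this model, attaching a 39-bit device to a 48-bit shared domain fails with -EINVAL on the first attempt and succeeds on the retry with the dedicated domain. If the caller never retries, the device is left unattached, which is the failure mode the debug message in the patch below is meant to detect.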
> >
> > Can you please add a message line in paging_domain_compatible() to
> > verify whether it's a domain compatibility issue?
> >
> > diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
> > index 205debd76989..c7e1e0dfa250 100644
> > --- a/drivers/iommu/intel/iommu.c
> > +++ b/drivers/iommu/intel/iommu.c
> > @@ -3111,8 +3111,10 @@ int paging_domain_compatible(struct iommu_domain *domain, struct device *dev)
> >  		ret = paging_domain_compatible_second_stage(dmar_domain, iommu);
> >  	else if (WARN_ON(true))
> >  		ret = -EINVAL;
> > -	if (ret)
> > +	if (ret) {
> > +		dev_info(dev, "domain is not compatible with device, ret = %d", ret);
> >  		return ret;
> > +	}
> >
> >  	if (sm_supported(iommu) && !dev_is_real_dma_subdevice(dev) &&
> >  	    context_copied(iommu, info->bus, info->devfn))
> >
> > Thanks,
> > baolu
> >