Re: [REGRESSION] GPU passes into VM improperly after c376a3456d8b or a98db518dde2
From: Baolu Lu
Date: Mon Apr 27 2026 - 03:21:00 EST
On 4/14/26 17:22, 70sp wrote:
I can confirm that the "domain is not compatible with device" message is nowhere to be seen.
I double-checked by adding an else branch with a different message, and that one showed up several times (for pci (iGPU) 0000:00:02.0, pcieport 0000:00:01.0, and vfio-pci (GTX 970) 0000:01:00.0 and 0000:01:00.1), each time with ret = 0.
Hmm, it seems the domain is compatible with the device hardware and was
attached successfully. Perhaps you can try to check the differences
between these two domain attachments by dumping the root, context, and
PASID table entries and comparing the configurations of the success and
failure cases.
To do this, simply apply the change below with CONFIG_DMAR_DEBUG
enabled:
diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
index 4d0e65bc131d..bf303cfcf2ee 100644
--- a/drivers/iommu/intel/iommu.c
+++ b/drivers/iommu/intel/iommu.c
@@ -1345,6 +1345,9 @@ static int dmar_domain_attach_device(struct dmar_domain *domain,
 	if (ret)
 		goto out_block_translation;
 
+	dmar_fault_dump_ptes(iommu, PCI_DEVID(info->bus, info->devfn),
+			     0, IOMMU_NO_PASID);
+
 	return 0;
 
 out_block_translation:
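
To match each dump to a device from lspci, it may help to know the 16-bit
source ID that PCI_DEVID(info->bus, info->devfn) produces. Below is a
minimal user-space sketch of that arithmetic. The toy_devfn() and
toy_devid() helpers are hypothetical names for this example; they mirror
the usual bus/slot/function encoding and are not the kernel macros
themselves.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: pack slot (5 bits) and function (3 bits) into devfn. */
static uint8_t toy_devfn(unsigned int slot, unsigned int func)
{
	assert(slot < 32 && func < 8);
	return (uint8_t)(((slot & 0x1f) << 3) | (func & 0x07));
}

/* Hypothetical helper: bus number in the high byte, devfn in the low byte. */
static uint16_t toy_devid(uint8_t bus, uint8_t devfn)
{
	return (uint16_t)(((uint16_t)bus << 8) | devfn);
}
```

With this encoding, the GTX 970 functions at 0000:01:00.0 and 0000:01:00.1
map to source IDs 0x0100 and 0x0101, the iGPU at 0000:00:02.0 to 0x0010,
and the root port at 0000:00:01.0 to 0x0008.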
Thanks,
baolu
On Monday, April 13th, 2026 at 8:49 AM, Baolu Lu <baolu.lu@xxxxxxxxxxxxxxx> wrote:
On 4/12/26 19:17, 70sp wrote:
Hello,
I have been dealing with a regression launching a Windows QEMU/KVM
virtual machine with a GPU passed through.
When launched, the VM hangs on a white screen for about 2 minutes during
boot, and Windows then reports NVIDIA’s Code 43 for the GPU.
I’m certain that the issue is not caused by anything in Windows or
related software in Linux, because I tried reinstalling my whole PC,
including the Windows VM. I also tried to reproduce the bug on an
out-of-the-box Arch Linux install, and the bug is still present.
The first bad commit is either a98db518dde246e01ead53617dc0a30d6aaa3752
or c376a3456d8bef43ec556a98c0a04c35086c2737. I don’t know for sure which
one introduced it, because I had to skip
a98db518dde246e01ead53617dc0a30d6aaa3752 during bisection: at that
commit the virtual machine fails to launch with a different error (it
doesn’t even start booting). On kernels before these commits, the VM
works flawlessly.
I have tested the latest mainline kernel and the issue is still
present. I have been experiencing the issue since kernel 6.13, so I
switched to the 6.12 LTS kernel, which doesn’t have this issue.
Configuration of my Linux install and hardware: https://pastebin.com/rcsyyYiK
.config: https://pastebin.com/RTQCBduD
dmesg errors: https://pastebin.com/84jPP81E
lspci: https://pastebin.com/qi29BSWi
#regzbot introduced: a98db518dde246e01ead53617dc0a30d6aaa3752..c376a3456d8bef43ec556a98c0a04c35086c2737
Before these commits, if a device was attached to a domain that didn't
perfectly match the hardware's capabilities (such as address width or
coherency), the kernel would dynamically adjust the domain to
accommodate the hardware.
Following these two commits, the driver now applies a "match or fail"
policy. If the domain is incompatible with the device's hardware
capabilities, it returns -EINVAL. This expects the caller to allocate a
new domain dedicated to that specific device and attempt the attachment
again.
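
As a rough illustration of the "match or fail" flow described above, here
is a minimal user-space sketch in C. The toy_* names, the two capability
fields, and the compatibility test are assumptions made for this example;
they are not the Intel IOMMU driver's structures or its actual checks.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for hardware capabilities and a paging domain. */
struct toy_hw     { int addr_width; bool coherent; };
struct toy_domain { int addr_width; bool coherent; };

/* "Match or fail": the domain is never adjusted, only accepted or rejected. */
static int toy_attach(const struct toy_domain *d, const struct toy_hw *hw)
{
	assert(d != NULL && hw != NULL);
	if (d->addr_width > hw->addr_width || d->coherent != hw->coherent)
		return -EINVAL;
	return 0;
}

/*
 * Caller side: try the shared domain first; on -EINVAL, set up a domain
 * sized for this device in *used and attempt the attachment again.
 */
static int toy_attach_with_fallback(const struct toy_domain *shared,
				    const struct toy_hw *hw,
				    struct toy_domain *used)
{
	int ret = toy_attach(shared, hw);

	if (ret == 0) {
		*used = *shared;
		return 0;
	}
	if (ret != -EINVAL)
		return ret;

	/* Models allocating a new domain dedicated to this device. */
	used->addr_width = hw->addr_width;
	used->coherent = hw->coherent;
	return toy_attach(used, hw);
}
```

In this sketch, a shared domain with a 48-bit address width is rejected by
hardware that only supports 39 bits, and the caller ends up attached to a
separate 39-bit domain instead of the shared one being modified in place.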
Can you please add a message line in paging_domain_compatible() to
verify whether it's a domain compatibility issue?
diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
index 205debd76989..c7e1e0dfa250 100644
--- a/drivers/iommu/intel/iommu.c
+++ b/drivers/iommu/intel/iommu.c
@@ -3111,8 +3111,10 @@ int paging_domain_compatible(struct iommu_domain *domain, struct device *dev)
 		ret = paging_domain_compatible_second_stage(dmar_domain, iommu);
 	else if (WARN_ON(true))
 		ret = -EINVAL;
-	if (ret)
+	if (ret) {
+		dev_info(dev, "domain is not compatible with device, ret = %d", ret);
 		return ret;
+	}
 
 	if (sm_supported(iommu) && !dev_is_real_dma_subdevice(dev) &&
 	    context_copied(iommu, info->bus, info->devfn))
Thanks,
baolu