Re: Kernel 6.7 regression doesn't boot if using AMD eGPU

From: Robin Murphy
Date: Mon Apr 15 2024 - 17:44:50 EST


On 2024-04-15 7:57 pm, Eric Wagner wrote:
> Apologies if I made a mistake in the first bisect, I'm new to kernel
> debugging.
>
> I tested cedc811c76778bdef91d405717acee0de54d8db5 (x86/amd) and
> 3613047280ec42a4e1350fdc1a6dd161ff4008cc (core) directly and both were good.
> Then I ran git bisect again with e8cca466a84a75f8ff2a7a31173c99ee6d1c59d2
> as the bad and 6e6c6d6bc6c96c2477ddfea24a121eb5ee12b7a3 as the good and the
> bisect log is attached. It ended up at the same commit as before.
>
> I've also attached a picture of the boot screen that occurs when it hangs.
> 0000:05:00.0 is the PCIe bus address of the RX 580 eGPU that's causing the
> problem.

Looks like 59ddce4418da483 probably broke things most - prior to that, the fact that it's behind a Thunderbolt port would have always taken precedence and forced IOMMU_DOMAIN_DMA regardless of what the driver may have wanted to say, whereas now we ask the driver first, then complain that it conflicts with the untrusted status and ultimately don't configure the IOMMU at all. Meanwhile the GPU driver presumably goes on to believe it's using dma-direct with no IOMMU present, resulting in fireworks when its traffic reaches the IOMMU. Great :(
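
To make that ordering change concrete, here is a minimal userspace sketch of the two decision flows as described above. It is not the kernel code: the enum, the REFUSE value and both function names are hypothetical stand-ins chosen for illustration, and a driver_type of 0 is assumed to mean "driver has no preference".

```c
/* Hypothetical stand-ins for the kernel's domain types (illustrative values). */
enum domain_type {
	DOMAIN_NONE = 0,	/* assumed: driver expresses no preference */
	DOMAIN_IDENTITY,
	DOMAIN_DMA,
};

/* Hypothetical marker for "refuse to probe", mirroring the -1 return. */
#define REFUSE (-1)

/*
 * Earlier ordering: an untrusted device is forced to a DMA domain
 * before the driver's preference is even looked at.
 */
int pick_before(int untrusted, enum domain_type driver_type)
{
	if (untrusted)
		return DOMAIN_DMA;
	return driver_type;
}

/*
 * Later ordering: the driver's preference is collected first, and a
 * conflict with untrusted status is treated as an error, so the caller
 * ends up with no IOMMU configuration at all.
 */
int pick_after(int untrusted, enum domain_type driver_type)
{
	if (untrusted) {
		if (driver_type && driver_type != DOMAIN_DMA)
			return REFUSE;
		return DOMAIN_DMA;
	}
	return driver_type;
}
```

For an untrusted device whose driver asks for an identity domain, pick_before() yields DOMAIN_DMA while pick_after() yields REFUSE, which is the case being hit here.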

However the other notable thing that also happened between 6.6 and 6.7 was the removal of the AMD iommu_v2 code, so there's some possibility that the GPU driver still may have only been working before due to that also subverting the default domain with its own identity domain, so whether it would actually work again with iommu_get_default_domain_type() sorted out is yet another question... As a first step I'd test the quick hack below, but be prepared for things to still break slightly differently.

Cheers,
Robin.

----->8-----
diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index 996e79dc582d..063e1eb32fbd 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -1774,7 +1774,7 @@ static int iommu_get_default_domain_type(struct iommu_group *group,
 				untrusted,
 				"Device is not trusted, but driver is overriding group %u to %s, refusing to probe.\n",
 				group->id, iommu_domain_type_str(driver_type));
-			return -1;
+			//return -1;
 		}
 		driver_type = IOMMU_DOMAIN_DMA;
 	}