Re: Kernel 6.7 regression doesn't boot if using AMD eGPU

From: Jason Gunthorpe
Date: Mon Apr 15 2024 - 20:39:15 EST


On Mon, Apr 15, 2024 at 10:44:34PM +0100, Robin Murphy wrote:
> On 2024-04-15 7:57 pm, Eric Wagner wrote:
> > Apologies if I made a mistake in the first bisect, I'm new to kernel
> > debugging.
> >
> > I tested cedc811c76778bdef91d405717acee0de54d8db5 (x86/amd) and
> > 3613047280ec42a4e1350fdc1a6dd161ff4008cc (core) directly and both were good.
> > Then I ran git bisect again with e8cca466a84a75f8ff2a7a31173c99ee6d1c59d2
> > as the bad and 6e6c6d6bc6c96c2477ddfea24a121eb5ee12b7a3 as the good and the
> > bisect log is attached. It ended up at the same commit as before.
> >
> > I've also attached a picture of the boot screen that occurs when it hangs.
> > 0000:05:00.0 is the PCIe bus address of the RX 580 eGPU that's causing the
> > problem.
>
> Looks like 59ddce4418da483 probably broke things most - prior to that, the
> fact that it's behind a Thunderbolt port would have always taken precedence
> and forced IOMMU_DOMAIN_DMA regardless of what the driver may have wanted to
> say, whereas now we ask the driver first, then complain that it conflicts
> with the untrusted status and ultimately don't configure the IOMMU
> at all.

Yes, if the driver wants to force a domain type it should be
forced. The driver knows best, e.g. AMD forces IDENTITY when the HW/driver
is incapable of supporting otherwise.
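
To make the ordering change discussed above concrete, here is a minimal
sketch in plain C. It is only a model: the names pick_type_old,
pick_type_new, and the TYPE_* constants are hypothetical and are not the
kernel's identifiers, and the real core code handles whole groups rather
than a single device.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for domain types; 0 means "no preference". */
enum domain_type { TYPE_ANY = 0, TYPE_IDENTITY, TYPE_DMA };

#define TYPE_CONFLICT (-1)

/* Older ordering: untrusted status takes precedence over the driver. */
static int pick_type_old(bool untrusted, int driver_type)
{
	if (untrusted)
		return TYPE_DMA;
	return driver_type;
}

/*
 * Newer ordering: the driver is asked first, and a forced type that
 * disagrees with the untrusted status is reported as a conflict.
 */
static int pick_type_new(bool untrusted, int driver_type)
{
	if (untrusted) {
		if (driver_type != TYPE_ANY && driver_type != TYPE_DMA)
			return TYPE_CONFLICT;
		return TYPE_DMA;
	}
	return driver_type;
}
```

In this model an untrusted device whose driver forces IDENTITY gets
TYPE_DMA under the old ordering and TYPE_CONFLICT under the new one,
which corresponds to the "don't configure the IOMMU at all" outcome.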

> Meanwhile the GPU driver presumably goes on to believe it's using dma-direct
> with no IOMMU present, resulting in fireworks when its traffic reaches the
> IOMMU. Great :(

I wonder where the missing error handling is. An iommu probe failure
should not go on to allow driver attach; there is no guarantee any DMA
works now that many iommus are booting up in BLOCKED.

> However the other notable thing that also happened between 6.6 and 6.7 was
> the removal of the AMD iommu_v2 code, so there's some possibility that the
> GPU driver still may have only been working before due to that also

Most likely it is the above change interacting with this patch when
they are both combined in the merge:

commit 92e2bd56a5f9fc44313fda802a43a63cc2a9c8f6
Author: Vasant Hegde <vasant.hegde@xxxxxxx>
Date: Thu Sep 21 09:21:45 2023 +0000

iommu/amd: Introduce iommu_dev_data.flags to track device capabilities

@@ -2471,7 +2481,7 @@ static int amd_iommu_def_domain_type(struct device *dev)
 	 *    and require remapping.
 	 *  - SNP is enabled, because it prohibits DTE[Mode]=0.
 	 */
-	if (dev_data->iommu_v2 &&
+	if (pdev_pasid_supported(dev_data) &&
 	    !cc_platform_has(CC_ATTR_MEM_ENCRYPT) &&
 	    !amd_iommu_snp_en) {
 		return IOMMU_DOMAIN_IDENTITY;

Which, IIRC, was intended to be temporary to work around limitations
in the DTE programming logic within the driver. Previously iommu_v2 was
a module option that Eric probably doesn't set, I guess.
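
As a rough illustration of why swapping the condition matters, the
sketch below models the quoted hunk in plain C. Everything here is
hypothetical (struct dev_model, def_type_old, def_type_new), and it
assumes, as guessed above, that the old flag was commonly unset while
the device itself is PASID capable.

```c
#include <stdbool.h>

/* Hypothetical return values; 0 means "no preference", as in the hunk. */
enum { NO_PREFERENCE = 0, FORCE_IDENTITY = 1 };

struct dev_model {
	bool v2_flag;         /* models the old dev_data->iommu_v2 test */
	bool pasid_supported; /* models the new pdev_pasid_supported() test */
};

/* Condition as it read before the quoted commit. */
static int def_type_old(const struct dev_model *d, bool mem_encrypt, bool snp)
{
	return (d->v2_flag && !mem_encrypt && !snp) ? FORCE_IDENTITY
						    : NO_PREFERENCE;
}

/* Condition as it reads after the quoted commit. */
static int def_type_new(const struct dev_model *d, bool mem_encrypt, bool snp)
{
	return (d->pasid_supported && !mem_encrypt && !snp) ? FORCE_IDENTITY
							    : NO_PREFERENCE;
}
```

Under that assumption, a PASID-capable GPU with the old flag unset
moves from "no preference" to a forced IDENTITY default, and that
forced type is what then collides with the untrusted status.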

The below will probably make it boot, but Vasant should check what
happens if PASID is eventually attached too.

diff --git a/drivers/iommu/amd/iommu.c b/drivers/iommu/amd/iommu.c
index d35c1b8c8e65ce..f3da6a5b6cb1cb 100644
--- a/drivers/iommu/amd/iommu.c
+++ b/drivers/iommu/amd/iommu.c
@@ -2758,11 +2758,16 @@ static int amd_iommu_def_domain_type(struct device *dev)
 	 *    and require remapping.
 	 *  - SNP is enabled, because it prohibits DTE[Mode]=0.
 	 */
-	if (pdev_pasid_supported(dev_data) &&
-	    !cc_platform_has(CC_ATTR_MEM_ENCRYPT) &&
-	    !amd_iommu_snp_en) {
+	if (!cc_platform_has(CC_ATTR_MEM_ENCRYPT) && !amd_iommu_snp_en)
+		return IOMMU_DOMAIN_IDENTITY;
+
+	/*
+	 * For now driver limitations prevent enabling PASID as a paging domain
+	 * on the RID together.
+	 */
+	if (dev_is_pci(dev) && !to_pci_dev(dev)->untrusted &&
+	    pdev_pasid_supported(dev_data))
 		return IOMMU_DOMAIN_IDENTITY;
-	}
 
 	return 0;
 }

Jason