Re: Kernel 6.7 regression doesn't boot if using AMD eGPU

From: Eric Wagner
Date: Mon Apr 15 2024 - 14:59:47 EST


Apologies if I made a mistake in the first bisect; I'm new to kernel debugging.

I tested cedc811c76778bdef91d405717acee0de54d8db5 (x86/amd) and 3613047280ec42a4e1350fdc1a6dd161ff4008cc (core) directly and both were good.
Then I ran git bisect again with e8cca466a84a75f8ff2a7a31173c99ee6d1c59d2 as the bad commit and 6e6c6d6bc6c96c2477ddfea24a121eb5ee12b7a3 as the good commit; the bisect log is attached. It ended up at the same commit as before.

I've also attached a picture of the boot screen shown when it hangs. 0000:05:00.0 is the PCIe bus address of the RX 580 eGPU that's causing the problem.
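
For readers unfamiliar with the notation, a PCI address such as the one above has the form domain:bus:device.function. A minimal sketch of splitting it in POSIX shell follows; the variable names are illustrative assumptions and are not taken from any tool in this thread.

```shell
# PCI address quoted in this message.
addr="0000:05:00.0"

# Split domain:bus:device.function using POSIX parameter expansion.
domain=${addr%%:*}      # text before the first ':'
rest=${addr#*:}         # text after the first ':'
bus=${rest%%:*}         # text before the next ':'
devfn=${rest#*:}        # "device.function"
device=${devfn%.*}      # text before the '.'
func=${devfn#*.}        # text after the '.'

echo "domain=$domain bus=$bus device=$device function=$func"
```

On a running system, `lspci -s 05:00.0` should list the device at that address, assuming pciutils is installed.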

On Mon, Apr 15, 2024 at 12:30 PM Jason Gunthorpe <jgg@xxxxxxxx> wrote:
On Sat, Apr 13, 2024 at 06:04:12PM -0400, Eric Wagner wrote:
>    I have a Thinkpad T14s G3 AMD (Ryzen 7 6850U) laptop connected to an
>    AMD RX 580 in an Akitio Node Thunderbolt 3 eGPU. Booting with the eGPU
>    connected hangs on kernels 6.7 and 6.8, but worked on 6.6. For
>    debugging, I found that adding the kernel parameter amd_iommu=off seems
>    to fix the issue and allows booting with the eGPU on 6.7.
>    I tried bisecting the issue between 6.6 and 6.7 and ended up with:
>    "e8cca466a84a75f8ff2a7a31173c99ee6d1c59d2 is the first bad commit" in
>    the attached. This seems to indicate an amd iommu issue.
>    Two others also reported the same issue on AMD Ryzen 7 7840 with AMD RX
>    6000 connected as eGPU
>    ([1]https://gitlab.freedesktop.org/drm/amd/-/issues/3182).
>    Let me know if you need more information.
>
> References
>
>    1. https://gitlab.freedesktop.org/drm/amd/-/issues/3182

> Bisecting: 366 revisions left to test after this (roughly 9 steps)
> [74e9347ebc5be452935fe4f3eddb150aa5a6f4fe] Merge tag 'loongarch-fixes-6.6-3' of git://git.kernel.org/pub/scm/linux/kernel/git/chenhuacai/linux-loongson
> Bisecting: 182 revisions left to test after this (roughly 8 steps)
> [f6176471542d991137543af2ef1c18dae3286079] Merge tag 'mtd/fixes-for-6.6-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/mtd/linux
> Bisecting: 87 revisions left to test after this (roughly 7 steps)
> [fe3cfe869d5e0453754cf2b4c75110276b5e8527] Merge tag 'phy-fixes-6.6' of git://git.kernel.org/pub/scm/linux/kernel/git/phy/linux-phy
> Bisecting: 43 revisions left to test after this (roughly 6 steps)
> [c76c067e488ccd55734c3e750799caf2c5956db6] s390/pci: Use dma-iommu layer
> Bisecting: 27 revisions left to test after this (roughly 5 steps)
> [aa5cabc4ce8e6b45d170d162dc54b1bac1767c47] Merge tag 'arm-smmu-updates' of git://git.kernel.org/pub/scm/linux/kernel/git/will/linux into arm/smmu
> Bisecting: 14 revisions left to test after this (roughly 4 steps)
> [bbc70e0aec287e164344b1a071bd46466a4f29b3] iommu/dart: Remove the force_bypass variable
> Bisecting: 9 revisions left to test after this (roughly 3 steps)
> [e82c175e63229ea495a0a0b5305a98b5b6ee5346] Revert "iommu/vt-d: Remove unused function"
> Bisecting: 5 revisions left to test after this (roughly 2 steps)
> [92bce97f0c341d3037b0f364b6839483f6a41cae] s390/pci: Fix reset of IOMMU software counters
> Bisecting: 3 revisions left to test after this (roughly 2 steps)
> [3613047280ec42a4e1350fdc1a6dd161ff4008cc] Merge tag 'v6.6-rc7' into core
> Bisecting: 2 revisions left to test after this (roughly 1 step)
> [f7da9c081517daba70f9f9342e09d7a6322ba323] iommu/tegra-smmu: Drop unnecessary error check for for debugfs_create_dir()
> Bisecting: 1 revision left to test after this (roughly 1 step)
> [9e13ec61de2a51195b122a79461431d8cb99d7b5] iommu/virtio: Add __counted_by for struct viommu_request and use struct_size()
> Bisecting: 0 revisions left to test after this (roughly 0 steps)
> [6e6c6d6bc6c96c2477ddfea24a121eb5ee12b7a3] iommu: Avoid unnecessary cache invalidations
> e8cca466a84a75f8ff2a7a31173c99ee6d1c59d2 is the first bad commit
> commit e8cca466a84a75f8ff2a7a31173c99ee6d1c59d2
> Merge: 6e6c6d6bc6 f7da9c0815 aa5cabc4ce 9e13ec61de e82c175e63 cedc811c76 3613047280 92bce97f0c
> Author: Joerg Roedel <jroedel@xxxxxxx>
> Date:   Fri Oct 27 09:13:40 2023 +0200
>
>     Merge branches 'iommu/fixes', 'arm/tegra', 'arm/smmu', 'virtio', 'x86/vt-d', 'x86/amd', 'core' and 's390' into next

Do you have the good/bad log on this? It doesn't look like bisect
tested enough stuff to really conclude the merge is the bad thing; at
a minimum it should be testing all the bases of the merge. Do you have
--first-parent set or something?

I would test cedc811c76778bdef91d405717acee0de54d8db5 (x86/amd) and
3613047280ec42a4e1350fdc1a6dd161ff4008cc (core) directly. Most likely
cedc will be the bad one.

If one of them is bad then restart the bisection with that as the bad
and 6e6c6d6bc6 as the good.

(or run bisect again with e8cca466a84a75f8ff2a7a31173c99ee6d1c59d2 as
the bad and 6e6c6d6bc6c96c2477ddfea24a121eb5ee12b7a3 as the good
without --first-parent)

Jason

git bisect start
# bad: [e8cca466a84a75f8ff2a7a31173c99ee6d1c59d2] Merge branches 'iommu/fixes', 'arm/tegra', 'arm/smmu', 'virtio', 'x86/vt-d', 'x86/amd', 'core' and 's390' into next
git bisect bad e8cca466a84a75f8ff2a7a31173c99ee6d1c59d2
# good: [6e6c6d6bc6c96c2477ddfea24a121eb5ee12b7a3] iommu: Avoid unnecessary cache invalidations
git bisect good 6e6c6d6bc6c96c2477ddfea24a121eb5ee12b7a3
# good: [482feb5c649261cd2a7ad02e4ca63c159d6ec795] iommu/dart: Call apple_dart_finalize_domain() as part of alloc_paging()
git bisect good 482feb5c649261cd2a7ad02e4ca63c159d6ec795
# good: [cedc811c76778bdef91d405717acee0de54d8db5] iommu/amd: Remove DMA_FQ type from domain allocation path
git bisect good cedc811c76778bdef91d405717acee0de54d8db5
# good: [aa5cabc4ce8e6b45d170d162dc54b1bac1767c47] Merge tag 'arm-smmu-updates' of git://git.kernel.org/pub/scm/linux/kernel/git/will/linux into arm/smmu
git bisect good aa5cabc4ce8e6b45d170d162dc54b1bac1767c47
# good: [92bce97f0c341d3037b0f364b6839483f6a41cae] s390/pci: Fix reset of IOMMU software counters
git bisect good 92bce97f0c341d3037b0f364b6839483f6a41cae
# good: [e82c175e63229ea495a0a0b5305a98b5b6ee5346] Revert "iommu/vt-d: Remove unused function"
git bisect good e82c175e63229ea495a0a0b5305a98b5b6ee5346
# good: [3613047280ec42a4e1350fdc1a6dd161ff4008cc] Merge tag 'v6.6-rc7' into core
git bisect good 3613047280ec42a4e1350fdc1a6dd161ff4008cc
# good: [f7da9c081517daba70f9f9342e09d7a6322ba323] iommu/tegra-smmu: Drop unnecessary error check for for debugfs_create_dir()
git bisect good f7da9c081517daba70f9f9342e09d7a6322ba323
# good: [9e13ec61de2a51195b122a79461431d8cb99d7b5] iommu/virtio: Add __counted_by for struct viommu_request and use struct_size()
git bisect good 9e13ec61de2a51195b122a79461431d8cb99d7b5
# first bad commit: [e8cca466a84a75f8ff2a7a31173c99ee6d1c59d2] Merge branches 'iommu/fixes', 'arm/tegra', 'arm/smmu', 'virtio', 'x86/vt-d', 'x86/amd', 'core' and 's390' into next
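
As a sanity check on the log above, the sketch below compares the abbreviated parents from the "Merge:" line of e8cca466a84a with the commits the log marks as good. The hashes are copied from this thread; the variable and function names are illustrative assumptions and not part of git.

```shell
# Abbreviated parents from the "Merge:" line of e8cca466a84a.
parents="6e6c6d6bc6 f7da9c0815 aa5cabc4ce 9e13ec61de e82c175e63 cedc811c76 3613047280 92bce97f0c"

# Commits the bisect log records as good (hashes copied from the log).
good_log="6e6c6d6bc6c96c2477ddfea24a121eb5ee12b7a3
482feb5c649261cd2a7ad02e4ca63c159d6ec795
cedc811c76778bdef91d405717acee0de54d8db5
aa5cabc4ce8e6b45d170d162dc54b1bac1767c47
92bce97f0c341d3037b0f364b6839483f6a41cae
e82c175e63229ea495a0a0b5305a98b5b6ee5346
3613047280ec42a4e1350fdc1a6dd161ff4008cc
f7da9c081517daba70f9f9342e09d7a6322ba323
9e13ec61de2a51195b122a79461431d8cb99d7b5"

# Print each merge parent that has no matching "good" entry.
untested_parents() {
    for p in $parents; do
        printf '%s\n' "$good_log" | grep -q "^$p" || printf '%s\n' "$p"
    done
}

missing=$(untested_parents)
if [ -z "$missing" ]; then
    echo "all 8 merge parents were marked good"
else
    echo "untested parents: $missing"
fi
```

With every parent marked good and only the merge marked bad, the log would suggest that the hang comes from the combination of the merged branches rather than from a single commit on one of them, though that reading assumes each good/bad verdict was reliable.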

Attachment: 20240415_133212.jpg
Description: JPEG image