Re: Linux v5.5 serious PCI bug
From: Nicholas Johnson
Date: Tue Dec 10 2019 - 07:01:12 EST
On Tue, Dec 10, 2019 at 09:28:00AM +0200, mika.westerberg@xxxxxxxxxxxxxxx wrote:
> On Mon, Dec 09, 2019 at 01:33:49PM +0000, Nicholas Johnson wrote:
> > On Mon, Dec 09, 2019 at 03:12:39PM +0200, mika.westerberg@xxxxxxxxxxxxxxx wrote:
> > > On Mon, Dec 09, 2019 at 12:34:04PM +0000, Nicholas Johnson wrote:
> > > > Hi,
> > > >
> > > > I have compiled Linux v5.5-rc1 and thought all was good until I
> > > > hot-removed a Gigabyte Aorus eGPU from Thunderbolt. The driver for the
> > > > GPU was not loaded (blacklisted) so the crash is nothing to do with the
> > > > GPU driver.
> > > >
> > > > We had:
> > > > - kernel NULL pointer dereference
> > > > - refcount_t: underflow; use-after-free.
> > > >
> > > > Attaching dmesg for now; will bisect and come back with results.
> > >
> > > Looks like something related to iommu. Does it work if you disable it?
> > > (intel_iommu=off in the command line).
> > I thought it could be that, too.
> >
> > The attachment "dmesg-4" from the original email is with iommu parameters.
> > The attachment "dmesg-5" from the original email is with no iommu parameters.
> > Attaching here "dmesg-6" with the iommu explicitly set off like you said.
> >
> > No difference, still broken. Although, with iommu off, there are fewer stack traces.
> >
> > Could it be sysfs-related?
>
> Bisect would probably be the best option to find the culprit commit.
> There are a couple of commits done for pciehp, so reverting them one by one
> may help as well:
>
> 87d0f2a5536f PCI: pciehp: Prevent deadlock on disconnect
> 75fcc0ce72e5 PCI: pciehp: Do not disable interrupt twice on suspend
> b94ec12dfaee PCI: pciehp: Refactor infinite loop in pcie_poll_cmd()
> 157c1062fcd8 PCI: pciehp: Avoid returning prematurely from sysfs requests

You are not going to believe this. The offending commit is in the SOUND
subsystem. I thought I had messed up the bisect when only sound commits
were showing near the end.

And yes, I double checked.
Reverted, compiled, tested that it started working.
Reapplied, compiled, tested that it stopped working.
Twice.

The following is the culprit responsible for the issues:
commit 586bc4aab878efcf672536f0cdec3d04b6990c94
Author: Alex Deucher <alexander.deucher@xxxxxxx>
Date:   Fri Nov 22 16:43:50 2019 -0500

    ALSA: hda/hdmi - fix vgaswitcheroo detection for AMD

It is playing with PCI devices. Clearly they did not consider
hot-removal. I am guessing it is seeing the audio PCI func of the AMD
card in that Thunderbolt eGPU enclosure.

I will collect information, make a bugzilla report and contact the AMD
team. If anybody wants to be cc'd in then let me know. Sorry for
bothering you and Bjorn with something which actually has nothing
directly to do with the PCI subsystem or Thunderbolt.

I strongly hope that the upcoming Intel Xe GPU driver allows for
surprise-removal in Linux without any crashing of kernel or userspace.
The amdgpu and nouveau drivers do not take to surprise removal kindly,
even without the above sound bug applying to AMD.

Kind regards,
Nicholas