Re: xhci_pci & PCIe hotplug crash

From: Pali Rohár
Date: Wed May 05 2021 - 08:33:54 EST


On Wednesday 05 May 2021 14:09:17 Greg KH wrote:
> On Wed, May 05, 2021 at 02:01:17PM +0200, Pali Rohár wrote:
> > Hello!
> >
> > While debugging the pci-aardvark.c driver I got the following synchronous
> > external abort 96000210, which I can reproduce with a VIA XHCI controller
> > when PCIe hot plug support is enabled in the kernel and the PCIe Root Bridge
> > triggers a link down event via the PCIe hot plug interrupt.
> >
> > [ 71.773033] pcieport 0000:00:00.0: pciehp: Slot(0): Link Down
> > [ 71.779120] xhci_hcd 0000:01:00.0: remove, state 4
> > [ 71.784113] usb usb5: USB disconnect, device number 1
> > [ 71.790398] xhci_hcd 0000:01:00.0: USB bus 5 deregistered
> > [ 72.511899] Internal error: synchronous external abort: 96000210 [#1] SMP
> > [ 72.518918] Modules linked in:
> > [ 72.522074] CPU: 1 PID: 988 Comm: irq/53-pciehp Not tainted 5.12.0-dirty #949
> > [ 72.536983] pstate: 60000085 (nZCv daIf -PAN -UAO -TCO BTYPE=--)
> > [ 72.543182] pc : xhci_irq+0x70/0x17b8
> > [ 72.546972] lr : xhci_irq+0x28/0x17b8
> > [ 72.550752] sp : ffffffc012b8bab0
> > [ 72.554167] x29: ffffffc012b8bab0 x28: 00000000000000a0
> > [ 72.559652] x27: 0000000000000060 x26: ffffff8000af2250
> > [ 72.565135] x25: ffffffc0100b0d48 x24: ffffffc0100b0be0
> > [ 72.570620] x23: ffffff80003be028 x22: ffffff8000af229c
> > [ 72.576104] x21: 0000000000000080 x20: ffffff8000af2000
> > [ 72.581587] x19: ffffff8000af2000 x18: 0000000000000004
> > [ 72.587071] x17: 0000000000000000 x16: 0000000000000000
> > [ 72.592553] x15: ffffffc01154cc70 x14: ffffff8001751df8
> > [ 72.598037] x13: 0000000000000000 x12: 0000000000000000
> > [ 72.603519] x11: ffffff8001751da8 x10: ffffffc01154cc78
> > [ 72.609001] x9 : ffffffc01087c238 x8 : 0000000000000000
> > [ 72.614485] x7 : ffffffc01162c4e0 x6 : 0000000000000000
> > [ 72.619967] x5 : fffffffe00085000 x4 : fffffffe00085000
> > [ 72.625451] x3 : 0000000000000000 x2 : 0000000000000001
> > [ 72.630933] x1 : ffffffc0118bd024 x0 : 0000000000000000
> > [ 72.636415] Call trace:
> > [ 72.638936] xhci_irq+0x70/0x17b8
> > [ 72.642360] usb_hcd_irq+0x34/0x50
> > [ 72.645876] usb_hcd_pci_remove+0x78/0x138
> > [ 72.650106] xhci_pci_remove+0x6c/0xa8
> > [ 72.653978] pci_device_remove+0x44/0x108
> > [ 72.658122] device_release_driver_internal+0x110/0x1e0
> > [ 72.663521] device_release_driver+0x1c/0x28
> > [ 72.667931] pci_stop_bus_device+0x84/0xc0
> > [ 72.672162] pci_stop_and_remove_bus_device+0x1c/0x30
> > [ 72.677373] pciehp_unconfigure_device+0x98/0xf8
> > [ 72.682138] pciehp_disable_slot+0x60/0x118
> > [ 72.686457] pciehp_handle_presence_or_link_change+0xec/0x3b0
> > [ 72.692386] pciehp_ist+0x170/0x1a0
> > [ 72.695984] irq_thread_fn+0x30/0x90
> > [ 72.699674] irq_thread+0x13c/0x200
> > [ 72.703271] kthread+0x12c/0x130
> > [ 72.706603] ret_from_fork+0x10/0x1c
> > [ 72.710299] Code: 35ffff83 35002741 f9400f41 91001021 (b9400021)
> > [ 72.716586] ---[ end trace 20ce3e30ff292c93 ]---
> > [ 72.721453] genirq: exiting task "irq/53-pciehp" (988) is an active IRQ thread (irq 53)
> > [ 72.730068] sched: RT throttling activated
> >
> > After that the kernel is in a semi-broken state. Some functionality
> > works, but some (like reboot) does not.
> >
> > I can also reproduce it when I manually inject/fake this link down PCIe
> > hot plug interrupt by setting the corresponding bits in the PCIe Root Status
> > registers, so the pciehp driver thinks that a link down event occurred.
> >
> > I suspect that the issue is in the usb_hcd_pci_remove() function, which
> > calls local_irq_disable()+usb_hcd_irq()+local_irq_enable() but does
> > not take into account that the whole usb_hcd_pci_remove() function may be
> > called from interrupt context.
>
> usb_hcd_pci_remove() should NOT be called from interrupt context.
>
> What is causing that to happen?

PCIe Hot Plug interrupt with PCI_EXP_SLTSTA_DLLSC status bit set.

I can reproduce it by issuing a PCIe Hot Reset on the PCIe controller (via
setpci from userspace), which results in a link down event (as expected),
and the PCIe controller then triggers the link down interrupt.

> No PCI driver can handle that, especially USB ones.
>
> > Can you look at this issue and check whether it is really safe to call
> > usb_hcd_irq() from interrupt context? Or rather, whether it is safe to call
> > functions like pciehp_disable_slot() or device_release_driver() from
> > interrupt context, as can be seen in the call trace?
>
> What is removing devices from an irq?

It can be seen in the call trace above. It is pciehp_disable_slot() followed
by pciehp_unconfigure_device().

> That is wrong, pci hotplug never used to do that, what recently changed?

I really do not know what changed recently. I hope that other people
on the linux-pci ML know the history better.

I just spotted this crash while debugging the PCIe controller driver
pci-aardvark.c, when trying to expose its link down events via the "hot plug"
interrupt and the corresponding link layer state flags.

And because in the whole call trace I see only generic PCIe and USB code
paths without any driver-specific parts, I suspect that this is not a PCIe
controller-specific issue but rather something "wrong" in generic PCIe
(or USB) code. That is why I sent this email, so maybe somebody else
finds something suspicious here.

But there is still a chance that the issue is in the pci-aardvark.c
driver, and that it somehow masked the issue and propagated it into the
generic PCIe hot plug code path.

> thanks,
>
> greg k-h