Hit a deadlock: between AER and pcieport/pciehp

From: Rajat Jain
Date: Wed Mar 11 2015 - 21:48:39 EST


I have hit a kernel deadlock on my system, which has hierarchical
hot-plug (i.e. we can hot-plug a card that itself may have a hot-plug
slot for another level of hot-pluggable add-on cards). In summary, I
see 2 threads that are each waiting on a lock that is held by the
other one. The locks are the (global) "pci_bus_sem" and
"device->mutex" respectively.

This is the pciehp worker thread, which scans a new card and, on
finding that there is a hotplug slot downstream, tries to register it
through the following call chain:
-> pciehp_enable_slot()
-> pciehp_configure_device()
-> pci_bus_add_devices() discovers all devices including a new
   hotplug slot.
-> ....(etc)...
-> device_attach(dev) (for the newly discovered HP slot /
   downstream port)
-> device_lock(dev) SUCCESSFULLY ACQUIRES dev->mutex for
   the new slot.
-> ....(etc)...
-> ... (goes on)
-> pciehp_probe(dev)
-> __pci_hp_register()
-> pci_create_slot()
-> down_write(&pci_bus_sem); /* Deadlocked */

This is what the stack looks like:
[<ffffffff814e9923>] call_rwsem_down_write_failed+0x13/0x20
[<ffffffff81522d4f>] pci_create_slot+0x3f/0x280
[<ffffffff8152c030>] __pci_hp_register+0x70/0x400
[<ffffffff8152cf49>] pciehp_probe+0x1a9/0x450
[<ffffffff8152865d>] pcie_port_probe_service+0x3d/0x90
[<ffffffff815c45b9>] driver_probe_device+0xf9/0x350
[<ffffffff815c490b>] __device_attach+0x4b/0x60
[<ffffffff815c25a6>] bus_for_each_drv+0x56/0xa0
[<ffffffff815c4468>] device_attach+0xa8/0xc0
[<ffffffff815c38d0>] bus_probe_device+0xb0/0xe0
[<ffffffff815c16ce>] device_add+0x3de/0x560
[<ffffffff815c1a2e>] device_register+0x1e/0x30
[<ffffffff81528aef>] pcie_port_device_register+0x32f/0x510
[<ffffffff81528eb8>] pcie_portdrv_probe+0x48/0x80
[<ffffffff8151b17c>] pci_device_probe+0x9c/0xf0
[<ffffffff815c45b9>] driver_probe_device+0xf9/0x350
[<ffffffff815c490b>] __device_attach+0x4b/0x60
[<ffffffff815c25a6>] bus_for_each_drv+0x56/0xa0
[<ffffffff815c4468>] device_attach+0xa8/0xc0
[<ffffffff815116c1>] pci_bus_add_device+0x41/0x70
[<ffffffff81511a41>] pci_bus_add_devices+0x41/0x90
[<ffffffff81511a6f>] pci_bus_add_devices+0x6f/0x90
[<ffffffff8152e7e2>] pciehp_configure_device+0xa2/0x140
[<ffffffff8152df08>] pciehp_enable_slot+0x188/0x2d0
[<ffffffff8152e3d1>] pciehp_power_thread+0x2b1/0x3c0
[<ffffffff810d92a0>] process_one_work+0x1d0/0x510
[<ffffffff810d9cc1>] worker_thread+0x121/0x440
[<ffffffff810df0bf>] kthread+0xef/0x110
[<ffffffff81a4d8ac>] ret_from_fork+0x7c/0xb0
[<ffffffffffffffff>] 0xffffffffffffffff

While the above thread is doing its work, the root port gets a
completion timeout, and so the AER error recovery worker thread
kicks in to handle that error. As part of that error recovery, since
the completion timeout was detected at the root port, it checks ALL
the devices downstream to see whether they have an error handler that
needs to be called. Here is what happens:

-> aer_isr_one_error()
-> aer_process_err_device()
-> ... (etc)...
-> do_recovery()
-> broadcast_error_message()
-> pci_walk_bus( ..., report_error_detected,...)
   /* effectively for all buses below root port */
-> down_read(&pci_bus_sem); /* SUCCESSFULLY ACQUIRES the semaphore */
-> report_error_detected(dev) /* for the newly detected slot */
-> device_lock(dev) /* Deadlocked */

This is what the stack looks like:
[<ffffffff81529e7e>] report_error_detected+0x4e/0x170 <--- Waiting on
[<ffffffff8151162e>] pci_walk_bus+0x4e/0xa0
[<ffffffff81529b84>] broadcast_error_message+0xc4/0xf0
[<ffffffff81529bed>] do_recovery+0x3d/0x280
[<ffffffff8152a5d0>] aer_isr+0x300/0x3e0
[<ffffffff810d92a0>] process_one_work+0x1d0/0x510
[<ffffffff810d9cc1>] worker_thread+0x121/0x440
[<ffffffff810df0bf>] kthread+0xef/0x110
[<ffffffff81a4d8ac>] ret_from_fork+0x7c/0xb0
[<ffffffffffffffff>] 0xffffffffffffffff
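
The interleaving described by the two call chains above can be replayed in a small user-space model. This is only a sketch and not kernel code: toy_rwsem, toy_mutex, the toy_*_trylock() helpers and replay_reported_interleaving() are hypothetical names invented for illustration, standing in for pci_bus_sem and dev->mutex. Try-lock semantics are used so that the model reports "would block" instead of actually hanging.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for pci_bus_sem (a reader/writer semaphore). */
struct toy_rwsem { int readers; bool writer; };

/* Toy stand-in for dev->mutex. */
struct toy_mutex { bool locked; };

/*
 * Each try-lock returns true on success and false where the real
 * blocking call would have to wait.
 */
static bool toy_down_read_trylock(struct toy_rwsem *sem)
{
    if (sem->writer)
        return false;
    sem->readers++;
    return true;
}

static bool toy_down_write_trylock(struct toy_rwsem *sem)
{
    if (sem->writer || sem->readers > 0)
        return false;
    sem->writer = true;
    return true;
}

static bool toy_mutex_trylock(struct toy_mutex *m)
{
    if (m->locked)
        return false;
    m->locked = true;
    return true;
}

/*
 * Replays the reported interleaving. Returns true if both workers end
 * up waiting on a lock that the other one holds.
 */
bool replay_reported_interleaving(void)
{
    struct toy_rwsem bus_sem = { 0, false };
    struct toy_mutex dev_mutex = { false };

    /* pciehp worker: device_attach() -> device_lock(dev) */
    bool hp_holds_dev = toy_mutex_trylock(&dev_mutex);
    /* AER worker: pci_walk_bus() -> down_read(&pci_bus_sem) */
    bool aer_holds_sem = toy_down_read_trylock(&bus_sem);
    assert(hp_holds_dev && aer_holds_sem); /* both are uncontended */

    /* AER worker: report_error_detected() -> device_lock(dev) */
    bool aer_gets_dev = toy_mutex_trylock(&dev_mutex);
    /* pciehp worker: pci_create_slot() -> down_write(&pci_bus_sem) */
    bool hp_gets_sem = toy_down_write_trylock(&bus_sem);

    return !aer_gets_dev && !hp_gets_sem;
}
```

In this model neither second acquisition can succeed: the reader held by the AER worker blocks the writer, and the mutex held by the pciehp worker blocks the AER worker.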

As a temporary workaround to let me proceed, I was thinking that maybe
I could change report_error_detected() so that completion timeout
errors are not broadcast (do we really have any drivers with AER
handlers that handle such an error? What would the handler do
anyway to fix such an error?).
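
A rough sketch of the kind of check such a workaround might use is given below. It is a standalone illustration under stated assumptions, not a patch: should_broadcast_error() and example_usage() are hypothetical names, and it assumes the AER uncorrectable error status word is available at the point where the broadcast decision is made. The two bit values mirror PCI_ERR_UNC_COMP_TIME and PCI_ERR_UNC_SURPDN from the kernel's pci_regs.h.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Bits of the AER Uncorrectable Error Status register (as in pci_regs.h). */
#define PCI_ERR_UNC_SURPDN    0x00000020u /* Surprise Down */
#define PCI_ERR_UNC_COMP_TIME 0x00004000u /* Completion Timeout */

/*
 * Hypothetical helper: returns true when at least one uncorrectable
 * error other than a completion timeout is reported, i.e. when the
 * error would still be broadcast under the proposed workaround.
 */
bool should_broadcast_error(uint32_t uncor_status)
{
    return (uncor_status & ~PCI_ERR_UNC_COMP_TIME) != 0;
}

/* Usage example: expected decisions for a few status values. */
void example_usage(void)
{
    /* A lone completion timeout is not broadcast. */
    assert(!should_broadcast_error(PCI_ERR_UNC_COMP_TIME));
    /* A completion timeout combined with another error still is. */
    assert(should_broadcast_error(PCI_ERR_UNC_COMP_TIME | PCI_ERR_UNC_SURPDN));
}
```

Filtering on the status bits only avoids the walk for this one error type; it does not remove the underlying lock inversion for other uncorrectable errors.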

But I am not sure what the right solution might look like. I thought
about whether these locks should have been taken in a particular order
to avoid this problem, but looking at the stacks there seems to be no
other way. What do you think is the best way to fix this?

Any help or suggestions in this regard are greatly appreciated.

