Re: [PATCH v2 3/3] s390/pci: Fix circular/recursive deadlocks in PCI-bus and -device release
From: Niklas Schnelle
Date: Thu Mar 12 2026 - 17:07:49 EST
On Wed, 2026-03-11 at 14:27 +0100, Benjamin Block wrote:
> When removing PCI device or PCI bus objects there are a couple of
> call-chains where it is possible that the kernel runs into a circular or
> recursive deadlock involving taking the central `pci_rescan_remove_lock`.
Now that this depends on Ionut's patch at[0] and exploits the fact that
the rescan/remove lock can be taken recursively we may want to adjust
and thereby shorten this description to take this into account dropping
the recursive locking as deadlock reason.
>
> For example:
>
> (A) Thread α receives a CRW event notifying the kernel that a PCI
Nit: Other PCI related code calls these "PCI events" even though
technically they use the Channel Report Word (CRW) mechanism and thus
are CRW events.
> virtual function has been moved into Reserved state, and so the PCI
> subsystem will try to remove that PCI function. The call-chain for that
> looks like this:
> __zpci_event_availability()
> -> zpci_zdev_put() # will lock(zpci_add_remove_lock),
> # and lock(zpci_list_lock)
> -> zpci_release_device() # will unlock(zpci_list_lock)
> -> zpci_cleanup_bus_resources() # will lock(pci_rescan_remove_lock)
>
> Thread β is triggered by userspace writing 0 into the SysFS attribute
> `sriov_numvfs` of the parent PCI physical function of the same function
> we just try to remove. This will also try to release the PCI virtual
> function; but this time the call-chain looks like this:
> sriov_numvfs_store() # will lock(pci_rescan_remove_lock)
With Ionut's patch a prerequisite the above taking of
pci_rescan_remove_lock now happens in sriov_del_vfs() with a bit
shorter chain but same result.
> -> ... (deep chain)
> -> pci_release_dev()
> -> pcibios_release_device()
> -> zpci_zdev_put() # will lock(zpci_add_remove_lock)
>
> If thread α and β coincide, this will result in a cyclic deadlock.
>
> (B) Consider thread β from above; if the thread was to follow the same
> outlined call-chain for thread α, and not fall into the cyclic deadlock,
> it would recursive deadlock:
> sriov_numvfs_store() # will lock(pci_rescan_remove_lock)
> -> ... (deep chain)
> -> pci_release_dev()
> -> pcibios_release_device()
> -> zpci_zdev_put() # will lock(zpci_add_remove_lock),
> # and lock(zpci_list_lock)
> -> zpci_release_device() # will unlock(zpci_list_lock)
> -> zpci_cleanup_bus_resources() # will lock(pci_rescan_remove_lock)
Since Ionut's patch must come before this, this behavior itself would
be okay (see argument above). Maybve just drop this part?
>
> (C) And even in case `pci_rescan_remove_lock` was removed from
> zpci_cleanup_bus_resources(), the resulting call-chain would recursive
> deadlock when it tries to release the associated PCI bus:
> sriov_numvfs_store() # will lock(pci_rescan_remove_lock)
Same as above after Ionut's change the lock is taken in sriov_del_vfs()
> -> ... (deep chain)
> -> pci_release_dev()
> -> pcibios_release_device()
> -> zpci_zdev_put() # will lock(zpci_add_remove_lock),
> # and lock(zpci_list_lock)
> -> zpci_release_device() # will unlock(zpci_list_lock)
> -> zpci_bus_device_unregister()
> -> zpci_bus_put() # will lock(zbus_list_lock)
> -> zpci_bus_release() # will unlock(zbus_list_lock),
> # will lock(pci_rescan_remove_lock)
Same argument as above for dropping the recursive deadlock part but the
below cyclic deadlock potential remains.
>
> It can also easily be seen that scenario (C) offers the same risk as (A)
> for a cyclic deadlock in cases where `pci_rescan_remove_lock` is first
> locked, and the PCI bus released under its protection.
I think this part should spell out which two locks create the cycle
>
> `pci_rescan_remove_lock` has to be and is taken at a "high level" in
> most call-chains since it is intended to protect/mutual exclude all
> rescan and/or removal actions taken in the PCI subsystem. So to prevent
> the outlined deadlock scenarios above remove it instead from the "low
> level" release function for both the PCI device and PCI bus objects.
>
> Instead, lock `pci_rescan_remove_lock` in all call-chains leading to
> those release functions:
> * initialization of the PCI subsystem;
> * processing of availability events (CRWs) for PCI functions;
> * processing of error events (CRWs) for PCI functions;
> * architecture specific release PCI device implementation.
>
> Additionally, remove `pci_rescan_remove_lock` from zpci_bus_scan_bus()
> since it is now always already taken when called.
Nit: I'd add an explicit statement that the locking order between
pci_rescan_remove_lock and zpci_add_remove_lock is thus reversed with
pci_rescan_remove_lockk always taken first then zpci_add_remove_lock.
Still it is understandable without it too.
>
> Lastly, document the new locking expectations after these changes. Add
> sparse and lockdep annotations to functions that previously locked
> `pci_rescan_remove_lock` explicitly, making sure the lock is now
> already held when called. Additionally also add the annotations to
> zpci_zdev_put() and zpci_bus_put() to make sure that every function that
> potentially drops the last reference already holds the lock to prevent
> surprises.
>
> Fixes: 05bc1be6db4b2 ("s390/pci: create zPCI bus")
> Fixes: ab909509850b2 ("PCI: s390: Fix use-after-free of PCI resources with per-function hotplug")
> Signed-off-by: Benjamin Block <bblock@xxxxxxxxxxxxx>
> ---
> arch/s390/pci/pci.c | 11 ++++++++---
> arch/s390/pci/pci_bus.c | 15 ++++++++-------
> arch/s390/pci/pci_event.c | 28 +++++++++++++++++++---------
> arch/s390/pci/pci_iov.c | 3 +--
> arch/s390/pci/pci_sysfs.c | 9 +++------
> 5 files changed, 39 insertions(+), 27 deletions(-)
>
I had not previously looked at your commit message from the point of
view when it is ordered aferr Ionut's patch[0], which it has to for the
release case. So I ended up with a few comments now even though I
previously read it multiple times and was just very happy with the nice
and readable call chain and locking order explanations.
Either way the code itself looks good to me and I also did quite a bit
of testing with this and the previous versions. Trying hot(un)plug and
SR-IOV with various PCI devices and also just ran this and the previous
version on several test systems for a few days doing other testing and
development work.
I also put quite a bit of effort into trying to modify this and/or
other s390 PCI code to not require the reentrant behavior of the
pci_rescan_remove_lock but failed. The really tricky part to me is how
e.g. a driver can hold on to a struct pci_dev even after it has been
removed while we have to keep both the struct zpci_dev and struct
zpci_bus and associated struct pci_bus exactly until the last
pci_dev_put(). And then needing to remove the PCI bus via
zpci_bus_release() we need to have the pci_rescan_remove lock but now
this can happen, in principle, at any pci_dev_put() and that of course
sometimes happens when we already have the pci_rescan_remove_lock and
sometimes without it.
So feel free to add my:
Reviewed-by: Niklas Schnelle <schnelle@xxxxxxxxxxxxx>
Tested-by: Niklas Schnelle <schnelle@xxxxxxxxxxxxx>
Thanks,
Niklas
[0]https://lore.kernel.org/lkml/20260310074303.17480-2-ionut.nechita@xxxxxxxxxxxxx/