Re: [PATCH 1/2] bus: mhi: host: pci_generic: Use pci_try_reset_function() to avoid deadlock

From: Manivannan Sadhasivam
Date: Wed Feb 19 2025 - 08:14:06 EST

Next message: Wang Hai: "Re: [PATCH net] tcp: Fix error ts_recent time during three-way handshake"
Previous message: Jon Hunter: "Re: [PATCH 6.12 000/230] 6.12.16-rc1 review"
Next in thread: Johan Hovold: "Re: [PATCH 1/2] bus: mhi: host: pci_generic: Use pci_try_reset_function() to avoid deadlock"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

On Wed, Jan 22, 2025 at 04:11:41PM +0100, Johan Hovold wrote:
> On Wed, Jan 08, 2025 at 07:09:27PM +0530, Manivannan Sadhasivam via B4 Relay wrote:
> > From: Manivannan Sadhasivam <manivannan.sadhasivam@xxxxxxxxxx>
> >
> > There are multiple places from where the recovery work gets scheduled
> > asynchronously. Also, there are multiple places where the caller waits
> > synchronously for the recovery to be completed. One such place is during
> > the PM shutdown() callback.
> >
> > If the device is not alive during recovery_work, it will try to reset the
> > device using pci_reset_function(). This function internally will take the
> > device_lock() first before resetting the device. By this time, if the lock
> > has already been acquired, then recovery_work will get stalled while
> > waiting for the lock. And if the lock was already acquired by the caller
> > which waits for the recovery_work to be completed, it will lead to
> > deadlock.
> >
> > This is what happened on the X1E80100 CRD device when the device died
> > before shutdown() callback. Driver core calls the driver's shutdown()
> > callback while holding the device_lock() leading to deadlock.
> >
> > And this deadlock scenario can occur on other paths as well, like during
> > the PM suspend() callback, where the driver core would hold the
> > device_lock() before calling driver's suspend() callback. And if the
> > recovery_work was already started, it could lead to deadlock. This is also
> > observed on the X1E80100 CRD.
> >
> > So to fix both issues, use pci_try_reset_function() in recovery_work. This
> > function first checks for the availability of the device_lock() before
> > trying to reset the device. If the lock is available, it will acquire it
> > and reset the device. Otherwise, it will return -EAGAIN. If that happens,
> > recovery_work will fail with the error message "Recovery failed" as not
> > much could be done.
>
> I can confirm that this patch (alone) fixes the deadlock on shutdown
> and suspend as expected, but it does leave the system state that blocks
> further suspend (this is with patches that tear down the PCI link).

Yeah, we wouldn't know when not to return an error to unblock the suspend. So
this is acceptable.

> Looks like the modem is also hosed.
>
> [ 267.454616] mhi-pci-generic 0005:01:00.0: mhi_pci_runtime_resume
> [ 267.461165] mhi mhi0: Resuming from non M3 state (SYS ERROR)
> [ 267.467102] mhi-pci-generic 0005:01:00.0: failed to resume device: -22
> [ 267.473968] mhi-pci-generic 0005:01:00.0: device recovery started
> [ 267.477460] mhi-pci-generic 0005:01:00.0: mhi_pci_suspend
> [ 267.480331] mhi-pci-generic 0005:01:00.0: __mhi_power_down
> [ 267.485917] mhi-pci-generic 0005:01:00.0: mhi_pci_runtime_suspend
> [ 267.498339] mhi-pci-generic 0005:01:00.0: __mhi_power_down - pm mutex taken
> [ 267.505618] mhi-pci-generic 0005:01:00.0: __mhi_power_down - pm lock taken
> [ 267.513372] wwan wwan0: port wwan0qcdm0 disconnected
> [ 267.519556] wwan wwan0: port wwan0mbim0 disconnected
> [ 267.525544] wwan wwan0: port wwan0qmi0 disconnected
> [ 267.573773] mhi-pci-generic 0005:01:00.0: __mhi_power_down - returns
> [ 267.591199] mhi mhi0: Requested to power ON
> [ 267.914688] mhi mhi0: Power on setup success
> [ 267.914875] mhi mhi0: Wait for device to enter SBL or Mission mode
> [ 267.919179] mhi-pci-generic 0005:01:00.0: mhi_sync_power_up - wait event timeout_ms = 8000
> [ 276.189399] mhi-pci-generic 0005:01:00.0: mhi_sync_power_up - wait event returns, ret = -110
> [ 276.198240] mhi-pci-generic 0005:01:00.0: __mhi_power_down
> [ 276.203990] mhi-pci-generic 0005:01:00.0: __mhi_power_down - pm mutex taken
> [ 276.211269] mhi-pci-generic 0005:01:00.0: __mhi_power_down - pm lock taken
> [ 276.220024] mhi-pci-generic 0005:01:00.0: __mhi_power_down - returns
> [ 276.226680] mhi-pci-generic 0005:01:00.0: mhi_sync_power_up - returns
> [ 276.233417] mhi-pci-generic 0005:01:00.0: mhi_pci_recovery_work - mhi unprepare after power down
> [ 276.242604] mhi-pci-generic 0005:01:00.0: mhi_pci_recovery_work - pci reset
> [ 276.249881] mhi-pci-generic 0005:01:00.0: Recovery failed
> [ 276.255536] mhi-pci-generic 0005:01:00.0: mhi_pci_recovery_work - returns
> [ 276.328878] qcom-pcie 1bf8000.pci: qcom_pcie_suspend_noirq
> [ 276.368851] qcom-pcie 1c00000.pci: qcom_pcie_suspend_noirq
> [ 276.477900] qcom-pcie 1c00000.pci: Failed to enter L2/L3
> [ 276.483535] qcom-pcie 1c00000.pci: PM: dpm_run_callback(): genpd_suspend_noirq returns -110
> [ 276.492292] qcom-pcie 1c00000.pci: PM: failed to suspend noirq: error -110
> [ 276.500218] qcom-pcie 1bf8000.pci: qcom_pcie_resume_noirq
> [ 276.721401] qcom-pcie 1bf8000.pci: PCIe Gen.4 x4 link up
> [ 276.730639] PM: noirq suspend of devices failed
> [ 276.818943] mhi-pci-generic 0005:01:00.0: mhi_pci_resume
> [ 276.824582] mhi-pci-generic 0005:01:00.0: mhi_pci_runtime_resume
>
> Still better than hanging the machine, but perhaps not much point in
> having recovery code that can't recover the device.
>

Unfortunately, we cannot know if we could not recover the device.

> We obviously need to track down what is causing the modem to fail to
> resume since 6.13-rc1 too.
>

Yeah, this is worth tracing down.

> > Cc: stable@xxxxxxxxxxxxxxx # 5.12
> > Reported-by: Johan Hovold <johan@xxxxxxxxxx>
> > Closes: https://lore.kernel.org/mhi/Z1me8iaK7cwgjL92@xxxxxxxxxxxxxxxxxxxx
>
> Reviewed-by: Johan Hovold <johan+linaro@xxxxxxxxxx>
> Tested-by: Johan Hovold <johan+linaro@xxxxxxxxxx>
>
> And since I've spent way to much time debugging this and provided the
> analysis of the deadlock:
>
> Analyzed-by: Johan Hovold <johan@xxxxxxxxxx>
> Link: https://lore.kernel.org/mhi/Z2KKjWY2mPen6GPL@xxxxxxxxxxxxxxxxxxxx/
>

Sure.

> > Fixes: 7389337f0a78 ("mhi: pci_generic: Add suspend/resume/recovery procedure")
> > Signed-off-by: Manivannan Sadhasivam <manivannan.sadhasivam@xxxxxxxxxx>
> > ---
> > drivers/bus/mhi/host/pci_generic.c | 2 +-
> > 1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/drivers/bus/mhi/host/pci_generic.c b/drivers/bus/mhi/host/pci_generic.c
> > index 07645ce2119a..e92df380c785 100644
> > --- a/drivers/bus/mhi/host/pci_generic.c
> > +++ b/drivers/bus/mhi/host/pci_generic.c
> > @@ -1040,7 +1040,7 @@ static void mhi_pci_recovery_work(struct work_struct *work)
> > err_unprepare:
> > mhi_unprepare_after_power_down(mhi_cntrl);
> > err_try_reset:
> > - if (pci_reset_function(pdev))
> > + if (pci_try_reset_function(pdev))
> > dev_err(&pdev->dev, "Recovery failed\n");
>
> Perhaps you should log the returned error here as a part of this patch
> so we can tell when the recovery code failed due to the device lock
> being held.
>

Makes sense. Added the errno to the log and applied to patch to mhi/next with
your tags. Thanks a lot!

- Mani

--
மணிவண்ணன் சதாசிவம்

Next message: Wang Hai: "Re: [PATCH net] tcp: Fix error ts_recent time during three-way handshake"
Previous message: Jon Hunter: "Re: [PATCH 6.12 000/230] 6.12.16-rc1 review"
Next in thread: Johan Hovold: "Re: [PATCH 1/2] bus: mhi: host: pci_generic: Use pci_try_reset_function() to avoid deadlock"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]