Re: [PATCH v4] remoteproc: core: do pm relax when in RPROC_OFFLINE

From: Aiqun(Maria) Yu
Date: Sun Oct 23 2022 - 23:21:15 EST

Next message: ruansy.fnst@xxxxxxxxxxx: "Re: [PATCH] xfs: fail dax mount if reflink is enabled on a partition"
Previous message: Paul E. McKenney: "Re: [PATCH rcu 13/14] workqueue: Make queue_rcu_work() use call_rcu_flush()"
In reply to: Mathieu Poirier: "Re: [PATCH v4] remoteproc: core: do pm relax when in RPROC_OFFLINE"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

On 10/22/2022 3:34 AM, Mathieu Poirier wrote:

On Wed, 19 Oct 2022 at 23:52, Aiqun(Maria) Yu <quic_aiquny@xxxxxxxxxxx> wrote:

On 10/14/2022 2:03 AM, Mathieu Poirier wrote:

On Thu, Oct 13, 2022 at 11:34:42AM -0600, Mathieu Poirier wrote:

On Thu, Oct 13, 2022 at 09:40:09AM +0800, Aiqun(Maria) Yu wrote:

Hi Mathieu,

On 10/13/2022 4:43 AM, Mathieu Poirier wrote:

Please add what has changed from one version to another, either in a cover
letter or after the "Signed-off-by". There are many examples on how to do that
on the mailing list.

Thx for the information, will take a note and benefit for next time.

On Fri, Sep 16, 2022 at 03:12:31PM +0800, Maria Yu wrote:

RPROC_OFFLINE state indicate there is no recovery process
is in progress and no chance to do the pm_relax.
Because when recovering from crash, rproc->lock is held and
state is RPROC_CRASHED -> RPROC_OFFLINE -> RPROC_RUNNING,
and then unlock rproc->lock.

You are correct - because the lock is held rproc->state should be set to RPROC_RUNNING
when rproc_trigger_recovery() returns. If that is not the case then something
went wrong.

Function rproc_stop() sets rproc->state to RPROC_OFFLINE just before returning,
so we know the remote processor was stopped. Therefore if rproc->state is set
to RPROC_OFFLINE something went wrong in either request_firmware() or
rproc_start(). Either way the remote processor is offline and the system probably
in an unknown/unstable. As such I don't see how calling pm_relax() can help
things along.

PROC_OFFLINE is possible that rproc_shutdown is triggered and successfully
finished.
Even if it is multi crash rproc_crash_handler_work contention issue, and
last rproc_trigger_recovery bailed out with only
rproc->state==RPROC_OFFLINE, it is still worth to do pm_relax in pair.
Since the subsystem may still can be recovered with customer's next trigger
of rproc_start, and we can make each error out path clean with pm resources.

I suggest spending time understanding what leads to the failure when recovering
from a crash and address that problem(s).

In current case, the customer's information is that the issue happened when
rproc_shutdown is triggered at similar time. So not an issue from error out
of rproc_trigger_recovery.

That is a very important element to consider and should have been mentioned from
the beginning. What I see happening is the following:

rproc_report_crash()
pm_stay_awake()
queue_work() // current thread is suspended

rproc_shutdown()
rproc_stop()
rproc->state = RPROC_OFFLINE;

rproc_crash_handler_work()
if (rproc->state == RPROC_OFFLINE)
return // pm_relax() is not called

The right way to fix this is to add a pm_relax() in rproc_shutdown() and
rproc_detach(), along with a very descriptive comment as to why it is needed.

Thinking about this further there are more ramifications to consider. Please
confirm the above scenario is what you are facing. I will advise on how to move
forward if that is the case.

Not sure if the situation is clear or not. So resend the email again.

The above senario is what customer is facing. crash hanppened while at
the same time shutdown is triggered.

Unfortunately this is not enough details to address a problem as
complex as this one.

And the device cannto goes to suspend state after that.
the subsystem can still be start normally after this.

If the code flow I pasted above reflects the problem at hand, the
current patch will not be sufficient to address the issue. If Arnaud
confirms my suspicions we will have to think about a better solution.

Thanks,
Mathieu

When the state is in RPROC_OFFLINE it means separate request
of rproc_stop was done and no need to hold the wakeup source
in crash handler to recover any more.

Signed-off-by: Maria Yu <quic_aiquny@xxxxxxxxxxx>
---
drivers/remoteproc/remoteproc_core.c | 11 +++++++++++
1 file changed, 11 insertions(+)

diff --git a/drivers/remoteproc/remoteproc_core.c b/drivers/remoteproc/remoteproc_core.c
index e5279ed9a8d7..6bc7b8b7d01e 100644
--- a/drivers/remoteproc/remoteproc_core.c
+++ b/drivers/remoteproc/remoteproc_core.c
@@ -1956,6 +1956,17 @@ static void rproc_crash_handler_work(struct work_struct *work)
if (rproc->state == RPROC_CRASHED || rproc->state == RPROC_OFFLINE) {
/* handle only the first crash detected */
mutex_unlock(&rproc->lock);
+ /*
+ * RPROC_OFFLINE state indicate there is no recovery process
+ * is in progress and no chance to have pm_relax in place.
+ * Because when recovering from crash, rproc->lock is held and
+ * state is RPROC_CRASHED -> RPROC_OFFLINE -> RPROC_RUNNING,
+ * and then unlock rproc->lock.
+ * RPROC_OFFLINE is only an intermediate state in recovery
+ * process.
+ */
+ if (rproc->state == RPROC_OFFLINE)
+ pm_relax(rproc->dev.parent);
return;
}
--
2.7.4

--
Thx and BRs,
Aiqun(Maria) Yu

--
Thx and BRs,
Aiqun(Maria) Yu

--
Thx and BRs,
Aiqun(Maria) Yu

Next message: ruansy.fnst@xxxxxxxxxxx: "Re: [PATCH] xfs: fail dax mount if reflink is enabled on a partition"
Previous message: Paul E. McKenney: "Re: [PATCH rcu 13/14] workqueue: Make queue_rcu_work() use call_rcu_flush()"
In reply to: Mathieu Poirier: "Re: [PATCH v4] remoteproc: core: do pm relax when in RPROC_OFFLINE"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]