Re: [PATCH v4] remoteproc: core: do pm relax when in RPROC_OFFLINE

From: Arnaud POULIQUEN
Date: Fri Oct 28 2022 - 11:32:03 EST


Hi,

On 10/24/22 05:17, Aiqun(Maria) Yu wrote:
> On 10/22/2022 3:34 AM, Mathieu Poirier wrote:
>> On Wed, 19 Oct 2022 at 23:52, Aiqun(Maria) Yu <quic_aiquny@xxxxxxxxxxx> wrote:
>>>
>>> On 10/14/2022 2:03 AM, Mathieu Poirier wrote:
>>>> On Thu, Oct 13, 2022 at 11:34:42AM -0600, Mathieu Poirier wrote:
>>>>> On Thu, Oct 13, 2022 at 09:40:09AM +0800, Aiqun(Maria) Yu wrote:
>>>>>> Hi Mathieu,
>>>>>>
>>>>>> On 10/13/2022 4:43 AM, Mathieu Poirier wrote:
>>>>>>> Please add what has changed from one version to another, either in a cover
>>>>>>> letter or after the "Signed-off-by".  There are many examples on how to
>>>>>>> do that
>>>>>>> on the mailing list.
>>>>>>>
>>>>>> Thx for the information, will take a note and benefit for next time.
>>>>>>
>>>>>>> On Fri, Sep 16, 2022 at 03:12:31PM +0800, Maria Yu wrote:
>>>>>>>> RPROC_OFFLINE state indicate there is no recovery process
>>>>>>>> is in progress and no chance to do the pm_relax.
>>>>>>>> Because when recovering from crash, rproc->lock is held and
>>>>>>>> state is RPROC_CRASHED -> RPROC_OFFLINE -> RPROC_RUNNING,
>>>>>>>> and then unlock rproc->lock.
>>>>>>>
>>>>>>> You are correct - because the lock is held rproc->state should be set to
>>>>>>> RPROC_RUNNING
>>>>>>> when rproc_trigger_recovery() returns.  If that is not the case then
>>>>>>> something
>>>>>>> went wrong.
>>>>>>>
>>>>>>> Function rproc_stop() sets rproc->state to RPROC_OFFLINE just before
>>>>>>> returning,
>>>>>>> so we know the remote processor was stopped.  Therefore if rproc->state
>>>>>>> is set
>>>>>>> to RPROC_OFFLINE something went wrong in either request_firmware() or
>>>>>>> rproc_start().  Either way the remote processor is offline and the system
>>>>>>> probably
>>>>>>> in an unknown/unstable.  As such I don't see how calling pm_relax() can help
>>>>>>> things along.
>>>>>>>
>>>>>> PROC_OFFLINE is possible that rproc_shutdown is triggered and successfully
>>>>>> finished.
>>>>>> Even if it is multi crash rproc_crash_handler_work contention issue, and
>>>>>> last rproc_trigger_recovery bailed out with only
>>>>>> rproc->state==RPROC_OFFLINE, it is still worth to do pm_relax in pair.
>>>>>> Since the subsystem may still can be recovered with customer's next trigger
>>>>>> of rproc_start, and we can make each error out path clean with pm resources.
>>>>>>
>>>>>>> I suggest spending time understanding what leads to the failure when
>>>>>>> recovering
>>>>>>> from a crash and address that problem(s).
>>>>>>>
>>>>>> In current case, the customer's information is that the issue happened when
>>>>>> rproc_shutdown is triggered at similar time. So not an issue from error out
>>>>>> of rproc_trigger_recovery.
>>>>>
>>>>> That is a very important element to consider and should have been mentioned
>>>>> from
>>>>> the beginning.  What I see happening is the following:
>>>>>
>>>>> rproc_report_crash()
>>>>>           pm_stay_awake()
>>>>>           queue_work() // current thread is suspended
>>>>>
>>>>> rproc_shutdown()
>>>>>           rproc_stop()
>>>>>                   rproc->state = RPROC_OFFLINE;
>>>>>
>>>>> rproc_crash_handler_work()
>>>>>           if (rproc->state == RPROC_OFFLINE)
>>>>>                   return // pm_relax() is not called
>>>>>
>>>>> The right way to fix this is to add a pm_relax() in rproc_shutdown() and
>>>>> rproc_detach(), along with a very descriptive comment as to why it is needed.
>>>>
>>>> Thinking about this further there are more ramifications to consider.  Please
>>>> confirm the above scenario is what you are facing.  I will advise on how to
>>>> move
>>>> forward if that is the case.
>>>>
>>> Not sure if the situation is clear or not. So resend the email again.
>>>
>>> The above senario is what customer is facing. crash hanppened while at
>>> the same time shutdown is triggered.
>>
>> Unfortunately this is not enough details to address a problem as
>> complex as this one.
>>
>>> And the device cannto goes to suspend state after that.
>>> the subsystem can still be start normally after this.
>>
>> If the code flow I pasted above reflects the problem at hand, the
>> current patch will not be sufficient to address the issue.  If Arnaud
>> confirms my suspicions we will have to think about a better solution.
>>
>
> Hi Mathiew,
>
> Could you pls have more details of any side effects other then power issue of
> the current senario?
> Why the current patch is not sufficient pls?
>
>
> Have the current senario in details with rproc->lock information in details:
>
> | subsystem crashed interrupt issued      | user trigger shutdown
> | rproc_report_crash()                    |
> |          pm_stay_awake()                |
> |          queue_work()                   |
> |                                         |rproc_shutdown
> |                                         |mutex_lock(&rproc->lock);
> |                                         |rproc_stop()
> |rproc_crash_handler_work()               |rproc->state = RPROC_OFFLINE;
> |                                         |mutex_unlock(&rproc->lock);
> |mutex_lock(&rproc->lock);                |
> |if (rproc->state == RPROC_OFFLINE)       |
> |return // pm_relax() is not called       |rproc_boot
> |mutex_unlock(&rproc->lock);              |
> |                                         |mutex_lock(&rproc->lock);
> |                                         |rproc_start()
> |                                         |mutex_unlock(&rproc->lock);
>
>

Agree with Mathieu, this is not so simple.

Here is my view hoping I haven't missed a point in your discussion or
an other corner cases.

I tried to analyze the issues (in what follows, the term "condition" means
the "if" condition in which Aiqun proposes to add the fix) :

I can see 4 use cases with race condition

1) crash report while already one is treated (rproc_boot_recovery called)
=> not a real use case as if the remote processor is crashed we
should not have a second crash report

2) rproc_stop executed between the queuing of the crash work and the call of
rproc_crash_handler_work
=> rproc->state = RPROC_OFFLINE
=> we enter in the "condition" and the pm_relax has to be called
=> This commit fix should solve this use case

3) rproc_detach executed between the queue of the crash work and the call of
rproc_crash_handler_work
=> rproc->state = RPROC_DETACHED;
=> we don't go in "the condition" and issue because the recovery reattach
to the remote processor
=> but pm_relax is called
=> probably need an extra fix to avoid to re-attach

4) crash report while already one is treated (rproc_attach_recovery called)
this one corresponds to an auto reboot of the remote processor, with a
new crash
=> rproc->state = RPROC_CRASHED or rproc->state = RPROC_DETACHED;
4)a) rproc->state = RPROC_CRASHED if rproc->recovery_disabled = true
=> should call pm_relax if rproc->recovery_disabled = true
=> commit does not work for this use case

4)b) rproc->state = RPROC_DETACHED if recovery fails
=> error case with an unstable state
=> how to differentiate it from the use case 3) ?
=> introduce a RPROC_RECOVERY_FAIL state?


Then pm_stay_awake is called when the crash work is queued.
It seems to me coherent to call the pm_relax in the work handler.



Here is a quick and dirty patch (not tested) that should take into account the
main use cases ( except 1) and 4)b) )

@@ -2009,8 +2009,18 @@ static void rproc_crash_handler_work(struct work_struct *work)

mutex_lock(&rproc->lock);

- if (rproc->state == RPROC_CRASHED || rproc->state == RPROC_OFFLINE) {
+ if (rproc->state == RPROC_CRASHED || rproc->state == RPROC_OFFLINE ||
+ rproc->state == RPROC_DETACHED) {
/* handle only the first crash detected */
+
+ /*
+ * call pm-relax in following use cases:
+ * - the remote processor has been stopped by the user
+ * - the remote processor is detached
+ + - the remote proc has an autonomous reset but recovery_disabled is true.
+ */
+ if(rproc->state != RPROC_CRASHED || rproc->recovery_disabled)
+ pm_relax(rproc->dev.parent);
mutex_unlock(&rproc->lock);
return;
}

Regards,
Arnaud

>>>
>>>>>
>>>>>
>>>>>>> Thanks,
>>>>>>> Mathieu
>>>>>>>
>>>>>>>
>>>>>>>> When the state is in RPROC_OFFLINE it means separate request
>>>>>>>> of rproc_stop was done and no need to hold the wakeup source
>>>>>>>> in crash handler to recover any more.
>>>>>>>>
>>>>>>>> Signed-off-by: Maria Yu <quic_aiquny@xxxxxxxxxxx>
>>>>>>>> ---
>>>>>>>>     drivers/remoteproc/remoteproc_core.c | 11 +++++++++++
>>>>>>>>     1 file changed, 11 insertions(+)
>>>>>>>>
>>>>>>>> diff --git a/drivers/remoteproc/remoteproc_core.c
>>>>>>>> b/drivers/remoteproc/remoteproc_core.c
>>>>>>>> index e5279ed9a8d7..6bc7b8b7d01e 100644
>>>>>>>> --- a/drivers/remoteproc/remoteproc_core.c
>>>>>>>> +++ b/drivers/remoteproc/remoteproc_core.c
>>>>>>>> @@ -1956,6 +1956,17 @@ static void rproc_crash_handler_work(struct
>>>>>>>> work_struct *work)
>>>>>>>>            if (rproc->state == RPROC_CRASHED || rproc->state ==
>>>>>>>> RPROC_OFFLINE) {
>>>>>>>>                    /* handle only the first crash detected */
>>>>>>>>                    mutex_unlock(&rproc->lock);
>>>>>>>> +         /*
>>>>>>>> +          * RPROC_OFFLINE state indicate there is no recovery process
>>>>>>>> +          * is in progress and no chance to have pm_relax in place.
>>>>>>>> +          * Because when recovering from crash, rproc->lock is held and
>>>>>>>> +          * state is RPROC_CRASHED -> RPROC_OFFLINE -> RPROC_RUNNING,
>>>>>>>> +          * and then unlock rproc->lock.
>>>>>>>> +          * RPROC_OFFLINE is only an intermediate state in recovery
>>>>>>>> +          * process.
>>>>>>>> +          */
>>>>>>>> +         if (rproc->state == RPROC_OFFLINE)
>>>>>>>> +                 pm_relax(rproc->dev.parent);
>>>>>>>>                    return;
>>>>>>>>            }
>>>>>>>> --
>>>>>>>> 2.7.4
>>>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Thx and BRs,
>>>>>> Aiqun(Maria) Yu
>>>
>>>
>>> --
>>> Thx and BRs,
>>> Aiqun(Maria) Yu
>
>