Re: [PATCH 1/1] remoteproc: use freezable workqueue for crash notifications

From: Bjorn Andersson
Date: Thu May 27 2021 - 23:55:20 EST


On Wed 19 May 18:44 CDT 2021, Alex Elder wrote:

> When a remoteproc has crashed, rproc_report_crash() is called to
> handle whatever recovery is desired. This can happen at almost any
> time, often triggered by an interrupt, though it can also be
> initiated by a write to debugfs file remoteproc/remoteproc*/crash.
>
> When a crash is reported, the crash handler worker is scheduled to
> run (rproc_crash_handler_work()). One thing that worker does is
> call rproc_trigger_recovery(), which calls rproc_stop(). That calls
> the ->stop method for any remoteproc subdevices before making the
> remote processor go offline.
>
> The Q6V5 modem remoteproc driver implements an SSR subdevice that
> notifies registered drivers when the modem changes operational state
> (prepare, started, stop/crash, unprepared). The IPA driver
> registers to receive these notifications.
>
> With that as context, I'll now describe the problem.
>
> There was a situation in which buggy modem firmware led to a modem
> crash very soon after system (AP) resume had begun. The crash caused
> a remoteproc SSR crash notification to be sent to the IPA driver.
> The problem was that, although system resume had begun, it had not
> yet completed, and the IPA driver was still in a suspended state.
>
> This scenario could happen to any driver that registers for these
> SSR notifications, because they are delivered without knowledge of
> the (suspend) state of registered recipient drivers.
>
> This patch offers a simple fix for this, by having the crash
> handling worker function run on the system freezable workqueue.
> This workqueue does not operate if user space is frozen (for
> suspend). As a result, the SSR subdevice only delivers its
> crash notification when the system is fully operational (i.e.,
> neither suspended nor in suspend/resume transition).
>

This makes sense to me; both that it ensures that we spend our resources
on the actual system resume and that it avoids surprises from this
happening while the system still is in a funky state...

Reviewed-by: Bjorn Andersson <bjorn.andersson@xxxxxxxxxx>

But it would be nice to get some input from other users of the
framework.

Regards,
Bjorn

> Signed-off-by: Alex Elder <elder@xxxxxxxxxx>
> ---
> drivers/remoteproc/remoteproc_core.c | 4 ++--
> 1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/remoteproc/remoteproc_core.c b/drivers/remoteproc/remoteproc_core.c
> index 39cf44cb08035..6bedf2d2af239 100644
> --- a/drivers/remoteproc/remoteproc_core.c
> +++ b/drivers/remoteproc/remoteproc_core.c
> @@ -2724,8 +2724,8 @@ void rproc_report_crash(struct rproc *rproc, enum rproc_crash_type type)
> dev_err(&rproc->dev, "crash detected in %s: type %s\n",
> rproc->name, rproc_crash_to_string(type));
>
> - /* create a new task to handle the error */
> - schedule_work(&rproc->crash_handler);
> + /* Have a worker handle the error; ensure system is not suspended */
> + queue_work(system_freezable_wq, &rproc->crash_handler);
> }
> EXPORT_SYMBOL(rproc_report_crash);
>
> --
> 2.27.0
>