Re: Re: [PATCH] drm/panfrost: fix runtime pm imbalance on error

From: dinghao . liu
Date: Thu May 21 2020 - 03:00:44 EST


Hi Steve,

There are two bail-out points in panfrost_job_hw_submit(): one is the
error path taken when pm_runtime_get_sync() fails, the other is the
error path taken when the WARN_ON() in the if statement fires. The PM
imbalance fixed in this patch is between these two paths: the WARN_ON()
path drops the usage count, while the pm_runtime_get_sync() path does
not. I think the caller of panfrost_job_hw_submit() cannot tell from
outside the function which of the two paths was taken.

panfrost_job_timedout() calls pm_runtime_put_noidle() for every job it
finds, but every job is added to pfdev->jobs just before
panfrost_job_hw_submit() is called. Therefore I think the imbalance
still exists. However, I'm not sure whether we should add a
pm_runtime_put on the error path after pm_runtime_get_sync(), or remove
the pm_runtime_put on the error path after WARN_ON().
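
To make the point concrete, here is a minimal userspace sketch of the two
paths. It is a toy model, not driver code: struct toy_dev, toy_get_sync(),
toy_put_noidle(), toy_hw_submit() and toy_job_timedout() are all
hypothetical names, and the plain int counter (which may go negative here)
only stands in for the runtime PM usage count.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for a device with a runtime PM usage counter. */
struct toy_dev {
	int usage_count;
	bool resume_fails;	/* force the get_sync-style call to fail */
};

enum submit_result { SUBMIT_OK, SUBMIT_GET_FAILED, SUBMIT_WARN_BAILOUT };

/* Like pm_runtime_get_sync(): the count is incremented even on failure. */
static int toy_get_sync(struct toy_dev *dev)
{
	dev->usage_count++;
	return dev->resume_fails ? -1 : 0;
}

static void toy_put_noidle(struct toy_dev *dev)
{
	dev->usage_count--;
}

/* Models the two bail-out points discussed above. */
static enum submit_result toy_hw_submit(struct toy_dev *dev, bool warn_hit)
{
	if (toy_get_sync(dev) < 0)
		return SUBMIT_GET_FAILED;	/* reference is still held */

	if (warn_hit) {
		toy_put_noidle(dev);		/* reference is dropped */
		return SUBMIT_WARN_BAILOUT;
	}

	return SUBMIT_OK;
}

/* Models the timeout handler: one put for every job it finds. */
static void toy_job_timedout(struct toy_dev *dev, int jobs)
{
	int i;

	assert(jobs >= 0);
	for (i = 0; i < jobs; i++)
		toy_put_noidle(dev);
}
```

In this model, a job that fails in toy_get_sync() ends at a count of 0
after the timeout handler runs, while a job that hits the WARN_ON-style
bail-out ends at -1. The real function returns void, so the timeout handler
has nothing like the enum above to tell the two cases apart.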

As for the problem with panfrost_devfreq_record_busy(), this may be a
new bug that requires an independent patch to fix.
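
For reference, the busy/idle accounting concern can be sketched with
another toy model. This assumes the devfreq state behaves like a simple
counter that record_busy increments and record_idle decrements; struct
toy_devfreq and the toy_* helpers are hypothetical, and the busy_first and
idle_per_job flags select between the current ordering and the ordering in
your untested patch.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for the devfreq busy counter; not the driver's real state. */
struct toy_devfreq {
	int busy_count;
};

static void toy_record_busy(struct toy_devfreq *d)
{
	d->busy_count++;
}

static void toy_record_idle(struct toy_devfreq *d)
{
	d->busy_count--;
}

/*
 * busy_first == false: busy is recorded only after the bail-out points.
 * busy_first == true:  busy is recorded before anything can fail.
 */
static void toy_submit(struct toy_devfreq *d, bool busy_first, bool bail_out)
{
	if (busy_first)
		toy_record_busy(d);

	if (bail_out)
		return;

	if (!busy_first)
		toy_record_busy(d);
}

/*
 * idle_per_job == false: a single idle call no matter how many jobs.
 * idle_per_job == true:  one idle call for each job found.
 */
static void toy_timeout(struct toy_devfreq *d, int jobs, bool idle_per_job)
{
	int i;

	assert(jobs >= 0);
	if (!idle_per_job) {
		toy_record_idle(d);
		return;
	}

	for (i = 0; i < jobs; i++)
		toy_record_idle(d);
}
```

With the current ordering, one job that bails out early leaves the toy
counter at -1 after the timeout, and two successfully submitted jobs leave
it at 1. With busy recorded first and idle recorded per job, both cases end
at 0.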

Regards,
Dinghao


> On 20/05/2020 12:05, Dinghao Liu wrote:
> > pm_runtime_get_sync() increments the runtime PM usage counter even
> > when the call returns an error code. Thus a pairing decrement is
> > needed on the error handling path to keep the counter balanced.
> >
> > Signed-off-by: Dinghao Liu <dinghao.liu@xxxxxxxxxx>
>
> Actually I think we have the opposite problem. To be honest we don't
> handle this situation very well. By the time panfrost_job_hw_submit() is
> called the job has already been added to the pfdev->jobs array, so it's
> considered submitted even if it never actually lands on the hardware. So
> in the case of this function bailing out early we will then (eventually)
> hit a timeout and trigger a GPU reset.
>
> panfrost_job_timedout() iterates through the pfdev->jobs array and calls
> pm_runtime_put_noidle() for each job it finds. So there's no imbalance
> here that I can see.
>
> Have you actually observed the situation where pm_runtime_get_sync()
> returns a failure?
>
> HOWEVER, it appears that by bailing out early the call to
> panfrost_devfreq_record_busy() is never made, which as far as I can see
> means that there may be an extra call to panfrost_devfreq_record_idle()
> when the jobs have timed out. Which could underflow the counter.
>
> But equally looking at panfrost_job_timedout(), we only call
> panfrost_devfreq_record_idle() *once* even though multiple jobs might be
> processed.
>
> There's a completely untested patch below which in theory should fix that...
>
> Steve
>
> ----8<---
> diff --git a/drivers/gpu/drm/panfrost/panfrost_job.c b/drivers/gpu/drm/panfrost/panfrost_job.c
> index 7914b1570841..f9519afca29d 100644
> --- a/drivers/gpu/drm/panfrost/panfrost_job.c
> +++ b/drivers/gpu/drm/panfrost/panfrost_job.c
> @@ -145,6 +145,8 @@ static void panfrost_job_hw_submit(struct panfrost_job *job, int js)
>  	u64 jc_head = job->jc;
>  	int ret;
>
> +	panfrost_devfreq_record_busy(pfdev);
> +
>  	ret = pm_runtime_get_sync(pfdev->dev);
>  	if (ret < 0)
>  		return;
> @@ -155,7 +157,6 @@ static void panfrost_job_hw_submit(struct panfrost_job *job, int js)
>  	}
>
>  	cfg = panfrost_mmu_as_get(pfdev, &job->file_priv->mmu);
> -	panfrost_devfreq_record_busy(pfdev);
>
>  	job_write(pfdev, JS_HEAD_NEXT_LO(js), jc_head & 0xFFFFFFFF);
>  	job_write(pfdev, JS_HEAD_NEXT_HI(js), jc_head >> 32);
> @@ -410,12 +411,12 @@ static void panfrost_job_timedout(struct drm_sched_job *sched_job)
>  	for (i = 0; i < NUM_JOB_SLOTS; i++) {
>  		if (pfdev->jobs[i]) {
>  			pm_runtime_put_noidle(pfdev->dev);
> +			panfrost_devfreq_record_idle(pfdev);
>  			pfdev->jobs[i] = NULL;
>  		}
>  	}
>  	spin_unlock_irqrestore(&pfdev->js->job_lock, flags);
>
> -	panfrost_devfreq_record_idle(pfdev);
>  	panfrost_device_reset(pfdev);
>
>  	for (i = 0; i < NUM_JOB_SLOTS; i++)