Re: [PATCH 5/6] drm/msm/a6xx: Fix IRQ storm during msm_recovery test
From: Rob Clark
Date: Tue Jun 09 2026 - 10:26:17 EST
On Tue, Jun 9, 2026 at 6:09 AM Akhil P Oommen <akhilpo@xxxxxxxxxxxxxxxx> wrote:
>
> On 6/9/2026 5:50 AM, Rob Clark wrote:
> > On Mon, Jun 8, 2026 at 2:55 PM Akhil P Oommen <akhilpo@xxxxxxxxxxxxxxxx> wrote:
> >>
> >> On 6/5/2026 12:20 PM, Rob Clark wrote:
> >>> On Thu, Jun 4, 2026 at 1:10 PM Akhil P Oommen <akhilpo@xxxxxxxxxxxxxxxx> wrote:
> >>>>
> >>>> From: Jie Zhang <jie.zhang@xxxxxxxxxxxxxxxx>
> >>>>
> >>>> Once a hang is triggered by the msm_recovery test, the gpu error irq
> >>>> remains asserted and triggers an interrupt storm. In the worst case,
> >>>> this IRQ storm lands on the CPU core where the hangcheck timer is
> >>>> scheduled, blocking it from running. This eventually leads to CPU
> >>>> watchdog timeouts.
> >>>>
> >>>> To fix this, mask the gpu error irqs during msm_recovery test and
> >>>> enable them back during the recovery.
> >>>>
> >>>> Fixes: 5edf2750d998 ("drm/msm: Add debugfs to disable hw err handling")
> >>>> Signed-off-by: Jie Zhang <jie.zhang@xxxxxxxxxxxxxxxx>
> >>>> Signed-off-by: Akhil P Oommen <akhilpo@xxxxxxxxxxxxxxxx>
> >>>> ---
> >>>> drivers/gpu/drm/msm/adreno/a5xx_gpu.c | 5 +++++
> >>>> drivers/gpu/drm/msm/adreno/a6xx_gpu.c | 5 ++++-
> >>>> drivers/gpu/drm/msm/adreno/a8xx_gpu.c | 5 ++++-
> >>>> drivers/gpu/drm/msm/msm_gpu.c | 2 ++
> >>>> 4 files changed, 15 insertions(+), 2 deletions(-)
> >>>>
> >>>> diff --git a/drivers/gpu/drm/msm/adreno/a5xx_gpu.c b/drivers/gpu/drm/msm/adreno/a5xx_gpu.c
> >>>> index 2c0bbac43c52..f1df2514c613 100644
> >>>> --- a/drivers/gpu/drm/msm/adreno/a5xx_gpu.c
> >>>> +++ b/drivers/gpu/drm/msm/adreno/a5xx_gpu.c
> >>>> @@ -1275,6 +1275,11 @@ static irqreturn_t a5xx_irq(struct msm_gpu *gpu)
> >>>> status & ~A5XX_RBBM_INT_0_MASK_RBBM_AHB_ERROR);
> >>>>
> >>>> if (priv->disable_err_irq) {
> >>>> + /* Turn off interrupts to avoid interrupt storm */
> >>>> + gpu_write(gpu, REG_A5XX_RBBM_INT_0_MASK,
> >>>> + A5XX_RBBM_INT_0_MASK_CP_CACHE_FLUSH_TS |
> >>>> + A5XX_RBBM_INT_0_MASK_CP_SW);
> >>>> +
> >>>> status &= A5XX_RBBM_INT_0_MASK_CP_CACHE_FLUSH_TS |
> >>>> A5XX_RBBM_INT_0_MASK_CP_SW;
> >>>> }
> >>>> diff --git a/drivers/gpu/drm/msm/adreno/a6xx_gpu.c b/drivers/gpu/drm/msm/adreno/a6xx_gpu.c
> >>>> index 8b3bb2fd433b..9a4f9d0e1780 100644
> >>>> --- a/drivers/gpu/drm/msm/adreno/a6xx_gpu.c
> >>>> +++ b/drivers/gpu/drm/msm/adreno/a6xx_gpu.c
> >>>> @@ -1911,8 +1911,11 @@ static irqreturn_t a6xx_irq(struct msm_gpu *gpu)
> >>>>
> >>>> gpu_write(gpu, REG_A6XX_RBBM_INT_CLEAR_CMD, status);
> >>>>
> >>>> - if (priv->disable_err_irq)
> >>>> + if (priv->disable_err_irq) {
> >>>> + /* Turn off interrupts to avoid interrupt storm */
> >>>> + gpu_write(gpu, REG_A6XX_RBBM_INT_0_MASK, A6XX_RBBM_INT_0_MASK_CP_CACHE_FLUSH_TS);
> >>>> status &= A6XX_RBBM_INT_0_MASK_CP_CACHE_FLUSH_TS;
> >>>> + }
> >>>>
> >>>> if (status & A6XX_RBBM_INT_0_MASK_RBBM_HANG_DETECT)
> >>>> a6xx_fault_detect_irq(gpu);
> >>>> diff --git a/drivers/gpu/drm/msm/adreno/a8xx_gpu.c b/drivers/gpu/drm/msm/adreno/a8xx_gpu.c
> >>>> index 9e44fd1ae634..0f6fd35bd587 100644
> >>>> --- a/drivers/gpu/drm/msm/adreno/a8xx_gpu.c
> >>>> +++ b/drivers/gpu/drm/msm/adreno/a8xx_gpu.c
> >>>> @@ -1211,8 +1211,11 @@ irqreturn_t a8xx_irq(struct msm_gpu *gpu)
> >>>>
> >>>> gpu_write(gpu, REG_A8XX_RBBM_INT_CLEAR_CMD, status);
> >>>>
> >>>> - if (priv->disable_err_irq)
> >>>> + if (priv->disable_err_irq) {
> >>>> + /* Turn off interrupts to avoid interrupt storm */
> >>>> + gpu_write(gpu, REG_A8XX_RBBM_INT_0_MASK, A6XX_RBBM_INT_0_MASK_CP_CACHE_FLUSH_TS);
> >>>> status &= A6XX_RBBM_INT_0_MASK_CP_CACHE_FLUSH_TS;
> >>>> + }
> >>>>
> >>>> if (status & A6XX_RBBM_INT_0_MASK_RBBM_HANG_DETECT)
> >>>> a8xx_fault_detect_irq(gpu);
> >>>> diff --git a/drivers/gpu/drm/msm/msm_gpu.c b/drivers/gpu/drm/msm/msm_gpu.c
> >>>> index 9ac7740a87f0..48ac51f4119b 100644
> >>>> --- a/drivers/gpu/drm/msm/msm_gpu.c
> >>>> +++ b/drivers/gpu/drm/msm/msm_gpu.c
> >>>> @@ -552,6 +552,8 @@ static void recover_worker(struct kthread_work *work)
> >>>> msm_update_fence(ring->fctx, fence);
> >>>> }
> >>>>
> >>>> + priv->disable_err_irq = false;
> >>>
> >>> Ok, so we rely on recovery to re-enable the error irqs.. that is
> >>> probably ok, given the intended purpose of the debugfs file. And,
> >>> well, it is debugfs. But why do we clear disable_err_irq here?
> >>
> >> Now that we are updating the IRQ mask register which won't reset until
> >> there is a gpu suspend, its side effect will be felt even after
> >> userspace deasserts the debugfs knob, potentially into the next
> >> testcase. This is different from the older behavior. So, I felt it would
> >> be better to reset this flag during the recovery, considering
> >> msm_recovery is the only user of this knob, afaiu.
> >
> > Hmm... maybe debugfs writes should just immediately update the irq
> > mask (if the gpu is powered)?
>
> That needs some plumbing in adreno func table to program the register.
> We can do that if you prefer that, but is it an overkill for this usecase?
Yeah, that is why I didn't suggest it earlier. But stealth re-enable
of err irqs makes the behavior harder to reason about, which I
dislike. I guess it works now because msm_recovery only does a single
hang-test w/ hw error irq's disabled. But from userspace PoV it seems
natural to expect error irq's to remain disabled until writing debugfs
again.
Since it is about debugfs and a single igt test, maybe it is just best
to document that this is how it works.
Or, would it be reasonable just to update the irq mask in gpu->submit()?
BR,
-R
> -Akhil
>
> >
> > BR,
> > -R
> >
> >> I should have explicitly called out this new behavior of disable_err_irq
> >> in the commit text, but I forgot.
> >>
> >> -Akhil.
> >>
> >>>
> >>> BR,
> >>> -R
> >>>
> >>>> +
> >>>> gpu->funcs->recover(gpu);
> >>>>
> >>>> /* retire completed submits, plus the one that hung: */
> >>>>
> >>>> --
> >>>> 2.51.0
> >>>>
> >>
>