Re: [PATCH] usb: dwc3: Potential fix of possible dwc3 interrupt storm

From: Thinh Nguyen
Date: Tue Sep 10 2024 - 20:24:37 EST


On Tue, Sep 10, 2024, Selvarasu Ganesan wrote:
>
> On 9/7/2024 6:09 AM, Thinh Nguyen wrote:
> > On Sat, Sep 07, 2024, Selvarasu Ganesan wrote:
> >> Hi Thinh,
> >>
> >> I ran the code you recommended on our testing environment and was able
> >> to reproduce the issue one time.
> >>
> >> When evt->flags contains DWC3_EVENT_PENDING, I've included the following
> >> debugging information: I added this debug message at the start of
> >> dwc3_event_buffers_cleanup and dwc3_event_buffers_setup functions in
> >> during suspend and resume.
> >>
> >> The results were quite interesting . I'm curious to understand how
> >> evt->flags is set to DWC3_EVENT_PENDING, and along with DWC3_GEVNTSIZ is
> >> equal to 0x1000 during the suspend.
> > That is indeed strange.
> >
> >> Its means that the previous bottom-half handler prior to suspend might
> >> still be executing in the middle of the process.
> >>
> >> Could you please give your suggestions here? And let me know if anything
> >> want to test or additional details are required.
> >>
> >>
> >> ##DBG: dwc3_event_buffers_cleanup:
> >>  evt->length    :0x1000
> >>  evt->lpos      :0x20c
> >>  evt->count     :0x0
> >>  evt->flags     :0x1 // This is Unexpected if DWC3_GEVNTSIZ(0)(0xc408):
> >> 0x00001000. Its means that previous bottom-half handler may be still
> >> running in middle
> > Perhaps.
> >
> > But I doubt that's the case since it shouldn't take that long for the
> > bottom-half to be completed before the next resume yet the flag is still
> > set.
> >
> >>  DWC3_GEVNTSIZ(0)(0xc408)       : 0x00001000
> >>  DWC3_GEVNTCOUNT(0)(0xc40c)     : 0x00000000
> >>  DWC3_DCFG(0xc700)              : 0x00e008a8
> >>  DWC3_DCTL(0xc704)              : 0x0cf00a00
> >>  DWC3_DEVTEN(0xc708)            : 0x00000000
> >>  DWC3_DSTS(0xc70c)              : 0x00d20cd1
> >>
> > The controller status is halted. So there's no problem with
> > soft-disconnect. For the interrupt mask in GEVNTSIZ to be cleared,
> > that likely means that the bottom-half had probably completed.
>
> Agree, But I am worrying on If the bottom-half is completed, then
> DWC3_EVENT_PENDING must be cleared in evt->flags.
> Is there any possibility of a CPU reordering issue when updating
> evt->flags in the bottom-half handler?.
> Should I try with wmb() when writing to evt->flags?

Assuming that the problem occurs after the bottom-half completed, there
should be implicit memory barrier. The memory operation should complete
before the release from spin_unlock complete. I don't think wmb() will
help.

> >
> >> ##DBG: dwc3_event_buffers_setup:
> >>  evt->length    :0x1000
> >>  evt->lpos      :0x20c
> > They fact that evt->lpos did not get updated tells me that there's
> > something wrong with memory access to your platform during suspend and
> > resume.
>
> Are you expecting the evt->lpos value to be zero here? If so, this is
> expected in our test setup because we avoid writing zero to evt->lpos as
> part of dwc3_event_buffers_cleanup if evt->flags have a value of 1. This

Oh ok. I did not know you made this modification.

> is simply to track the status of evt->lpos during suspend to resume when
> evt->flags have a value of DWC3_EVENT_PENDING. The following test codes
> for the reference.
>
> --- a/drivers/usb/dwc3/core.c
> +++ b/drivers/usb/dwc3/core.c
> @@ -505,8 +505,20 @@ static int dwc3_alloc_event_buffers(struct dwc3
> *dwc, unsigned int length)
>  int dwc3_event_buffers_setup(struct dwc3 *dwc)
>  {
>         struct dwc3_event_buffer        *evt;
> +       u32                             reg;
>
>         evt = dwc->ev_buf;
> +
> +       if (evt->flags & DWC3_EVENT_PENDING) {
> +               pr_info("evt->length :%x\n", evt->length);
> +               pr_info("evt->lpos :%x\n", evt->lpos);
> +               pr_info("evt->count :%x\n", evt->count);
> +               pr_info("evt->flags :%x\n", evt->flags);
> +
> +               dwc3_exynos_reg_dump(dwc);
> +
> +       }
> +
>         evt->lpos = 0;
>         dwc3_writel(dwc->regs, DWC3_GEVNTADRLO(0),
>                         lower_32_bits(evt->dma));
> @@ -514,8 +526,10 @@ int dwc3_event_buffers_setup(struct dwc3 *dwc)
>                         upper_32_bits(evt->dma));
>         dwc3_writel(dwc->regs, DWC3_GEVNTSIZ(0),
>                         DWC3_GEVNTSIZ_SIZE(evt->length));
> -       dwc3_writel(dwc->regs, DWC3_GEVNTCOUNT(0), 0);
>
> +       /* Clear any stale event */
> +       reg = dwc3_readl(dwc->regs, DWC3_GEVNTCOUNT(0));
> +       dwc3_writel(dwc->regs, DWC3_GEVNTCOUNT(0), reg);
>         return 0;
>  }
>
> @@ -525,7 +539,16 @@ void dwc3_event_buffers_cleanup(struct dwc3 *dwc)
>
>         evt = dwc->ev_buf;
>
> -       evt->lpos = 0;
> +       if (evt->flags & DWC3_EVENT_PENDING) {
> +               pr_info("evt->length :%x\n", evt->length);
> +               pr_info("evt->lpos :%x\n", evt->lpos);
> +               pr_info("evt->count :%x\n", evt->count);
> +               pr_info("evt->flags :%x\n", evt->flags);
> +
> +               dwc3_exynos_reg_dump(dwc);
> +       } else {
> +               evt->lpos = 0;

I wasn't aware of this change.

> +       }
>
> >
> >>  evt->count     :0x0
> >>  evt->flags     :0x1 // Still It's not clearing in during resume.
> >>
> >>  DWC3_GEVNTSIZ(0)(0xc408)       : 0x00000000
> >>  DWC3_GEVNTCOUNT(0)(0xc40c)     : 0x00000000
> >>  DWC3_DCFG(0xc700)              : 0x00080800
> >>  DWC3_DCTL(0xc704)              : 0x00f00000
> >>  DWC3_DEVTEN(0xc708)            : 0x00000000
> >>  DWC3_DSTS(0xc70c)              : 0x00d20001
> >>
> > Please help look into your platform to see what condition triggers this
> > memory access issue. If this is a hardware quirk, we can properly update
> > the change and note it to be so.
>
> Sure I will try to figure it out. However, we are facing challenges in
> reproducing the issue. There could be a delay in understanding the
> conditions that trigger the memory issue if it is related to a memory issue.
>
> >
> > Thanks,
> > Thinh
> >
> > (If possible, for future tests, please dump the dwc3 tracepoints. Many
> > thanks for the tests.)
>
> I tried to get dwc3 traces in the failure case, but so far no instances
> have been reported. Our testing is still in progress with enable dwc3
> traces.
>
> I will keep posting once I get the dwc3 traces in the failure scenario.
>

Thanks for the update. I hope enabling of the driver tracepoints will
not impact the reproduction of the issue. With the driver log, we'll get
more clues to what was going on.

Thanks,
Thinh