Re: [PATCH] usb: dwc3: Potential fix of possible dwc3 interrupt storm
From: Selvarasu Ganesan
Date: Fri Sep 13 2024 - 08:42:39 EST
On 9/11/2024 5:54 AM, Thinh Nguyen wrote:
> On Tue, Sep 10, 2024, Selvarasu Ganesan wrote:
>> On 9/7/2024 6:09 AM, Thinh Nguyen wrote:
>>> On Sat, Sep 07, 2024, Selvarasu Ganesan wrote:
>>>> Hi Thinh,
>>>>
>>>> I ran the code you recommended on our testing environment and was able
>>>> to reproduce the issue one time.
>>>>
>>>> When evt->flags contains DWC3_EVENT_PENDING, I've included the following
>>>> debugging information: I added this debug message at the start of
>>>> dwc3_event_buffers_cleanup and dwc3_event_buffers_setup functions in
>>>> during suspend and resume.
>>>>
>>>> The results were quite interesting . I'm curious to understand how
>>>> evt->flags is set to DWC3_EVENT_PENDING, and along with DWC3_GEVNTSIZ is
>>>> equal to 0x1000 during the suspend.
>>> That is indeed strange.
>>>
>>>> Its means that the previous bottom-half handler prior to suspend might
>>>> still be executing in the middle of the process.
>>>>
>>>> Could you please give your suggestions here? And let me know if anything
>>>> want to test or additional details are required.
>>>>
>>>>
>>>> ##DBG: dwc3_event_buffers_cleanup:
>>>> evt->length :0x1000
>>>> evt->lpos :0x20c
>>>> evt->count :0x0
>>>> evt->flags :0x1 // This is Unexpected if DWC3_GEVNTSIZ(0)(0xc408):
>>>> 0x00001000. Its means that previous bottom-half handler may be still
>>>> running in middle
>>> Perhaps.
>>>
>>> But I doubt that's the case since it shouldn't take that long for the
>>> bottom-half to be completed before the next resume yet the flag is still
>>> set.
>>>
>>>> DWC3_GEVNTSIZ(0)(0xc408) : 0x00001000
>>>> DWC3_GEVNTCOUNT(0)(0xc40c) : 0x00000000
>>>> DWC3_DCFG(0xc700) : 0x00e008a8
>>>> DWC3_DCTL(0xc704) : 0x0cf00a00
>>>> DWC3_DEVTEN(0xc708) : 0x00000000
>>>> DWC3_DSTS(0xc70c) : 0x00d20cd1
>>>>
>>> The controller status is halted. So there's no problem with
>>> soft-disconnect. For the interrupt mask in GEVNTSIZ to be cleared,
>>> that likely means that the bottom-half had probably completed.
>> Agree, But I am worrying on If the bottom-half is completed, then
>> DWC3_EVENT_PENDING must be cleared in evt->flags.
>> Is there any possibility of a CPU reordering issue when updating
>> evt->flags in the bottom-half handler?.
>> Should I try with wmb() when writing to evt->flags?
> Assuming that the problem occurs after the bottom-half completed, there
> should be implicit memory barrier. The memory operation should complete
> before the release from spin_unlock complete. I don't think wmb() will
> help.
Agree.
>>>> ##DBG: dwc3_event_buffers_setup:
>>>> evt->length :0x1000
>>>> evt->lpos :0x20c
>>> They fact that evt->lpos did not get updated tells me that there's
>>> something wrong with memory access to your platform during suspend and
>>> resume.
>> Are you expecting the evt->lpos value to be zero here? If so, this is
>> expected in our test setup because we avoid writing zero to evt->lpos as
>> part of dwc3_event_buffers_cleanup if evt->flags have a value of 1. This
> Oh ok. I did not know you made this modification.
>
>> is simply to track the status of evt->lpos during suspend to resume when
>> evt->flags have a value of DWC3_EVENT_PENDING. The following test codes
>> for the reference.
>>
>> --- a/drivers/usb/dwc3/core.c
>> +++ b/drivers/usb/dwc3/core.c
>> @@ -505,8 +505,20 @@ static int dwc3_alloc_event_buffers(struct dwc3
>> *dwc, unsigned int length)
>> int dwc3_event_buffers_setup(struct dwc3 *dwc)
>> {
>> struct dwc3_event_buffer *evt;
>> + u32 reg;
>>
>> evt = dwc->ev_buf;
>> +
>> + if (evt->flags & DWC3_EVENT_PENDING) {
>> + pr_info("evt->length :%x\n", evt->length);
>> + pr_info("evt->lpos :%x\n", evt->lpos);
>> + pr_info("evt->count :%x\n", evt->count);
>> + pr_info("evt->flags :%x\n", evt->flags);
>> +
>> + dwc3_exynos_reg_dump(dwc);
>> +
>> + }
>> +
>> evt->lpos = 0;
>> dwc3_writel(dwc->regs, DWC3_GEVNTADRLO(0),
>> lower_32_bits(evt->dma));
>> @@ -514,8 +526,10 @@ int dwc3_event_buffers_setup(struct dwc3 *dwc)
>> upper_32_bits(evt->dma));
>> dwc3_writel(dwc->regs, DWC3_GEVNTSIZ(0),
>> DWC3_GEVNTSIZ_SIZE(evt->length));
>> - dwc3_writel(dwc->regs, DWC3_GEVNTCOUNT(0), 0);
>>
>> + /* Clear any stale event */
>> + reg = dwc3_readl(dwc->regs, DWC3_GEVNTCOUNT(0));
>> + dwc3_writel(dwc->regs, DWC3_GEVNTCOUNT(0), reg);
>> return 0;
>> }
>>
>> @@ -525,7 +539,16 @@ void dwc3_event_buffers_cleanup(struct dwc3 *dwc)
>>
>> evt = dwc->ev_buf;
>>
>> - evt->lpos = 0;
>> + if (evt->flags & DWC3_EVENT_PENDING) {
>> + pr_info("evt->length :%x\n", evt->length);
>> + pr_info("evt->lpos :%x\n", evt->lpos);
>> + pr_info("evt->count :%x\n", evt->count);
>> + pr_info("evt->flags :%x\n", evt->flags);
>> +
>> + dwc3_exynos_reg_dump(dwc);
>> + } else {
>> + evt->lpos = 0;
> I wasn't aware of this change.
>
>> + }
>>
>>>> evt->count :0x0
>>>> evt->flags :0x1 // Still It's not clearing in during resume.
>>>>
>>>> DWC3_GEVNTSIZ(0)(0xc408) : 0x00000000
>>>> DWC3_GEVNTCOUNT(0)(0xc40c) : 0x00000000
>>>> DWC3_DCFG(0xc700) : 0x00080800
>>>> DWC3_DCTL(0xc704) : 0x00f00000
>>>> DWC3_DEVTEN(0xc708) : 0x00000000
>>>> DWC3_DSTS(0xc70c) : 0x00d20001
>>>>
>>> Please help look into your platform to see what condition triggers this
>>> memory access issue. If this is a hardware quirk, we can properly update
>>> the change and note it to be so.
>> Sure I will try to figure it out. However, we are facing challenges in
>> reproducing the issue. There could be a delay in understanding the
>> conditions that trigger the memory issue if it is related to a memory issue.
>>
>>> Thanks,
>>> Thinh
>>>
>>> (If possible, for future tests, please dump the dwc3 tracepoints. Many
>>> thanks for the tests.)
>> I tried to get dwc3 traces in the failure case, but so far no instances
>> have been reported. Our testing is still in progress with enable dwc3
>> traces.
>>
>> I will keep posting once I get the dwc3 traces in the failure scenario.
>>
> Thanks for the update. I hope enabling of the driver tracepoints will
> not impact the reproduction of the issue. With the driver log, we'll get
> more clues to what was going on.
>
> Thanks,
> Thinh
Hi Thinh,
So far, there have been no reported error instances. But, we suspecting
that the issue may be related to our glue driver. In our glue driver, we
access the reference of evt->flags when starting or stopping the gadget
based on a VBUS notification. We apologize for sharing this information
so late, as we only became aware of it recently.
The following sequence outlines the possible scenarios of race conditions:
Thread#1 (Our glue Driver Sequence)
===================================
->USB VBUS notification
->Start/Stop gadget
->dwc->ev_buf->flags |= BIT(20); (It's for our reference)
->Call dwc3 exynos runtime suspend/resume
->dwc->ev_buf->flags &= ~BIT(20);
->Call dwc3 core runtime suspend/resume
Thread#2
========
->dwc3_interrupt()
->evt->flags |= DWC3_EVENT_PENDING;
->dwc3_thread_interrupt()
->evt->flags &= ~DWC3_EVENT_PENDING;
After our internal discussions, we have decided to remove the
unnecessary access to evt->flag in our glue driver. We have made these
changes and initiated testing.
Thank you for your help so far to understand more into our glue driver code.
And We are thinking that it would be fine to reset evt->flag when the
USB controller is started, along with the changes you suggested earlier.
This additional measure will help prevent similar issues from occurring
in the future.
Please let us know your thoughts on this proposal. If it is not
necessary, we understand and will proceed accordingly.
Thanks,
Selva