Re: [PATCH] usb: dwc3: Potential fix of possible dwc3 interrupt storm

From: Selvarasu Ganesan
Date: Fri Aug 30 2024 - 08:47:57 EST



On 8/10/2024 5:12 AM, Thinh Nguyen wrote:
> On Thu, Aug 08, 2024, Selvarasu Ganesan wrote:
>> On 8/8/2024 6:45 AM, Thinh Nguyen wrote:
>>> On Wed, Aug 07, 2024, Selvarasu Ganesan wrote:
>>>> On 8/7/2024 6:08 AM, Thinh Nguyen wrote:
>>>>> On Fri, Jul 19, 2024, Selvarasu Ganesan wrote:
>>>>>> In certain scenarios, there is a chance that the CPU may not be
>>>>>> scheduled the bottom half of dwc3 interrupt. This is because the CPU
>>>>>> may hang up where any work queue lockup has happened for the same CPU
>>>>>> that is trying to schedule the dwc3 thread interrupt. In this scenario,
>>>>>> the USB can enter runtime suspend as the bus may idle for a longer time
>>>>>> , or user can reconnect the USB cable. Then, the dwc3 event interrupt
>>>>>> can be enabled when runtime resume is happening with regardless of the
>>>>>> previous event status. This can lead to a dwc3 IRQ storm due to the
>>>>>> return from the interrupt handler by checking only the evt->flags as
>>>>>> DWC3_EVENT_PENDING, where the same flag was set as DWC3_EVENT_PENDING
>>>>>> in previous work queue lockup.
>>>>>> Let's consider the following sequences in this scenario,
>>>>>>
>>>>>> Call trace of dwc3 IRQ after workqueue lockup scenario
>>>>>> ======================================================
>>>>>> IRQ #1:
>>>>>> ->dwc3_interrupt()
>>>>>> ->dwc3_check_event_buf()
>>>>>> ->if (evt->flags & DWC3_EVENT_PENDING)
>>>>>> return IRQ_HANDLED;
>>>>>> ->evt->flags |= DWC3_EVENT_PENDING;
>>>>>> ->/* Disable interrupt by setting DWC3_GEVNTSIZ_INTMASK in
>>>>>> DWC3_GEVNTSIZ
>>>>>> ->return IRQ_WAKE_THREAD; // No workqueue scheduled for dwc3
>>>>>> thread_fu due to workqueue lockup
>>>>>> even after return IRQ_WAKE_THREAD
>>>>>> from top-half.
>>>>>>
>>>>>> Thread #2:
>>>>>> ->dwc3_runtime_resume()
>>>>>> ->dwc3_resume_common()
>>>>>> ->dwc3_gadget_resume()
>>>>>> ->dwc3_gadget_soft_connect()
>>>>>> ->dwc3_event_buffers_setup()
>>>>>> ->/*Enable interrupt by clearing DWC3_GEVNTSIZ_INTMASK in
>>>>>> DWC3_GEVNTSIZ*/
>>>>>>
>>>>>> Start IRQ Storming after enable dwc3 event in resume path
>>>>>> =========================================================
>>>>>> CPU0: IRQ
>>>>>> dwc3_interrupt()
>>>>>> dwc3_check_event_buf()
>>>>>> if (evt->flags & DWC3_EVENT_PENDING)
>>>>>> return IRQ_HANDLED;
>>>>>>
>>>>>> CPU0: IRQ
>>>>>> dwc3_interrupt()
>>>>>> dwc3_check_event_buf()
>>>>>> if (evt->flags & DWC3_EVENT_PENDING)
>>>>>> return IRQ_HANDLED;
>>>>>> ..
>>>>>> ..
>>>>>>
>>>>>> To fix this issue by avoiding enabling of the dwc3 event interrupt in
>>>>>> the runtime resume path if dwc3 event processing is in progress.
>>>>>>
>>>>>> Signed-off-by: Selvarasu Ganesan <selvarasu.g@xxxxxxxxxxx>
>>>>>> ---
>>>>>> drivers/usb/dwc3/core.c | 8 ++++++--
>>>>>> 1 file changed, 6 insertions(+), 2 deletions(-)
>>>>>>
>>>>>> diff --git a/drivers/usb/dwc3/core.c b/drivers/usb/dwc3/core.c
>>>>>> index cb82557678dd..610792a70805 100644
>>>>>> --- a/drivers/usb/dwc3/core.c
>>>>>> +++ b/drivers/usb/dwc3/core.c
>>>>>> @@ -549,8 +549,12 @@ int dwc3_event_buffers_setup(struct dwc3 *dwc)
>>>>>> lower_32_bits(evt->dma));
>>>>>> dwc3_writel(dwc->regs, DWC3_GEVNTADRHI(0),
>>>>>> upper_32_bits(evt->dma));
>>>>>> - dwc3_writel(dwc->regs, DWC3_GEVNTSIZ(0),
>>>>>> - DWC3_GEVNTSIZ_SIZE(evt->length));
>>>>>> +
>>>>>> + /* Skip enable dwc3 event interrupt if event is processing in middle */
>>>>>> + if (!(evt->flags & DWC3_EVENT_PENDING))
>>>>>> + dwc3_writel(dwc->regs, DWC3_GEVNTSIZ(0),
>>>>>> + DWC3_GEVNTSIZ_SIZE(evt->length));
>>>>>> +
>>>>>> dwc3_writel(dwc->regs, DWC3_GEVNTCOUNT(0), 0);
>>>>>>
>>>>>> return 0;
>>>>>> --
>>>>>> 2.17.1
>>>>>>
>>>>> We're not waking up from a hibernation. So after a soft-reset and
>>>>> resume, the events that weren't processed are stale. They should be
>>>>> processed prior to entering suspend or be discarded before resume.
>>>>>
>>>>> The synchronize_irq() during suspend() was not sufficient to prevent
>>>>> this? What are we missing here.
>>>>>
>>>>> Thanks,
>>>>> Thinh
>>>> I don’t think the triggering of interrupt would not be stopped even if
>>>> do soft reset. It's because of event count is may be valid .
>>> Ok. I think I see what you're referring to when you say "event is
>>> processing in the middle" now.
>>>
>>> What you want to check is probably this in dwc3_event_buffers_setup().
>>> Please confirm:
>>>
>>> if (dwc->pending_events)
>>> dwc3_writel(dwc->regs, DWC3_GEVNTSIZ(0),
>>> DWC3_GEVNTSIZ_INTMASK | DWC3_GEVNTSIZ_SIZE(evt->length));
>>> else
>>> dwc3_writel(dwc->regs, DWC3_GEVNTSIZ(0), DWC3_GEVNTSIZ_SIZE(evt->length));
>> Yes, we are expecting the same. But, we must verify the status of
>> evt->flags, which will indicate whether the event is currently
>> processing in middle or not. The below code is for the reference.
>>
>> if (!(evt->flags & DWC3_EVENT_PENDING))
>> dwc3_writel(dwc->regs, DWC3_GEVNTSIZ(0),
>> DWC3_GEVNTSIZ_SIZE(evt->length));
>> else
>> dwc3_writel(dwc->regs, DWC3_GEVNTSIZ(0),
>> DWC3_GEVNTSIZ_INTMASK | DWC3_GEVNTSIZ_SIZE(evt->length));
> So, this happens while pending_events is set right? I need to review
> this runtime suspend flow next week. Something doesn't look right. When
> there's a suspend/resume runtime or not, there's a soft disconnect. We
> shouldn't be processing any event prior to going into suspend. Also, we
> shouldn't be doing soft-disconnect while connected and in operation
> unless we specifically tell it to.
HI Thinh,

Would you be able to review this runtime suspend flow?

From our end, after conducting multiple regression tests, we have
determined that the resetting of "evt->flags" are sufficient when
attempting to enable event IRQ masks instead of enable event IRQ mask
based on pending event flags. There is a possibility that reconnecting
USB with the host PC may cause event interrupts to be missed by the CPU
if disable event IRQ mask.  So, The fix should be as follow. Could you
please review this once from your end?

diff --git a/drivers/usb/dwc3/core.c b/drivers/usb/dwc3/core.c
index ccc3895dbd7f..3b2441608e9e 100644
--- a/drivers/usb/dwc3/core.c
+++ b/drivers/usb/dwc3/core.c
@@ -554,6 +554,15 @@ int dwc3_event_buffers_setup(struct dwc3 *dwc)
                        lower_32_bits(evt->dma));
        dwc3_writel(dwc->regs, DWC3_GEVNTADRHI(0),
                        upper_32_bits(evt->dma));
+
+       /*
+        * The DWC3_EVENT_PENDING flag is cleared if it has
+        * already been set when enabling the event IRQ mask
+        * to prevent possibly of an IRQ storm.
+        */
+       if (evt->flags & DWC3_EVENT_PENDING)
+               evt->flags &= ~DWC3_EVENT_PENDING;
+
        dwc3_writel(dwc->regs, DWC3_GEVNTSIZ(0),
                        DWC3_GEVNTSIZ_SIZE(evt->length));
        dwc3_writel(dwc->regs, DWC3_GEVNTCOUNT(0), 0);

Thanks,
Selva
>
>>>> Let consider the scenarios where SW is not acknowledge the event count
>>>> after getting a interrupt with disable GEVNTSIZ_INTMASK =0. It will
>>>> triggering the spurious interrupts until enable GEVNTSIZ_INTMASK=1 or
>>>> acknowledge the event count by SW. This is happening here because of We
>>>> just return from interrupt handler by checking if evt->flags as
>>>> DWC3_EVENT_PENDING. Clearing of DWC3_EVENT_PENDING flag is done in
>>>> dwc3_thread_interrupt. In some scenario it's not happening due to cpu
>>>> hangup or work queue lockup.
>>> This can be mitigated by adjusting the imod_interval (interrupt
>>> moderation). Have you tried that?
>>
>> Yes we tried to play around the changing of imod interval value but no
>> improvements.
> Ok.
>
> Thanks,
> Thinh