Re: [PATCH] usb: dwc3: Potential fix of possible dwc3 interrupt storm

From: Selvarasu Ganesan
Date: Mon Sep 02 2024 - 07:27:55 EST



On 8/31/2024 6:20 AM, Thinh Nguyen wrote:
> Hi Selvarasu,
>
> On Fri, Aug 30, 2024, Selvarasu Ganesan wrote:
>> On 8/10/2024 5:12 AM, Thinh Nguyen wrote:
>>> On Thu, Aug 08, 2024, Selvarasu Ganesan wrote:
>>>> On 8/8/2024 6:45 AM, Thinh Nguyen wrote:
>>>>> On Wed, Aug 07, 2024, Selvarasu Ganesan wrote:
>>>>>> On 8/7/2024 6:08 AM, Thinh Nguyen wrote:
>>>>>>> On Fri, Jul 19, 2024, Selvarasu Ganesan wrote:
>>>>>>>> In certain scenarios, there is a chance that the CPU may not be
>>>>>>>> scheduled the bottom half of dwc3 interrupt. This is because the CPU
>>>>>>>> may hang up where any work queue lockup has happened for the same CPU
>>>>>>>> that is trying to schedule the dwc3 thread interrupt. In this scenario,
>>>>>>>> the USB can enter runtime suspend as the bus may idle for a longer time
>>>>>>>> , or user can reconnect the USB cable. Then, the dwc3 event interrupt
>>>>>>>> can be enabled when runtime resume is happening with regardless of the
>>>>>>>> previous event status. This can lead to a dwc3 IRQ storm due to the
>>>>>>>> return from the interrupt handler by checking only the evt->flags as
>>>>>>>> DWC3_EVENT_PENDING, where the same flag was set as DWC3_EVENT_PENDING
>>>>>>>> in previous work queue lockup.
>>>>>>>> Let's consider the following sequences in this scenario,
>>>>>>>>
>>>>>>>> Call trace of dwc3 IRQ after workqueue lockup scenario
>>>>>>>> ======================================================
>>>>>>>> IRQ #1:
>>>>>>>> ->dwc3_interrupt()
>>>>>>>> ->dwc3_check_event_buf()
>>>>>>>> ->if (evt->flags & DWC3_EVENT_PENDING)
>>>>>>>> return IRQ_HANDLED;
>>>>>>>> ->evt->flags |= DWC3_EVENT_PENDING;
>>>>>>>> ->/* Disable interrupt by setting DWC3_GEVNTSIZ_INTMASK in
>>>>>>>> DWC3_GEVNTSIZ
>>>>>>>> ->return IRQ_WAKE_THREAD; // No workqueue scheduled for dwc3
>>>>>>>> thread_fu due to workqueue lockup
>>>>>>>> even after return IRQ_WAKE_THREAD
>>>>>>>> from top-half.
>>>>>>>>
>>>>>>>> Thread #2:
>>>>>>>> ->dwc3_runtime_resume()
>>>>>>>> ->dwc3_resume_common()
>>>>>>>> ->dwc3_gadget_resume()
>>>>>>>> ->dwc3_gadget_soft_connect()
>>>>>>>> ->dwc3_event_buffers_setup()
>>>>>>>> ->/*Enable interrupt by clearing DWC3_GEVNTSIZ_INTMASK in
>>>>>>>> DWC3_GEVNTSIZ*/
>>>>>>>>
>>>>>>>> Start IRQ Storming after enable dwc3 event in resume path
>>>>>>>> =========================================================
>>>>>>>> CPU0: IRQ
>>>>>>>> dwc3_interrupt()
>>>>>>>> dwc3_check_event_buf()
>>>>>>>> if (evt->flags & DWC3_EVENT_PENDING)
>>>>>>>> return IRQ_HANDLED;
>>>>>>>>
>>>>>>>> CPU0: IRQ
>>>>>>>> dwc3_interrupt()
>>>>>>>> dwc3_check_event_buf()
>>>>>>>> if (evt->flags & DWC3_EVENT_PENDING)
>>>>>>>> return IRQ_HANDLED;
>>>>>>>> ..
>>>>>>>> ..
>>>>>>>>
>>>>>>>> To fix this issue by avoiding enabling of the dwc3 event interrupt in
>>>>>>>> the runtime resume path if dwc3 event processing is in progress.
>>>>>>>>
>>>>>>>> Signed-off-by: Selvarasu Ganesan <selvarasu.g@xxxxxxxxxxx>
>>>>>>>> ---
>>>>>>>> drivers/usb/dwc3/core.c | 8 ++++++--
>>>>>>>> 1 file changed, 6 insertions(+), 2 deletions(-)
>>>>>>>>
>>>>>>>> diff --git a/drivers/usb/dwc3/core.c b/drivers/usb/dwc3/core.c
>>>>>>>> index cb82557678dd..610792a70805 100644
>>>>>>>> --- a/drivers/usb/dwc3/core.c
>>>>>>>> +++ b/drivers/usb/dwc3/core.c
>>>>>>>> @@ -549,8 +549,12 @@ int dwc3_event_buffers_setup(struct dwc3 *dwc)
>>>>>>>> lower_32_bits(evt->dma));
>>>>>>>> dwc3_writel(dwc->regs, DWC3_GEVNTADRHI(0),
>>>>>>>> upper_32_bits(evt->dma));
>>>>>>>> - dwc3_writel(dwc->regs, DWC3_GEVNTSIZ(0),
>>>>>>>> - DWC3_GEVNTSIZ_SIZE(evt->length));
>>>>>>>> +
>>>>>>>> + /* Skip enable dwc3 event interrupt if event is processing in middle */
>>>>>>>> + if (!(evt->flags & DWC3_EVENT_PENDING))
>>>>>>>> + dwc3_writel(dwc->regs, DWC3_GEVNTSIZ(0),
>>>>>>>> + DWC3_GEVNTSIZ_SIZE(evt->length));
>>>>>>>> +
>>>>>>>> dwc3_writel(dwc->regs, DWC3_GEVNTCOUNT(0), 0);
>>>>>>>>
>>>>>>>> return 0;
>>>>>>>> --
>>>>>>>> 2.17.1
>>>>>>>>
>>>>>>> We're not waking up from a hibernation. So after a soft-reset and
>>>>>>> resume, the events that weren't processed are stale. They should be
>>>>>>> processed prior to entering suspend or be discarded before resume.
>>>>>>>
>>>>>>> The synchronize_irq() during suspend() was not sufficient to prevent
>>>>>>> this? What are we missing here.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Thinh
>>>>>> I don’t think the triggering of interrupt would not be stopped even if
>>>>>> do soft reset. It's because of event count is may be valid .
>>>>> Ok. I think I see what you're referring to when you say "event is
>>>>> processing in the middle" now.
>>>>>
>>>>> What you want to check is probably this in dwc3_event_buffers_setup().
>>>>> Please confirm:
>>>>>
>>>>> if (dwc->pending_events)
>>>>> dwc3_writel(dwc->regs, DWC3_GEVNTSIZ(0),
>>>>> DWC3_GEVNTSIZ_INTMASK | DWC3_GEVNTSIZ_SIZE(evt->length));
>>>>> else
>>>>> dwc3_writel(dwc->regs, DWC3_GEVNTSIZ(0), DWC3_GEVNTSIZ_SIZE(evt->length));
>>>> Yes, we are expecting the same. But, we must verify the status of
>>>> evt->flags, which will indicate whether the event is currently
>>>> processing in middle or not. The below code is for the reference.
>>>>
>>>> if (!(evt->flags & DWC3_EVENT_PENDING))
>>>> dwc3_writel(dwc->regs, DWC3_GEVNTSIZ(0),
>>>> DWC3_GEVNTSIZ_SIZE(evt->length));
>>>> else
>>>> dwc3_writel(dwc->regs, DWC3_GEVNTSIZ(0),
>>>> DWC3_GEVNTSIZ_INTMASK | DWC3_GEVNTSIZ_SIZE(evt->length));
>>> So, this happens while pending_events is set right? I need to review
>>> this runtime suspend flow next week. Something doesn't look right. When
>>> there's a suspend/resume runtime or not, there's a soft disconnect. We
>>> shouldn't be processing any event prior to going into suspend. Also, we
>>> shouldn't be doing soft-disconnect while connected and in operation
>>> unless we specifically tell it to.
>> HI Thinh,
>>
>> Would you be able to review this runtime suspend flow?
>>
>> From our end, after conducting multiple regression tests, we have
>> determined that the resetting of "evt->flags" are sufficient when
>> attempting to enable event IRQ masks instead of enable event IRQ mask
>> based on pending event flags. There is a possibility that reconnecting
>> USB with the host PC may cause event interrupts to be missed by the CPU
>> if disable event IRQ mask.  So, The fix should be as follow. Could you
>> please review this once from your end?
>>
>> diff --git a/drivers/usb/dwc3/core.c b/drivers/usb/dwc3/core.c
>> index ccc3895dbd7f..3b2441608e9e 100644
>> --- a/drivers/usb/dwc3/core.c
>> +++ b/drivers/usb/dwc3/core.c
>> @@ -554,6 +554,15 @@ int dwc3_event_buffers_setup(struct dwc3 *dwc)
>>                         lower_32_bits(evt->dma));
>>         dwc3_writel(dwc->regs, DWC3_GEVNTADRHI(0),
>>                         upper_32_bits(evt->dma));
>> +
>> +       /*
>> +        * The DWC3_EVENT_PENDING flag is cleared if it has
>> +        * already been set when enabling the event IRQ mask
>> +        * to prevent possibly of an IRQ storm.
>> +        */
>> +       if (evt->flags & DWC3_EVENT_PENDING)
>> +               evt->flags &= ~DWC3_EVENT_PENDING;
>> +
>>         dwc3_writel(dwc->regs, DWC3_GEVNTSIZ(0),
>>                         DWC3_GEVNTSIZ_SIZE(evt->length));
>>         dwc3_writel(dwc->regs, DWC3_GEVNTCOUNT(0), 0);
>>
> Sorry for the delay response.
>
> In addition to that, please rework and rename
> dwc3_gadget_process_pending_event(). We're not supposed to handle any
> left-over event. So remove the manual calls to the interrupt handlers
> there.
Hi Thinh,

Thanks for your comments.

Regarding the handling of leftover events during dwc3 runtime resume, I
understand that we are not supposed to handle any leftover events. Would
you be interested in making changes to the code as suggested below? The
reason for removing interrupt handlers from
dwc3_gadget_process_pending_events() is to avoid handling any leftover
events from the last suspend right. If so, based on my understanding, we
can simply remove the use of dwc3_gadget_process_pending_events()
instead of rework on this function since it is not necessary if we
remove the call to interrupt handlers from there.

I would like to reconfirm from our end that in our failure scenario, we
observe that DWC3_EVENT_PENDING is set in evt->flags when the dwc3
resume sequence is executed, and the dwc->pending_events flag is not
being set.



diff --git a/drivers/usb/dwc3/core.c b/drivers/usb/dwc3/core.c
index ccc3895dbd7f..63e8dd24ad0e 100644
--- a/drivers/usb/dwc3/core.c
+++ b/drivers/usb/dwc3/core.c
@@ -550,6 +550,15 @@ int dwc3_event_buffers_setup(struct dwc3 *dwc)

        evt = dwc->ev_buf;
        evt->lpos = 0;
+
+       /*
+        * The DWC3_EVENT_PENDING flag is cleared if it has
+        * already been set when enabling the event IRQ mask
+        * to prevent possibly of an IRQ storm.
+        */
+       if (evt->flags & DWC3_EVENT_PENDING)
+               evt->flags &= ~DWC3_EVENT_PENDING;
+
        dwc3_writel(dwc->regs, DWC3_GEVNTADRLO(0),
lower_32_bits(evt->dma));
        dwc3_writel(dwc->regs, DWC3_GEVNTADRHI(0),
diff --git a/drivers/usb/dwc3/gadget.c b/drivers/usb/dwc3/gadget.c
index 89fc690fdf34..951c805337c2 100644
--- a/drivers/usb/dwc3/gadget.c
+++ b/drivers/usb/dwc3/gadget.c
@@ -4739,8 +4739,6 @@ int dwc3_gadget_resume(struct dwc3 *dwc)
 void dwc3_gadget_process_pending_events(struct dwc3 *dwc)
 {
        if (dwc->pending_events) {
- dwc3_interrupt(dwc->irq_gadget, dwc->ev_buf);
- dwc3_thread_interrupt(dwc->irq_gadget, dwc->ev_buf);
                pm_runtime_put(dwc->dev);
                dwc->pending_events = false;
enable_irq(dwc->irq_gadget);


Thanks,
Selva

>
> On runtime suspend, the device is soft disconnected. So any interrupt
> assertion to notify a new connection must be a custom configuration of
> your platform. No event should be generated while the run_stop bit is
> cleared.
>
> On runtime resume, we will initiate soft-reset and soft-connect to
> restore the run_stop bit. A new connection event will be generated then.

Agree.


>
> Thanks,
> Thinh