Re: [PATCH v3] usb: gadget: u_ether: Use __netif_rx() in rx_callback()

From: Hubert Wiśniewski
Date: Tue Oct 01 2024 - 10:07:12 EST

On Fri, 2024-09-27 at 16:12 +0200, Sebastian Andrzej Siewior wrote:
> On 2024-09-27 15:33:35 [+0200], Hubert Wiśniewski wrote:
> > On Thu, 2024-09-26 at 21:39 +0200, Hubert Wiśniewski wrote:
> > > I'm a bit at loss here. The deadlock seems to be unrelated to netif_rx()
> > > (which is not being called in the interrupt context after all), yet
> > > replacing it with __netif_rx() fixes the lockup (though a warning is still
> > > generated, which suggests that the patch does not completely fix the
> > > issue).
> >
> > Well, never mind. After some investigation, I think the problem is as
> > follows:
> >
> > 1. musb_g_giveback() releases the musb lock using spin_unlock(). The lock
> > is now released, but hardirqs are still disabled.
> >
> > 2. Then, usb_gadget_giveback_request() is called, which in turn calls
> > rx_complete(). This does not happen in the interrupt context, so netif_rx()
> > disables bottom halves, then enables them using local_bh_enable().
> >
> > 3. This leads to calling __local_bh_enable_ip(), which gives off a warning
> > (the first backtrace) that hardirqs are disabled. Then, hardirqs are
> > disabled (again?), and then enabled (as they should have been in the first
> > place).
> >
> > 4. After usb_gadget_giveback_request() returns, musb_g_giveback() acquires
> > the musb lock using spin_lock(). This does not disable hardirqs, so they
> > are still enabled.
> >
> > 5. While the musb lock is acquired, an interrupt occurs. It is handled by
> > dsps_interrupt(), which acquires the musb lock. A deadlock occurs.
>
> This all makes sense so far.

I have done more testing on this. It seems that this deadlock possibility
reported by lockdep is not the cause, but just a symptom.

For now, my conclusion is that the problem lies in the MUSB gadget driver
itself. In peripheral mode, interrupts on Rx endpoints are handled by
musb_g_rx(), which pulls requests from the EP request queue. If no request
is queued, it just returns without clearing the RXPKTRDY flag in the RXCSR
register (although the interrupt flag in the glue layer register has
already been cleared by the glue layer IRQ handler). The received packet
then has to wait for the next interrupt. If the Rx FIFO is full, no more
packets are received and no more interrupts are generated. The EP stays
locked up forever (or until the RXPKTRDY flag is cleared manually :)).

From what I have learned, the request queue being empty just happens
sometimes and it is not an error.

This bug was exposed by the new behaviour of netif_rx(). When BHs are
enabled, hardirqs are enabled too (for a moment), which allows the Rx
interrupt to be handled before a request is enqueued. If there are enough
such unhandled packets, the EP gets locked up.
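
To illustrate the lock-up described above, here is a toy userspace model in
C. It is only a sketch, not the MUSB code: struct toy_ep,
toy_packet_arrives(), toy_rx_irq() and the two-slot FIFO are all made up
for illustration, and queueing a request is assumed not to trigger any FIFO
processing by itself.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_FIFO_SLOTS 2	/* assumption: a two-packet Rx FIFO */

/* Hypothetical model of one Rx endpoint; not a real MUSB structure. */
struct toy_ep {
	int fifo_used;		/* packets currently held in the Rx FIFO */
	bool rxpktrdy;		/* models the RXPKTRDY flag in RXCSR */
	bool irq_pending;	/* models the glue layer interrupt flag */
	int queued_reqs;	/* requests waiting in the EP request queue */
	int delivered;		/* packets handed over to a request */
};

/* "Hardware": a packet arrives. A full FIFO accepts nothing and
 * therefore raises no interrupt either. */
static bool toy_packet_arrives(struct toy_ep *ep)
{
	if (ep->fifo_used == TOY_FIFO_SLOTS)
		return false;
	ep->fifo_used++;
	ep->rxpktrdy = true;
	ep->irq_pending = true;
	return true;
}

/* "Driver": the glue layer acks the interrupt, then the Rx handler runs. */
static void toy_rx_irq(struct toy_ep *ep)
{
	if (!ep->irq_pending)
		return;
	ep->irq_pending = false;

	/* No request queued: return with RXPKTRDY still set, so the
	 * packet stays in the FIFO until some later interrupt. */
	if (ep->queued_reqs == 0)
		return;

	ep->queued_reqs--;
	ep->fifo_used--;
	ep->delivered++;
	ep->rxpktrdy = ep->fifo_used > 0;
	assert(ep->fifo_used >= 0);
}

/* Fill the FIFO while no request is queued, then queue one request. */
static int toy_lockup_demo(void)
{
	struct toy_ep ep = {0};

	for (int i = 0; i < TOY_FIFO_SLOTS; i++) {
		toy_packet_arrives(&ep);
		toy_rx_irq(&ep);	/* handled too early: nothing queued */
	}

	ep.queued_reqs = 1;		/* the request shows up too late */
	toy_packet_arrives(&ep);	/* rejected: the FIFO is full */
	toy_rx_irq(&ep);		/* no interrupt pending: does nothing */

	return ep.delivered;
}
```

In this model toy_lockup_demo() returns 0: the request that is queued later
is never served, because the full FIFO prevents any further interrupt.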

> > Replacing netif_rx() with __netif_rx() apparently fixes this part, as it
> > does not lead to any change of hardirq state. There is still one problem
> > though: rx_complete() is usually called from the interrupt context, except
> > when the network interface is brought up.
>
> __netif_rx() has an assert which should complain if you use
> __netif_rx(). Further in this case you pass the skb to backlog but never
> kick it for processing. Which means it is delayed until a random
> interrupt notices and processes it.

Now I see that it was a bad idea. I just found this using git bisect.

> > I think one solution would be to make musb_g_giveback() use
> > spin_unlock_irqrestore() and spin_lock_irqsave(), but I would need to pass
> > the flags to it somehow. Also, I am not sure how that would influence other
> > drivers using musb.
>
> I would also suggest to do this since the other solution is not safe/
> correct. There is the ->busy assignment which should cover for the most
> cases. If you drop the lock without enabling interrupts then the
> interrupt can't do anything to the EP and other enqueue/ dequeue
> invocation is not possible if run on UP. On the other hand am335x was
> used on PREEMPT_RT and it runs a UP machine into SMP so that should be
> covered :)
>
> While looking at it, dequeue/ enqueue during complete callback looks
> safe due to the busy flag.

I think it is not needed now. After I modified the interrupt handling code
to clear the RXPKTRDY flag if there is no request queued and the FIFO is
full, neither __local_bh_enable_ip() nor lockdep complains (tested on SMP
and UP, with and without PREEMPT, on AM3358 and A64).
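
The idea of that change can be sketched in a toy userspace model in C. This
is not the actual patch: struct toy_ep, toy_packet_arrives(),
toy_rx_irq_fixed() and the two-slot FIFO are made up for illustration, and
clearing RXPKTRDY is assumed here to simply discard one buffered packet.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_FIFO_SLOTS 2	/* assumption: a two-packet Rx FIFO */

/* Hypothetical model of one Rx endpoint; not a real MUSB structure. */
struct toy_ep {
	int fifo_used;		/* packets currently held in the Rx FIFO */
	bool rxpktrdy;		/* models the RXPKTRDY flag in RXCSR */
	bool irq_pending;	/* models the glue layer interrupt flag */
	int queued_reqs;	/* requests waiting in the EP request queue */
	int delivered;		/* packets handed over to a request */
	int dropped;		/* packets discarded by the workaround */
};

/* "Hardware": a full FIFO accepts nothing and raises no interrupt. */
static bool toy_packet_arrives(struct toy_ep *ep)
{
	if (ep->fifo_used == TOY_FIFO_SLOTS)
		return false;
	ep->fifo_used++;
	ep->rxpktrdy = true;
	ep->irq_pending = true;
	return true;
}

/* "Driver" with the workaround applied. */
static void toy_rx_irq_fixed(struct toy_ep *ep)
{
	if (!ep->irq_pending)
		return;
	ep->irq_pending = false;

	if (ep->queued_reqs == 0) {
		/* No request and the FIFO is full: clear RXPKTRDY so that
		 * reception (and with it, interrupts) can continue. */
		if (ep->fifo_used == TOY_FIFO_SLOTS) {
			ep->fifo_used--;
			ep->dropped++;
			ep->rxpktrdy = ep->fifo_used > 0;
		}
		return;
	}

	ep->queued_reqs--;
	ep->fifo_used--;
	ep->delivered++;
	ep->rxpktrdy = ep->fifo_used > 0;
	assert(ep->fifo_used >= 0);
}

/* Fill the FIFO while no request is queued, then queue one request. */
static int toy_fixed_demo(void)
{
	struct toy_ep ep = {0};

	for (int i = 0; i < TOY_FIFO_SLOTS; i++) {
		toy_packet_arrives(&ep);
		toy_rx_irq_fixed(&ep);	/* frees a slot once the FIFO is full */
	}

	ep.queued_reqs = 1;
	if (toy_packet_arrives(&ep))	/* accepted: a slot is free again */
		toy_rx_irq_fixed(&ep);	/* and the request gets served */

	return ep.delivered;
}
```

In this model toy_fixed_demo() returns 1: one packet is sacrificed, but the
endpoint keeps receiving and the request queued later is served.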

It would probably be nicer to ensure that no MUSB interrupts are handled
when a MUSB request callback is invoked from musb_g_giveback() (e.g. by
disabling MUSB interrupts before releasing the lock and enabling them after
acquiring it), but that could cause some side effects if the callback
relied on MUSB interrupts being enabled. And since there are no warnings
and everything works... I guess it is time to submit another patch then and
to forget about this one.

Thank you for your time!
--
Hubert Wiśniewski <hubert.wisniewski.25632@xxxxxxxxx>
