Re: [PATCH] usb: chipidea: udc: reject non-control requests while controller is suspended

From: Andreea.Popescu@xxxxxxxxxxx

Date: Wed Apr 01 2026 - 02:45:36 EST

On Tue, Mar 31, 2026 at 12:21:45PM +0000, Andreea.Popescu@xxxxxxxxxxx wrote:
>> When Linux runtime PM autosuspends a ChipIdea UDC that is still
>> enumerated by the host, the driver gates the PHY clocks and marks
>> the controller as suspended (ci->in_lpm = 1) but deliberately leaves
>> gadget.speed unchanged so upper-layer gadget drivers do not see a
>> spurious disconnect.
>>
>> The problem is that those same drivers may continue to call
>> usb_ep_queue() during the autosuspend window. _hardware_enqueue()
>> silently adds the request to the endpoint queue and returns 0, but
>> hw_ep_prime() cannot succeed with gated clocks, so the completion
>> interrupt never fires. The request — and its backing buffer — is
>> permanently lost. The caller sees a successful return and never
>> frees the buffer.
>Won't the request complete normally after the gadget is resumed, or
>abnormally after a reset, disconnect, or shutdown? Either way, it
>wouldn't be lost permanently.
>
>Alan Stern
Thank you very much for the review!
On "complete normally after resume":
This would be true only if the runtime-resume path reprimed the pending endpoints. It does not. ci_controller_resume() clears PORTSC_PHCD and ci->in_lpm, restoring the PHY, but it performs no endpoint repriming. The TD that was enqueued during the suspended window has its DMA node linked in hwep->qh.queue and the QH's td.next is written, but the OP_ENDPTPRIME write inside hw_ep_prime() was a no-op against gated clocks. After resume the controller has no knowledge of that TD — the ENDPTPRIME/ENDPTSTAT bits are clean — so it never processes it. The request is not picked up automatically.
A subsequent request on the same endpoint would be appended to the existing TD chain via the "non-empty queue" branch of _hardware_enqueue(), which does not issue a fresh prime either; it relies on the hardware already being active on that endpoint. Since the first prime was lost, that chain never becomes active.

On "abnormally after reset/disconnect/shutdown":
Correct: _gadget_stop_activity() → _ep_nuke() would drain the queues. However the target deployment is a fixed embedded topology: the USB cable is permanently wired between the device and the host, no physical disconnect occurs. The gadget function carries a continuous data stream (a transport-layer protocol that generates background traffic — keepalives, ACKs — even during otherwise-idle periods). In this environment runtime PM autosuspend fires on the 2-second idle timeout, the host continues to trigger upper-layer submissions, and the connection runs uninterrupted for hours. Without an explicit disconnect or reset there is no natural cleanup path, so the stranded requests accumulate for the lifetime of the session.
The fix returns -ESHUTDOWN immediately, which is the same status _ep_nuke() sets. The caller treats it as a completed-with-error request, frees the buffer, and does not re-queue — matching the behaviour that would occur if the device had actually disconnected.

Concrete example — usb_diag_write() transmit path during autosuspend:
The device exposes a USB diagnostic interface (f_diag). A diagnostic session is running between the device's diagnostic daemon and the host-side tool. The controller autosuspends after 2 seconds of idle activity. Shortly after, the diagnostic core generates a log packet — for example a response to a pending request — and calls usb_diag_write().
The call sequence inside usb_diag_write() in f_diag.c without the patch:
ci_runtime_suspend() has already fired: PORTSC_PHCD set, ci->in_lpm = 1, gadget.speed still USB_SPEED_HIGH.
usb_diag_write() checks !ctxt->configured || !ctxt->in — both are valid since the gadget did not disconnect. Proceeds.
Dequeues a usb_request from ctxt->write_pool, sets req->buf = d_req->buf and calls kref_get(&ctxt->kref).
ctxt->dpkts_tolaptop_pending++ — pending counter incremented.
usb_ep_queue(in, req, GFP_ATOMIC) → ep_queue() → _ep_queue() → _hardware_enqueue(). hw_ep_prime() writes ENDPTPRIME against gated clocks — no-op. Returns 0.
usb_ep_queue returns 0. The error-recovery branch (list_add_tail back to write_pool, dpkts_tolaptop_pending--, kref_put) is not taken. usb_diag_write() returns 0.
The complete callback (diag_write_complete) never fires because the hardware never processed the TD. Consequences:
req is permanently removed from write_pool. After a handful of autosuspend cycles, write_pool is exhausted and every subsequent usb_diag_write() returns -EAGAIN.
kref_get called but kref_put in the complete callback never called — diag_context reference count is unbalanced.
dpkts_tolaptop_pending is permanently incremented — counter drifts.
d_req->buf (the diagnostic buffer allocated by the diag core) is never returned via the completion callback — the diag core's buffer pool drains.
The diagnostic session stalls silently: the host tool stops receiving responses, with no error indication on either side.

With the patch, step 5 returns -ESHUTDOWN instead of 0. The error path in usb_diag_write() executes: req is returned to write_pool, dpkts_tolaptop_pending is decremented, kref_put is called, and -EIO is returned to the diag core. The diag core can retry the write after runtime resume, write_pool remains intact, and the diagnostic session survives the autosuspend cycle.

________________________________________
De la: Alan Stern <stern@xxxxxxxxxxxxxxxxxxx>
Trimis: marți, 31 martie 2026 18:39
Către: Popescu, Andreea
Cc: Peter Chen; Greg Kroah-Hartman; linux-usb@xxxxxxxxxxxxxxx; linux-kernel@xxxxxxxxxxxxxxx
Subiect: Re: [PATCH] usb: chipidea: udc: reject non-control requests while controller is suspended

CAUTION: This is an external email. Do not click or open any attachments unless you recognize the sender and know that the content is safe. (http://links.aumovio-internal.com/mailcheck)

On Tue, Mar 31, 2026 at 12:21:45PM +0000, Andreea.Popescu@xxxxxxxxxxx wrote:
> When Linux runtime PM autosuspends a ChipIdea UDC that is still
> enumerated by the host, the driver gates the PHY clocks and marks
> the controller as suspended (ci->in_lpm = 1) but deliberately leaves
> gadget.speed unchanged so upper-layer gadget drivers do not see a
> spurious disconnect.
>
> The problem is that those same drivers may continue to call
> usb_ep_queue() during the autosuspend window. _hardware_enqueue()
> silently adds the request to the endpoint queue and returns 0, but
> hw_ep_prime() cannot succeed with gated clocks, so the completion
> interrupt never fires. The request — and its backing buffer — is
> permanently lost. The caller sees a successful return and never
> frees the buffer.

Won't the request complete normally after the gadget is resumed, or
abnormally after a reset, disconnect, or shutdown? Either way, it
wouldn't be lost permanently.

Alan Stern