Re: [PATCH] usb: xhci: add xhci_halt() for HCE Handling

From: Thinh Nguyen

Date: Fri Feb 27 2026 - 19:19:48 EST

On Fri, Feb 27, 2026, Mathias Nyman wrote:
> On 2/26/26 20:17, Thinh Nguyen wrote:
> > On Thu, Feb 26, 2026, Mathias Nyman wrote:
> > > On 2/26/26 11:27, Dayu Jiang wrote:
> > > > Hi Greg,
> > > >
> > > > I have updated the changelog text as requested and resubmitted the patch.
> > > > https://urldefense.com/v3/__https://lore.kernel.org/linux-usb/20260128100746.561626-1-jiangdayu@xxxxxxxxxx/__;!!A4F2R9G_pg!ZSJNDKyOinm26qngopLW-axiQtwDAMely4bDqtqYDGv1ErWCtS6kZ6ZamdiKoZKuCyCk0IxMQK5g625GEIxYWFzKpAEiCUq7$
> > > > Please kindly review it and let me know if it is acceptable now.
> > >
> > > I'll send it forward, but changed the commit message.
> > > Does this modified version still describe the case accurately:
> > >
> > > usb: xhci: Prevent interrupt storm on host controller error (HCE)
> > >
> > > The xHCI controller reports a Host Controller Error (HCE) in UAS Storage
> > > Device plug/unplug scenarios on Android devices, which is checked in
> > > xhci_irq() function and causes an interrupt storm (since the interrupt
> > > isn’t cleared), leading to severe system-level faults.
> > >
> > > When the xHC controller reports HCE in the interrupt handler, the driver
> > > only logs a warning and assumes xHC activity will stop. The interrupt storm
> > > does however continue until driver manually disables xHC interrupt and
> > > stops the controller by calling xhci_halt().
> > >
> > > The change is made in xhci_irq() function where STS_HCE status is
> > > checked, mirroring the existing error handling pattern used for
> > > STS_FATAL errors.
> > >
> > > This only fixes the interrupt storm. Proper HCE recovery requires resetting
> > > and re-initializing the xHC.
> > >
> >
> > The controller is halted if there's an error like HCE. It's odd to try
> > to "halt" it again. Not sure how this will impact other controllers.
>
> This is why I changed the commit message from:
>
> "When the xHCI controller reports HCE in the interrupt handler, the driver
> currently only logs a warning and continues execution. However, HCE
> indicates a critical hardware failure that requires the controller to be
> halted. This ensures the controller is in a consistent state and prevents
> further operations on failed hardware."
>
> to:
>
> "When the xHC controller reports HCE in the interrupt handler, the driver
> only logs a warning and assumes xHC activity will stop. The interrupt storm
> does however continue until driver manually disables xHC interrupt and
> stops the controller by calling xhci_halt()."
>
> I can clarify it further by stating that .."assumes xHC activity will stop
> as stated in xHCI spec. On some xHC controllers an interrupt storm continues after
> HCE error, and only ceases after manually"..
>
> The host is messed up at this point, and we are not recovering it. I don't think
> there is any harm in a manual halt at this stage.

We should update the xhci driver state when there's an HCE and the
controller is halted, but we don't need to manually clear the Run/Stop
bit again.
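
To make the distinction concrete, here is a minimal userspace sketch, not
kernel code. The bit positions follow the xHCI USBCMD/USBSTS layout, but
struct fake_xhc, fake_handle_hce() and fake_hce_demo() are hypothetical
names made up for illustration, and the register behaviour is only
modelled. It assumes the handler masks the interrupt and updates the
driver's software state on HCE, and writes Run/Stop only when HCHalted is
not already set:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Bit positions as in the xHCI USBCMD and USBSTS registers. */
#define CMD_RUN   (1u << 0)   /* Run/Stop */
#define CMD_EIE   (1u << 2)   /* Interrupter Enable (INTE) */
#define STS_HALT  (1u << 0)   /* HCHalted */
#define STS_HCE   (1u << 12)  /* Host Controller Error */

/* Hypothetical stand-in for driver state; NOT the real struct xhci_hcd. */
struct fake_xhc {
	uint32_t usbcmd;          /* modelled command register */
	uint32_t usbsts;          /* modelled status register */
	bool sw_halted;           /* driver's own view of the controller */
	unsigned run_stop_writes; /* how often we cleared Run/Stop by hand */
};

/* Returns true if HCE was set and has been handled. */
static bool fake_handle_hce(struct fake_xhc *xhc)
{
	if (!(xhc->usbsts & STS_HCE))
		return false;

	/* Mask the interrupt so it cannot keep firing. */
	xhc->usbcmd &= ~CMD_EIE;

	/* Clear Run/Stop only if the controller has not halted by itself. */
	if (!(xhc->usbsts & STS_HALT)) {
		xhc->usbcmd &= ~CMD_RUN;
		xhc->run_stop_writes++;
	}

	/* Either way, record the halt in the driver's software state. */
	xhc->sw_halted = true;
	return true;
}

/* Usage: a controller that already reports HCHalted together with HCE. */
static void fake_hce_demo(void)
{
	struct fake_xhc xhc = {
		.usbcmd = CMD_RUN | CMD_EIE,
		.usbsts = STS_HCE | STS_HALT,
	};

	assert(fake_handle_hce(&xhc));
	assert(xhc.sw_halted);
	assert(xhc.run_stop_writes == 0);
}
```

A controller that keeps interrupting without setting HCHalted would take
the other branch in this model, which is the case the patch is aimed at.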

>
> > Even if we don't have the full HCE recovery implemented, did we try to
> > just do HCRST, which is the first step of the recovery?
>
> Specs state that HCRST might re-trigger the HCE if it's due to a "hard" fault,

That's only after we re-initialize the controller as noted in the spec,
not immediately after HCRST.
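
The ordering can be sketched as a small userspace model. Everything here
is hypothetical (struct fake_hc, the fake_* helpers, and the limit of 3
attempts are invented for illustration and are not taken from the driver
or the spec). It assumes a "hard" fault re-asserts HCE only once the
controller has been re-initialized, so the check comes after re-init
rather than right after the reset, and the attempts are bounded:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of a host controller; not real hardware behaviour. */
struct fake_hc {
	bool hard_fault;   /* fault that comes back once re-initialized */
	bool hce;          /* modelled Host Controller Error flag */
	bool initialized;  /* rings, DCBAA etc. programmed again */
	unsigned resets;   /* number of HCRSTs issued */
};

/* Model of HCRST: controller state, including HCE, is cleared. */
static void fake_hcrst(struct fake_hc *hc)
{
	hc->hce = false;
	hc->initialized = false;
	hc->resets++;
}

/* Model of re-initialization; a hard fault shows up again only here. */
static void fake_reinit(struct fake_hc *hc)
{
	hc->initialized = true;
	if (hc->hard_fault)
		hc->hce = true;
}

#define FAKE_MAX_RECOVERY 3 /* arbitrary bound, an assumption */

/* Reset, re-init, then check HCE; give up after a bounded number of tries. */
static bool fake_recover(struct fake_hc *hc)
{
	for (unsigned i = 0; i < FAKE_MAX_RECOVERY; i++) {
		fake_hcrst(hc);
		fake_reinit(hc);
		if (!hc->hce)
			return true;
	}
	return false;
}

/* Usage: a transient fault recovers after one reset, a hard fault never. */
static void fake_recover_demo(void)
{
	struct fake_hc soft = { .hce = true };
	struct fake_hc hard = { .hard_fault = true, .hce = true };

	assert(fake_recover(&soft) && soft.resets == 1);
	assert(!fake_recover(&hard) && hard.resets == FAKE_MAX_RECOVERY);
}
```
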

> and driver needs to take action to prevent a HCE - HCRST recovery loop.
>
> HCRST will clear all registers, so we need to reinitialize everything here,
> write back addresses of event rings, command rings, DCBAA, scratchpads,
> dequeue pointers etc.
>
> I support taking this fix to prevent the interrupt storm, an issue seen in real
> life. And then solve proper recovery later.

That's fair to me.

>
> Niklas is actually working on decoupling memory allocation and xHC register
> initialization which will help future HCE recovery work.
>

That's great! I'm looking forward to that.

Thanks,
Thinh