[PATCH RFT RFC] usb: xhci: Kill hosts with HCE or HSE on command timeout

From: Michal Pecio

Date: Sat May 02 2026 - 05:46:58 EST


On Fri, 1 May 2026 11:09:27 -0300, Desnes Nunes wrote:
> On Thu, Apr 30, 2026 at 6:55 PM Michal Pecio <michal.pecio@xxxxxxxxx> wrote:
> > When xhci_handle_command_timeout() logs USBSTS, does it help to add:
> >
> >	if (usbsts & STS_FATAL) {
> >		xhci_halt(xhci);
> >		xhci_hc_died(xhci);
> >		goto time_out_completed;
> >	}
> > It may not be a perfect solution (race conditions?) but it could hint
> > that we are on the right track, if it works.
>
> This panicked the system as soon as I hit `echo c > /proc/sysrq-trigger`:
>
> [ 141.683476] sysrq: Trigger a crash
> [ 141.686970] Kernel panic - not syncing: sysrq triggered crash

Damn, that sucks. Any chance it's not a problem with my proposed change
but some sort of issue on your side?

Anyway, I think the patch below might cover it. It works for me in the
sense that the bus does get killed, without ill effect. I tested on
VL805 where HSE is easily triggered by disabling XHCI_TRB_OVERFETCH.
However, the patch isn't necessary here - VL805 doesn't clear CRCR.CRR
on HSE, so the normal abort path is taken, the abort times out, and
hc_died() is called anyway.

Can somebody serious confirm if this issue actually exists in the first
place, and whether the patch solves it?

Hello Red Hat, anyone alive there? Or only stochastic parrots?

Mathias, do you remember what the point of the "Command timeout on
stopped ring" branch is? Can it happen in any case other than a dead chip?

I also wonder if it wouldn't make sense to just hc_died() on every
command timeout except Address Device. We rely on Stop Endpoint
timeouts to kill chips which go unresponsive without setting HCE/HSE,
because sooner or later somebody loses patience and unlinks a URB,
but this story (real or hallucinated, but plausible) shows that this
may not help when there are no devices created yet.
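
To make the two policies concrete, here is a rough standalone sketch of
the decision only, not driver code. The STS_FATAL and STS_HCE bit
positions match the xHCI USBSTS layout; everything prefixed sketch_ is
a hypothetical name made up for illustration, and the real handler
would still have to deal with locking and the abort path.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* USBSTS bits per the xHCI spec (same values as in xhci.h) */
#define STS_FATAL	(1u << 2)	/* HSE: Host System Error */
#define STS_HCE		(1u << 12)	/* HCE: Host Controller Error */

/* Hypothetical command kinds, only for this sketch */
enum sketch_cmd { SKETCH_CMD_ADDRESS_DEVICE, SKETCH_CMD_OTHER };

/*
 * Narrow policy, as in the patch below: the chip is considered dead if
 * the CRCR read returns all ones (device gone) or USBSTS has HSE/HCE.
 */
bool sketch_host_is_dead(uint64_t hw_ring_state, uint32_t usbsts)
{
	return hw_ring_state == ~(uint64_t)0 ||
	       (usbsts & (STS_FATAL | STS_HCE));
}

/*
 * Broader policy floated above (an assumption, not implemented anywhere):
 * every command timeout kills the host, except Address Device.
 */
bool sketch_kill_on_timeout(enum sketch_cmd cmd, uint64_t hw_ring_state,
			    uint32_t usbsts)
{
	if (sketch_host_is_dead(hw_ring_state, usbsts))
		return true;
	return cmd != SKETCH_CMD_ADDRESS_DEVICE;
}

/* A few example inputs; 0x8 stands for a CRCR value with only CRR set */
void sketch_examples(void)
{
	assert(sketch_host_is_dead(0x8, STS_FATAL));
	assert(!sketch_host_is_dead(0x8, 0));
	assert(!sketch_kill_on_timeout(SKETCH_CMD_ADDRESS_DEVICE, 0x8, 0));
	assert(sketch_kill_on_timeout(SKETCH_CMD_OTHER, 0x8, 0));
}
```

The only difference between the two is the last return: with the broad
policy a healthy-looking chip still gets killed on any timeout that is
not Address Device.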

---

diff --git a/drivers/usb/host/xhci-ring.c b/drivers/usb/host/xhci-ring.c
index e5823650850a..3041deb67b57 100644
--- a/drivers/usb/host/xhci-ring.c
+++ b/drivers/usb/host/xhci-ring.c
@@ -1761,13 +1761,15 @@ void xhci_handle_command_timeout(struct work_struct *work)
 	/* mark this command to be cancelled */
 	xhci->current_cmd->status = COMP_COMMAND_ABORTED;

-	/* Make sure command ring is running before aborting it */
+	/* check for crashed or disconnected chip */
 	hw_ring_state = xhci_read_64(xhci, &xhci->op_regs->cmd_ring);
-	if (hw_ring_state == ~(u64)0) {
+	if (hw_ring_state == ~(u64)0 || usbsts & (STS_FATAL | STS_HCE)) {
+		xhci_info(xhci, "kill the damn thing\n");
 		xhci_hc_died(xhci);
 		goto time_out_completed;
 	}

+	/* Make sure command ring is running before aborting it */
 	if ((xhci->cmd_ring_state & CMD_RING_STATE_RUNNING) &&
 	    (hw_ring_state & CMD_RING_RUNNING)) {
 		/* Prevent new doorbell, and start command abort */