Re: [PATCH RFT RFC] usb: xhci: Kill hosts with HCE or HSE on command timeout

From: Desnes Nunes

Date: Sat May 02 2026 - 07:38:56 EST

Hello again Michal,

On Sat, May 2, 2026 at 6:47 AM Michal Pecio <michal.pecio@xxxxxxxxx> wrote:
>
> On Fri, 1 May 2026 11:09:27 -0300, Desnes Nunes wrote:
> > On Thu, Apr 30, 2026 at 6:55 PM Michal Pecio <michal.pecio@xxxxxxxxx> wrote:
> > > When xhci_handle_command_timeout() logs USBSTS, does it help to add:
> > >
> > >	if (usbsts & STS_FATAL) {
> > >		xhci_halt(xhci);
> > >		xhci_hc_died(xhci);
> > >		goto time_out_completed;
> > >	}
> > >
> > > It may not be perfect solution (race conditions?) but it could hint
> > > that we are on the right track, if it works.
> >
> > This panicked the system as soon as I hit `echo c > /proc/sysrq-trigger`:
> >
> > [ 141.683476] sysrq: Trigger a crash
> > [ 141.686970] Kernel panic - not syncing: sysrq triggered crash
>
> Damn, that sucks. Any chance it's not a problem with my proposed change
> but some sort of issue on your side?

Indeed - bummer.

I don't think so, since I'm using the same system and procedure that I
used to test the wait_for_completion_timeout() patch:
https://lore.kernel.org/linux-usb/20260430014817.2006885-1-desnesn@xxxxxxxxxx/T/#ma6ce987cea510349082831bbb822136e5c5c57da

> Anyway, I think the patch below might cover it. It works for me in the
> sense that the bus does get killed, without ill effect. I tested on
> VL805 where HSE is easily triggered by disabling XHCI_TRB_OVERFETCH.
> However, the patch isn't necessary here - VL805 doesn't clear CRCR.CRR
> on HSE, so normal abort path is taken and times out, then hc_died().
...
> I also wonder if it wouldn't make sense to just hc_died() on every
> command timeout except Address Device. We rely on Stop Endpoint
> timeouts to kill chips which go unresponsive without setting HCE/HSE,
> because sooner or later somebody loses patience and unlinks an URB,
> but this story (real or hallucinated, but plausible) shows that this
> may not help when there are no devices created yet.
>
> ---
>
> diff --git a/drivers/usb/host/xhci-ring.c b/drivers/usb/host/xhci-ring.c
> index e5823650850a..3041deb67b57 100644
> --- a/drivers/usb/host/xhci-ring.c
> +++ b/drivers/usb/host/xhci-ring.c
> @@ -1761,13 +1761,15 @@ void xhci_handle_command_timeout(struct work_struct *work)
>  	/* mark this command to be cancelled */
>  	xhci->current_cmd->status = COMP_COMMAND_ABORTED;
>
> -	/* Make sure command ring is running before aborting it */
> +	/* check for crashed or disconnected chip */
>  	hw_ring_state = xhci_read_64(xhci, &xhci->op_regs->cmd_ring);
> -	if (hw_ring_state == ~(u64)0) {
> +	if (hw_ring_state == ~(u64)0 || usbsts & (STS_FATAL | STS_HCE)) {
> +		xhci_info(xhci, "kill the damn thing\n");
>  		xhci_hc_died(xhci);
>  		goto time_out_completed;
>  	}
>
> +	/* Make sure command ring is running before aborting it */
>  	if ((xhci->cmd_ring_state & CMD_RING_STATE_RUNNING) &&
>  	    (hw_ring_state & CMD_RING_RUNNING)) {
>  		/* Prevent new doorbell, and start command abort */

FYI, sorry to be the bearer of bad news, but this also panics the
system as soon as I run `echo c > /proc/sysrq-trigger`. Kdump doesn't
run and no vmcore is produced:

==========
Kernel 7.0.0-michal.pecio.v2 on x86_64

FQDN login:
[ 1063.290020] sysrq: Trigger a crash
[ 1063.293504] Kernel panic - not syncing: sysrq triggered crash
[ 1063.299348] CPU: 12 UID: 0 PID: 5483 Comm: bash Not tainted 7.0.0-michal.pecio.v2 #1 PREEMPT(full)
[ 1063.308548] Hardware name: Intel Corporation Arrow Lake Client Platform/MTL-S UDIMM 1DPC EVCRB, BIOS MTLSFWI1.R00.5385.D80.2509230731 09/23/2025
[ 1063.321718] Call Trace:
[ 1063.324210] <TASK>
[ 1063.326347] dump_stack_lvl+0x4e/0x70
[ 1063.330084] vpanic+0x20a/0x410
[ 1063.333281] panic+0x6b/0x70
[ 1063.336210] sysrq_handle_crash+0x1a/0x20
[ 1063.340290] __handle_sysrq.cold+0x99/0xde
[ 1063.344456] write_sysrq_trigger+0x59/0xb0
[ 1063.348624] proc_reg_write+0x5a/0xa0
[ 1063.352348] vfs_write+0xcf/0x450
[ 1063.355721] ksys_write+0x6b/0xe0
[ 1063.359092] do_syscall_64+0x11b/0x6a0
[ 1063.362906] ? do_user_addr_fault+0x206/0x680
[ 1063.367337] entry_SYSCALL_64_after_hwframe+0x76/0x7e
[ 1063.372474] RIP: 0033:0x7fc8e8a9a544
[ 1063.376112] Code: 89 02 b8 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 40 00 f3 0f 1e fa 80 3d a5 cb 0d 00 00 74 13 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 54 c3 0f 1f 00 48 83 ec 28 48 89 54 24 18 48
[ 1063.395199] RSP: 002b:00007ffcb7ccb328 EFLAGS: 00000202 ORIG_RAX: 0000000000000001
[ 1063.402898] RAX: ffffffffffffffda RBX: 00007fc8e8b705c0 RCX: 00007fc8e8a9a544
[ 1063.410153] RDX: 0000000000000002 RSI: 000055ae272da170 RDI: 0000000000000001
[ 1063.417409] RBP: 0000000000000002 R08: 0000000000000073 R09: 00000000ffffffff
[ 1063.424664] R10: 0000000000000000 R11: 0000000000000202 R12: 0000000000000002
[ 1063.431919] R13: 000055ae272da170 R14: 0000000000000002 R15: 00007fc8e8b6df00
[ 1063.439177] </TASK>
[ 1063.441691] Kernel Offset: 0x4200000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
[ 1063.452571] ---[ end Kernel panic - not syncing: sysrq triggered crash ]---
============
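
For reference, here is a standalone sketch of the condition the hunk above
adds, in case it helps anyone reason about when the timeout handler gives up
on the host. The helper name is hypothetical and not part of the driver; the
bit values are the STS_FATAL and STS_HCE definitions from
drivers/usb/host/xhci.h. Note that `&` binds tighter than `||`, so the
condition in the patch parses as intended even without extra parentheses.

```c
#include <stdbool.h>
#include <stdint.h>

/* USBSTS bits, as defined in drivers/usb/host/xhci.h */
#define STS_FATAL (1u << 2)   /* HSE: Host System Error */
#define STS_HCE   (1u << 12)  /* HCE: Host Controller Error */

/*
 * Hypothetical helper (illustration only, not in the driver): returns true
 * when the command timeout path in the patch would call xhci_hc_died().
 *
 * hw_ring_state: value read from the CRCR register
 * usbsts:        value read from the USBSTS register
 */
static bool cmd_timeout_should_kill_host(uint64_t hw_ring_state, uint32_t usbsts)
{
	/* All ones from an MMIO read means the device is gone or unreachable. */
	if (hw_ring_state == ~(uint64_t)0)
		return true;

	/* Either HSE or HCE set means the controller reported a fatal error. */
	return (usbsts & (STS_FATAL | STS_HCE)) != 0;
}
```

With this sketch, a USBSTS value of 0x4 (HSE) or 0x1000 (HCE) selects the
hc_died() path, while 0x1 (HCHalted alone) falls through to the normal
abort handling.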

Best Regards

Desnes Nunes