Re: [PATCH] usb: xhci: bound wait command completion to avoid kdump deadlock
From: Michal Pecio
Date: Thu Apr 30 2026 - 17:55:07 EST
On Thu, 30 Apr 2026 14:27:59 -0300, Desnes Nunes wrote:
> As for how I saw HSE, while testing the patch before submission, since
> I already had the xhci lock, I just added a read of the usbsts before
> calling xhci_hc_died(xhci):
>
> ...
> - wait_for_completion(command->completion);
> - slot_id = command->slot_id;
> + if (!wait_for_completion_timeout(command->completion,
> + msecs_to_jiffies(2 *
> command->timeout_ms))) {
> + spin_lock_irqsave(&xhci->lock, tflags);
> + usbsts = readl(&xhci->op_regs->status);
> + xhci_err(xhci,
> + "TRB_ENABLE_SLOT: no command completion after %lums, USBSTS:%s\n",
> + 2 * command->timeout_ms,
> + xhci_decode_usbsts(ststr, usbsts));
> + xhci_hc_died(xhci);
> + spin_unlock_irqrestore(&xhci->lock, tflags);
> + }
> ...
>
> This debug version of the patch printed:
>
> [ 17.481330] xhci_hcd 0000:80:14.0: TRB_ENABLE_SLOT: no command
> completion after 10000ms, USBSTS: 0x00000015 HCHalted HSE PCD
OK, so this chip is busted at that point. But it might still be better
to improve xhci_handle_command_timeout() to deal with this and complete
the command, instead of patching here and in other similar places.
> Actually, from the beginning of all my debugging I already had
> `usbcore.dyndbg=+p xhci_hcd.dyndbg=+p xhci_pci.dyndbg=+p` on the
> kernel cmdline, as well as on the crashkernel's
> KDUMP_COMMANDLINE_APPEND at /etc/sysconfig/kdump.
>
> On crashkernel's kexec-dmesg of the unpatched kernel I see multiple
> doorbell rings stating the HSE:
>
> ...
> [Thu Apr 30 12:28:22 2026] xhci_hcd 0000:80:14.0: Command timeout,
> USBSTS: 0x00000015 HCHalted HSE PCD
> [Thu Apr 30 12:28:22 2026] xhci_hcd 0000:80:14.0: Command timeout on
> stopped ring
> [Thu Apr 30 12:28:22 2026] xhci_hcd 0000:80:14.0: Turn aborted command
> 000000005921b827 to no-op
> [Thu Apr 30 12:28:22 2026] xhci_hcd 0000:80:14.0: // Ding dong!
> ...
Hmm, the "Command timeout on stopped ring" case doesn't obviously lead
to any immediate command completion, and ringing the command doorbell
under HSE won't achieve any progress. Maybe that's the bug.
Could you post the full crash kernel dmesg up to that point? I'm not
sure how it got to this place.
When xhci_handle_command_timeout() logs USBSTS, does it help to add:
	if (usbsts & STS_FATAL) {
		xhci_halt(xhci);
		xhci_hc_died(xhci);
		goto time_out_completed;
	}
It may not be a perfect solution (race conditions?), but if it works,
it would hint that we are on the right track.
Regards,
Michal