Re: [PATCH] xhci: print warning when HCE was set

From: liulongfang
Date: Fri Dec 09 2022 - 01:14:52 EST


On 2022/10/14 15:56, Mathias Nyman wrote:
> On 14.10.2022 6.12, liulongfang wrote:
>> On 2022/9/26 15:58, Mathias Nyman wrote:
>>> On 24.9.2022 5.35, liulongfang wrote:
>>>> On 2022/9/22 21:01, Mathias Nyman wrote:
>>>>> Hi
>>>>>
>>>>> On 15.9.2022 4.11, Longfang Liu wrote:
>>>>>> When HCE(Host Controller Error) is set, it means that the xhci hardware
>>>>>> controller has an error at this time, but the current xhci driver
>>>>>> software does not log this event.
>>>>>>
>>>>>> By adding an HCE event detection in the xhci interrupt processing
>>>>>> interface, a warning log is output to the system, which is convenient
>>>>>> for system device status tracking.
>>>>>>
>>>>>
>>>>> xHC should cease all activity when it sets HCE, and is probably not
>>>>> generating interrupts anymore.
>>>>>
>>>>> Would probably be more useful to check for HCE at timeouts than in the
>>>>> interrupt handler.
>>>>>
>>>>
>>>> Which function of the driver code is this timeout in?
>>>
>>> xhci_handle_command_timeout() will usually trigger at some point.
>>>
>>
>> Because this HCE error is reported in the form of an interrupt signal, it is more
>> concise to put it in xhci_irq() than in xhci_handle_command_timeout().
>>
>
> Patch was added to queue after you reported your xHC hardware triggers interrupts when HCE is set.
> I'll send it forward after 6.1-rc1
>

In our test build we added a log message to xhci_irq(). In the test case that
triggers HCE, the interrupt is raised and the event shows up in the log:

{53}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 0
{53}[Hardware Error]: event severity: recoverable
{53}[Hardware Error]: Error 0, type: recoverable
{53}[Hardware Error]: section type: unknown, c8b328a8-9917-4af6-9a13-2e08ab2e7586
{53}[Hardware Error]: section length: 0x48
{53}[Hardware Error]: 00000000: 0000186b 00000201 001a0001 00000000 k...............
{53}[Hardware Error]: 00000010: 00000000 00000000 00000000 00000028 ............(...
{53}[Hardware Error]: 00000020: 00000000 00000000 00000000 00000000 ................
{53}[Hardware Error]: 00000030: 00000000 00000000 00000000 00000000 ................
{53}[Hardware Error]: 00000040: 00000001 00000000 ........
xhci_hcd 0000:30:01.0: xHCI host not responding to stop endpoint command.
xhci_hcd 0000:30:01.0: USBSTS: PCD HCE
xhci_hcd 0000:30:01.0: xHCI host controller not responding, assume dead
xhci_hcd 0000:30:01.0: HC died; cleaning up
usb usb1-port1: couldn't allocate usb_device
rmmod xhci-pci
xhci_hcd 0000:30:01.0: remove, state 4
usb usb2: USB disconnect, device number 1
xhci_hcd 0000:30:01.0: USB bus 2 deregistered
xhci_hcd 0000:30:01.0: remove, state 1
usb usb1: USB disconnect, device number 1
xhci_hcd 0000:30:01.0: USB bus 1 deregistered
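
For anyone reading the "USBSTS: PCD HCE" line above, here is a minimal
stand-alone sketch of how such a string can be derived from the raw register
value. The bit positions follow the USBSTS layout in the xHCI specification.
decode_usbsts() and the flag table are illustrative names for this mail, not
the driver's code, and only a subset of the status bits is listed.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* USBSTS bit positions per the xHCI specification (subset only). */
#define STS_HALT  (1u << 0)   /* HCHalted */
#define STS_FATAL (1u << 2)   /* HSE: Host System Error */
#define STS_EINT  (1u << 3)   /* EINT: Event Interrupt */
#define STS_PORT  (1u << 4)   /* PCD: Port Change Detect */
#define STS_HCE   (1u << 12)  /* HCE: Host Controller Error */

/* Hypothetical helper: returns nonzero when HCE is set in a USBSTS value. */
static int usbsts_has_hce(uint32_t sts)
{
    return (sts & STS_HCE) != 0;
}

/*
 * Hypothetical helper: write the names of the set flags into buf,
 * separated by spaces. buf is always NUL-terminated when len >= 1.
 */
static void decode_usbsts(uint32_t sts, char *buf, size_t len)
{
    static const struct {
        uint32_t bit;
        const char *name;
    } flags[] = {
        { STS_HALT,  "HCHalted" },
        { STS_FATAL, "HSE" },
        { STS_EINT,  "EINT" },
        { STS_PORT,  "PCD" },
        { STS_HCE,   "HCE" },
    };
    size_t i;

    if (len == 0)
        return;
    buf[0] = '\0';

    for (i = 0; i < sizeof(flags) / sizeof(flags[0]); i++) {
        if (!(sts & flags[i].bit))
            continue;
        if (buf[0] != '\0')
            strncat(buf, " ", len - strlen(buf) - 1);
        strncat(buf, flags[i].name, len - strlen(buf) - 1);
    }
}
```

With sts = STS_PORT | STS_HCE (0x1010), the buffer ends up as "PCD HCE",
which matches the status line in the log.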

Thanks,
Longfang.

> xHCI specification still indicates HCE might not trigger interrupts:
>
> Section 4.24.1 - Internal Errors
> ...
> "Software should implement an algorithm for checking the HCE flag if the xHC is
> not responding."
>
> Thanks
> -Mathias
> .
>
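
For hardware that does not raise an interrupt on HCE, the check suggested by
section 4.24.1 could look roughly like the sketch below, run from a timeout
path once the xHC stops responding. This is only an illustration under
assumptions: classify_on_timeout() and enum hc_state are made-up names, the
caller is assumed to have already read USBSTS, and treating an all-ones read
as "device gone" is a common convention for removed PCI devices rather than
something stated in this thread.

```c
#include <stdint.h>

#define STS_HCE (1u << 12)  /* USBSTS Host Controller Error bit */

/* Hypothetical result of inspecting USBSTS after a timeout. */
enum hc_state {
    HC_OK,      /* no controller error flagged */
    HC_ERROR,   /* HCE set: controller has ceased activity */
    HC_GONE     /* register read returned all ones */
};

/*
 * Hypothetical helper: classify a USBSTS value that was read after the
 * controller failed to respond within the expected time.
 */
static enum hc_state classify_on_timeout(uint32_t usbsts)
{
    /* Check all-ones first, since that value also has bit 12 set. */
    if (usbsts == 0xffffffffu)
        return HC_GONE;
    if (usbsts & STS_HCE)
        return HC_ERROR;
    return HC_OK;
}
```

The all-ones test comes first on purpose, because 0xffffffff would otherwise
be misread as HCE being set.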