RE: [PATCH] usb: xhci: Fix incomplete PM resume operation due to XHCI command timeout

From: Rajesh Bhagat
Date: Mon Mar 21 2016 - 00:33:41 EST




> -----Original Message-----
> From: Mathias Nyman [mailto:mathias.nyman@xxxxxxxxxxxxxxx]
> Sent: Friday, March 18, 2016 4:51 PM
> To: Rajesh Bhagat <rajesh.bhagat@xxxxxxx>; linux-usb@xxxxxxxxxxxxxxx; linux-
> kernel@xxxxxxxxxxxxxxx
> Cc: gregkh@xxxxxxxxxxxxxxxxxxx; mathias.nyman@xxxxxxxxx; Sriram Dash
> <sriram.dash@xxxxxxx>
> Subject: Re: [PATCH] usb: xhci: Fix incomplete PM resume operation due to XHCI
> command timeout
>
> On 18.03.2016 09:01, Rajesh Bhagat wrote:
> > We are facing issue while performing the system resume operation from
> > STR where XHCI is going to indefinite hang/sleep state due to
> > wait_for_completion API called in function xhci_alloc_dev for command
> > TRB_ENABLE_SLOT which never completes.
> >
> > Now, xhci_handle_command_timeout function is called and prints
> > "Command timeout" message but never calls complete API for above
> > TRB_ENABLE_SLOT command as xhci_abort_cmd_ring is successful.
> >
> > Solution to above problem is:
> > 1. calling xhci_cleanup_command_queue API even if xhci_abort_cmd_ring
> > is successful or not.
> > 2. checking the status of reset_device in usb core code.
>
>
> Hi
>
> I think clearing the whole command ring is a bit too much in this case.
> It may cause issues for all attached devices when one command times out.
>


Hi Mathias,

I understand your point, but I would like to understand how the completion handler gets called
when a command times out and xhci_abort_cmd_ring succeeds. In that case, any code waiting on
the completion would wait forever.


> We need to look in more detail why we fail to call completion for that one aborted
> command.
>

I checked the code below. Please correct me if I am wrong.

Code waiting in wait_for_completion:
int xhci_alloc_dev(struct usb_hcd *hcd, struct usb_device *udev)
{
	...
	ret = xhci_queue_slot_control(xhci, command, TRB_ENABLE_SLOT, 0);
	...

	wait_for_completion(command->completion); /* <=== waiting for command to complete */
	...
}


Code calling the completion handler:
1. handle_cmd_completion -> xhci_complete_del_and_free_cmd
2. xhci_handle_command_timeout -> xhci_abort_cmd_ring(failure) -> xhci_cleanup_command_queue -> xhci_complete_del_and_free_cmd

In our case the command times out, so we hit case #2, but xhci_abort_cmd_ring succeeds, and
that path does not call complete.
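
To make the two paths concrete, here is a small user-space model of the flow described above.
It is only a sketch and not driver code: every sketch_* name is hypothetical, and it assumes,
as the two call chains listed above suggest, that once xhci_abort_cmd_ring succeeds the only
remaining way to complete the command is a later command completion event (path 1).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* User-space model only; none of these names exist in the xhci driver. */
struct sketch_cmd {
	bool completed;	/* stands in for complete(command->completion) */
	bool queued;	/* command is still on the pending list */
};

/* Models the effect of xhci_complete_del_and_free_cmd(): wake waiter, dequeue. */
static void sketch_complete_cmd(struct sketch_cmd *cmd)
{
	cmd->completed = true;
	cmd->queued = false;
}

/* Models xhci_cleanup_command_queue(): complete every pending command. */
static void sketch_cleanup_queue(struct sketch_cmd *cmds, size_t n)
{
	for (size_t i = 0; i < n; i++)
		if (cmds[i].queued)
			sketch_complete_cmd(&cmds[i]);
}

/*
 * Models the timeout handler as described in this mail.
 * abort_ok:      the ring abort reported success
 * event_arrives: the controller later delivers a completion event
 *                for the aborted command (path 1)
 */
static void sketch_handle_timeout(struct sketch_cmd *cmds, size_t n, size_t cur,
				  bool abort_ok, bool event_arrives)
{
	assert(cur < n);

	if (!abort_ok) {
		/* path 2: abort failed, so the whole queue is cleaned up */
		sketch_cleanup_queue(cmds, n);
		return;
	}

	if (event_arrives)
		sketch_complete_cmd(&cmds[cur]);	/* path 1 */

	/* otherwise nobody completes cmds[cur] and its waiter sleeps forever */
}

/* Returns true if the waiter on the timed-out command would be woken. */
bool sketch_waiter_wakes(bool abort_ok, bool event_arrives)
{
	struct sketch_cmd cmds[2] = { { false, true }, { false, true } };

	sketch_handle_timeout(cmds, 2, 0, abort_ok, event_arrives);
	return cmds[0].completed;
}
```

In this model only sketch_waiter_wakes(true, false) returns false: the abort succeeds but no
completion event follows, which corresponds to the hang we see on resume.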


> The bigger question is why the timeout happens in the first place?
>

We are doing a suspend/resume operation, so it might be a controller issue :(. IMO software
should not hang/stop if the hardware is not behaving correctly.

> What kernel version, and what xhci vendor was this triggered on?
>

We are using the 4.1.8 kernel.

> It's possible that the timeout is related either to the locking issue found by Chris
> Bainbridge:
> http://marc.info/?l=linux-usb&m=145493945408601&w=2
>
> or the resume issues in this thread, (see full thread)
> http://marc.info/?l=linux-usb&m=145477850706552&w=2
>
> Does any of those proposed solutions fix the command timeout for you?
>

I will check the above patches and share the status.

> -Mathias