Re: [PATCH] usb: xhci: Fix incomplete PM resume operation due to XHCI commmand timeout

From: Mathias Nyman
Date: Tue Mar 29 2016 - 13:14:57 EST


On 28.03.2016 09:13, Rajesh Bhagat wrote:


-----Original Message-----
From: Mathias Nyman [mailto:mathias.nyman@xxxxxxxxxxxxxxx]
Sent: Wednesday, March 23, 2016 7:52 PM
To: Rajesh Bhagat <rajesh.bhagat@xxxxxxx>
Cc: gregkh@xxxxxxxxxxxxxxxxxxx; linux-usb@xxxxxxxxxxxxxxx; linux-
kernel@xxxxxxxxxxxxxxx; Sriram Dash <sriram.dash@xxxxxxx>
Subject: Re: [PATCH] usb: xhci: Fix incomplete PM resume operation due to XHCI
commmand timeout

On 23.03.2016 05:53, Rajesh Bhagat wrote:

IMO, The assumption that "xhci_abort_cmd_ring would always generate
an event and handle_cmd_completion would be called" will not be
always be true if HW
is in bad state.

Please share your opinion.


writing the CA (command abort) bit in CRCR (command ring control
register) will stop the command ring, and CRR (command ring running) will be set
to 0 by xHC.
xhci_abort_cmd_ring() polls this bit up to 5 seconds.
If it's not 0 then the driver considers the command abort as failed.

The scenario you're thinking of is that xHC would still react to CA
bit set, it would stop the command ring and set CRR 0, but not send a command
completion event.

Have you tried adding some debug to handle_cmd_completion() and see
if you receive any event after command abortion?


Yes. We have added debug prints at first line of
handle_cmd_completion, and we are not getting those prints. The last
print messages that we get are as below from xhci_alloc_dev while
resume
operation:

xhci-hcd xhci-hcd.0.auto: Command timeout xhci-hcd xhci-hcd.0.auto:
Abort command ring

May be somehow, USB controller is in bad state and not responding to the
commands.

Please suggest how XHCI driver can handle such situations.


Restart the command timeout timer when writing the command abort bit.
If we get theIf we get the abort event the timer is deleted.

Otherwise if the timout triggers a second time we end up calling
xhci_handle_command_timeout() with a stopped ring, This will call
xhci_handle_stopped_cmd_ring(), turn the aborted command to no-op, restart the
command ring, and finally when the no-op completes it should call the missing
completion.

If command ring doesn't start then additional code could be added to
xhci_handle_command_timeout() that clears the command ring if it is called a second
time (=current command is already in abort state and command ring is stopped when
entering xhci_handle_command_timeout)

There might be some details missing, I'm not able to test any of this, but try
something like this:

diff --git a/drivers/usb/host/xhci-ring.c b/drivers/usb/host/xhci-ring.c index
3e1d24c..576819e 100644
--- a/drivers/usb/host/xhci-ring.c
+++ b/drivers/usb/host/xhci-ring.c
@@ -319,7 +319,10 @@ static int xhci_abort_cmd_ring(struct xhci_hcd *xhci)
xhci_halt(xhci);
return -ESHUTDOWN;
}
-
+ /* writing the CMD_RING_ABORT bit should create a command completion
+ * event, add a command completion timeout for it as well
+ */
+ mod_timer(&xhci->cmd_timer, jiffies + XHCI_CMD_DEFAULT_TIMEOUT);
return 0;
}

Hello Mathias,

Thanks for the patch.

After application of above patch, I'm getting following prints constantly:

xhci-hcd xhci-hcd.0.auto: Command timeout
xhci-hcd xhci-hcd.0.auto: Abort command ring
xhci-hcd xhci-hcd.0.auto: Command timeout on stopped ring
xhci-hcd xhci-hcd.0.auto: Turn aborted command be56e000 to no-op
xhci-hcd xhci-hcd.0.auto: // Ding dong!
...
xhci-hcd xhci-hcd.0.auto: Command timeout
xhci-hcd xhci-hcd.0.auto: Abort command ring
xhci-hcd xhci-hcd.0.auto: Command timeout on stopped ring
xhci-hcd xhci-hcd.0.auto: Turn aborted command be56e000 to no-op
xhci-hcd xhci-hcd.0.auto: // Ding dong!

As expected, xhci_handle_command_timeout is called again and next time ring state
is __not__ CMD_RING_STATE_RUNNING, Hence xhci_handle_stopped_cmd_ring is called
which turn all the aborted commands to no-ops and again makes the ring state as
CMD_RING_STATE_RUNNING, and rings the door bell.

But again in this case, no response from USB controller and xhci_alloc_dev is still waiting for
wait_for_completion.

Does rescheduling the xhci->cmd_timer ensures command completion will be called. IMO, it is still
dependent on HW response.


It's still dependent on hw starting the command ring, which apparently fails

So additional code is needed that will treat a second or third timeout as
a failed abort attempt, i.e. if a command times out on a stopped ring and the
command status is already aborted, then we cleanup the command ring and kill xhci.

So in addition to the previous added timeout something like this is needed:
(Again, uncompiled, untested semi pseudocode)


diff --git a/drivers/usb/host/xhci-ring.c b/drivers/usb/host/xhci-ring.c
index 7cf6621..d8518e8f 100644
--- a/drivers/usb/host/xhci-ring.c
+++ b/drivers/usb/host/xhci-ring.c
@@ -1253,6 +1253,7 @@ void xhci_handle_command_timeout(unsigned long data)
int ret;
unsigned long flags;
u64 hw_ring_state;
+ u32 old_status;
struct xhci_command *cur_cmd = NULL;
xhci = (struct xhci_hcd *) data;
@@ -1260,6 +1261,7 @@ void xhci_handle_command_timeout(unsigned long data)
spin_lock_irqsave(&xhci->lock, flags);
if (xhci->current_cmd) {
cur_cmd = xhci->current_cmd;
+ old_status = cur_cmd->status;
cur_cmd->status = COMP_CMD_ABORT;
}
@@ -1282,7 +1284,14 @@ void xhci_handle_command_timeout(unsigned long data)
}
/* command timeout on stopped ring, ring can't be aborted */
xhci_dbg(xhci, "Command timeout on stopped ring\n");
- xhci_handle_stopped_cmd_ring(xhci, xhci->current_cmd);
+ /* is this the second timeout for the same command? ring wont restart */
+ if (old_status == COMP_CMD_ABORT) {
+ xhci_err(xhci, "Abort timeout, Fail to restart cmd ring\n");
+ xhci_cleanup_command_queue(xhci);
+ /* everything else here to halt, cleanup, free etc kill xhc */
+ } else {
+ xhci_handle_stopped_cmd_ring(xhci, xhci->current_cmd);
+ }
spin_unlock_irqrestore(&xhci->lock, flags);
return;
}

-Mathias