Re: [PATCH 2/2] usb: host: xhci: Handle the right timeout command

From: Mathias Nyman
Date: Wed Dec 21 2016 - 07:56:19 EST


On 21.12.2016 08:57, Lu Baolu wrote:
Hi Mathias,

I have some comments for the implementation of
xhci_handle_command_timeout() as well.

On 12/20/2016 11:13 PM, Mathias Nyman wrote:
On 20.12.2016 09:30, Baolin Wang wrote:
...

Alright, I gathered all current work related to xhci races and timeouts
and put them into a branch:

git://git.kernel.org/pub/scm/linux/kernel/git/mnyman/xhci.git timeout_race_fixes

Its based on 4.9
It includes a few other patches just to avoid conflicts and make my life easier

Interesting patches are:

ee4eb91 xhci: remove unnecessary check for pending timer
0cba67d xhci: detect stop endpoint race using pending timer instead of counter.
4f2535f xhci: Handle command completion and timeout race
b9d00d7 usb: host: xhci: Fix possible wild pointer when handling abort command
529a5a0 usb: xhci: fix possible wild pointer
4766555 xhci: Fix race related to abort operation
de834a3 xhci: Use delayed_work instead of timer for command timeout
69973b8 Linux 4.9

The fixes for command queue races will go to usb-linus and stable, the
reworks for stop ep watchdog timer will go to usb-next.

Still completely untested, (well it compiles)

Felipe gave instructions how to modify dwc3 driver to timeout on address
devicecommands to test these, I'll try to set that up.

All additional testing is welcome, especially if you can trigger timeouts
and races

-Mathias



I post the code below and add my comments in line.

1276 void xhci_handle_command_timeout(struct work_struct *work)
1277 {
1278 struct xhci_hcd *xhci;
1279 int ret;
1280 unsigned long flags;
1281 u64 hw_ring_state;
1282
1283 xhci = container_of(to_delayed_work(work), struct xhci_hcd, cmd_timer);
1284
1285 spin_lock_irqsave(&xhci->lock, flags);
1286
1287 /*
1288 * If timeout work is pending, or current_cmd is NULL, it means we
1289 * raced with command completion. Command is handled so just return.
1290 */
1291 if (!xhci->current_cmd || delayed_work_pending(&xhci->cmd_timer)) {
1292 spin_unlock_irqrestore(&xhci->lock, flags);
1293 return;
1294 }
1295 /* mark this command to be cancelled */
1296 xhci->current_cmd->status = COMP_CMD_ABORT;
1297
1298 /* Make sure command ring is running before aborting it */
1299 hw_ring_state = xhci_read_64(xhci, &xhci->op_regs->cmd_ring);
1300 if ((xhci->cmd_ring_state & CMD_RING_STATE_RUNNING) &&
1301 (hw_ring_state & CMD_RING_RUNNING)) {
1302 /* Prevent new doorbell, and start command abort */
1303 xhci->cmd_ring_state = CMD_RING_STATE_ABORTED;
1304 spin_unlock_irqrestore(&xhci->lock, flags);
1305 xhci_dbg(xhci, "Command timeout\n");
1306 ret = xhci_abort_cmd_ring(xhci);
1307 if (unlikely(ret == -ESHUTDOWN)) {
1308 xhci_err(xhci, "Abort command ring failed\n");
1309 xhci_cleanup_command_queue(xhci);
1310 usb_hc_died(xhci_to_hcd(xhci)->primary_hcd);
1311 xhci_dbg(xhci, "xHCI host controller is dead.\n");
1312 }
1313 return;
1314 }
1315
1316 /* host removed. Bail out */
1317 if (xhci->xhc_state & XHCI_STATE_REMOVING) {
1318 spin_unlock_irqrestore(&xhci->lock, flags);
1319 xhci_dbg(xhci, "host removed, ring start fail?\n");
1320 xhci_cleanup_command_queue(xhci);
1321 return;
1322 }

I think this part of code should be moved up to line 1295.

The XHCI_STATE_REMOVING and XHCI_STATE_DYING needs a rework,
I'm working on that.

Basically we want XHCI_STATE_REMOVING to mean that all devices are going,
away and driver will be removed. Don't bother with re-calculating available
bandwidths after every device removal, but do use xhci hardware to disable
devices cleanly etc.

XHCI_STATE_DYING should mean hardware is not working/responding. Don't
bother writing any registers or queuing anything. Just return all
pending and cancelled URBs, notify core we died, and free all allocated memory.


1323
1324 /* command timeout on stopped ring, ring can't be aborted */
1325 xhci_dbg(xhci, "Command timeout on stopped ring\n");
1326 xhci_handle_stopped_cmd_ring(xhci, xhci->current_cmd);
1327 spin_unlock_irqrestore(&xhci->lock, flags);

This part of code is tricky. I have no idea about in which case should this
code be executed? Anyway, we shouldn't call xhci_handle_stopped_cmd_ring()
here, right?


This isn't changed it these patches.

It will remove the aborted commands and restart the ring. It's useful if we
want to abort a command but command ring was not running. (if for some
unkown reason it was stopped, or forgot to restart.

-Mathias