Re: [PATCH 3.12 033/118] usb: xhci: Link TRB must not occur within aUSB payload burst

From: Sarah Sharp
Date: Tue Jan 07 2014 - 16:21:28 EST

On Tue, Jan 07, 2014 at 05:29:48AM -0800, walt wrote:
> On 01/06/2014 04:31 PM, Sarah Sharp wrote:
> > Hi Walt,
> >
> > I have a couple of patches for you to test.
> > Please only apply the first patch (which is diagnostic only), trigger
> > your issue, and send me the resulting dmesg. Then try applying the
> > other two patches, and see if the issue goes away. (I suspect it won't
> > but I can't be sure.)
> Thanks Sarah. dmesg0 is from the diagnostic patch only. dmesg1 has all
> three patches applied. Some of the messages in dmesg1 fell off the end of
> the kernel buffer, so I may need to make the buffer larger next time but
> I'll need a reminder of how to do it.


> As you suspected, the patches didn't fix the problem, sorry.

Yep, I thought so. I did glean one bit of information from the logs: it
seems that your host does handle no-op TRBs, at least for a while.
However, after a bigger chunk of TRBs, it goes off into la-la-land.

Assuming one of the rings is comprised of two segments:
0xbb711000 (start)
0xbb7113f0 (end)
0xbb711400 (start)
0xbb7117f0 (end)

The log show no-ops were inserted at:
0xbb711370 = 8 no-ops
0xbb7117c0 = 3
0xbb7113b0 = 4
0xbb7113a0 = 5
0xbb7117d0 = 2
0xbb711340 = 11
0xbb711770 = 8
0xbb711230 = 28
0xbb7117e0 = 1
0xbb7117b0 = 4
0xbb7113d0 = 2
0xbb7117b0 = 4
0xbb711340 = 11
0xbb711690 = 22

So the host was able to process 28 no-op TRBs, but failed on 22 no-ops
later. The event ring debugging shows the last event was for
0xbb711680, which is the last TRB before the first no-op inserted before
the host died. There's no Stop Endpoint Command completion, and it
looks like the command was correctly put on the command ring, so it
seems the host is actually hanging for some reason.

Unfortunately, I made a mistake in the debugging patch I sent
you, so it didn't print out the endpoint rings when the host died. I
need that info, to see whether the link TRB was still intact, or if we
over-wrote it and caused the host to go fetch some invalid memory.

Can you please try the attached patch, on top of the previous three
patches, and send me dmesg?

> I find that I can tell in advance whether the copy is going to succeed,
> just by watching the light flicker on the usb3 drive. When the flicker
> is absolutely regular, with no variation whatever, I can tell in 10 or
> 15 seconds that the copy will fail.
> At the same time the light on the main drive goes dark after 10 seconds,
> implying that the usb3 drive stops receiving any data from the main drive
> after 10 seconds, yet the light on the usb3 drive continues to flicker as
> if writing data -- even after the cp officially fails. The light on the
> usb3 drive never stops flickering until I reboot the machine or unplug
> the usb cable.

Interesting. Without a USB analyzer, we can't really tell what's
happening. However, one hypothesis could be that the blinking light is
triggered by an active SCSI command (read/write, etc).

There are three phases of the command: setup, data, and status. I think
your device is getting the setup phase, and the host is dying before it
sends the data phase. If the light blinks when it gets a setup phase,
and turns off when the devices sends a status phase, that would explain
its behavior.

But that's just a hypothesis, I have no idea whether it's correct.

Sarah Sharp