Re: [2.6.30-rc2] usb reset during big file transfer and ext3 error

From: Rogério Brito
Date: Fri May 01 2009 - 05:16:17 EST


Hi, Alan.

Sorry for the late reply; one of my hard disks was giving me trouble.
:-( Of course, I have backups. :-)

On Apr 22 2009, Alan Stern wrote:
> On Wed, 22 Apr 2009, Rogério Brito wrote:
> > Is there any way of controlling the number of retries in the host
> > controller? Or, perhaps, of controlling the time between retries so
> > that the device can shape it up again?
>
> It's not all that simple. The host controller allows the OS to set the
> number of hardware retries to 1, 2, 3, or unlimited. Linux uses 3;
> those XactErr debugging messages in your log show that the driver was
> extending the number of retries in software.

Right. I didn't know that. Obviously, setting it to unlimited could make
the computer behave unpredictably.
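
As a rough illustration of the hardware retry setting described above,
here is a minimal Python sketch; it is not code from ehci-hcd. It assumes
the EHCI layout in which the 2-bit error counter (Cerr) sits in bits 11:10
of the qTD token and a value of 0 means "do not count errors", i.e.
unlimited retries. The helper names are hypothetical.

```python
# Sketch only: helper names are hypothetical, not taken from ehci-hcd.
CERR_SHIFT = 10                  # assumed: Cerr occupies bits 11:10 of the qTD token
CERR_MASK = 0x3 << CERR_SHIFT    # two bits, so only the values 0..3 fit


def set_cerr(token: int, retries: int) -> int:
    """Return token with its error counter replaced.

    retries is 1, 2 or 3; 0 asks the controller not to count errors
    (the "unlimited" case).
    """
    if not 0 <= retries <= 3:
        raise ValueError("the 2-bit counter only holds 0..3")
    return (token & ~CERR_MASK) | (retries << CERR_SHIFT)


def get_cerr(token: int) -> int:
    """Extract the error counter from a token."""
    return (token & CERR_MASK) >> CERR_SHIFT
```

With these helpers, get_cerr(set_cerr(0, 3)) gives back 3, the value the
thread says Linux uses.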

> It's not possible to change the time interval between retries done by
> the hardware. While it is possible in theory to change the interval
> between retries done by the driver, it would be rather difficult and
> so ehci-hcd doesn't attempt it.

Oh, what a pity. It seems that the device at hand sort of gets back in
shape after some time: I have an automounter here, the device nodes
appear again under /dev, and the device gets auto-mounted at the
appropriate mount point. Weird.

> The software retries were introduced to solve one particular problem:
> Many EHCI controllers will generate a transaction error if a data
> transfer is occurring on one port at the same time as a device is
> being unplugged on another port.

Right. I just got myself a (non-powered) USB hub and I noticed one thing
(unrelated to this problem): if I plug a USB disk into this hub and then
plug in a printer, very weird things happen, like the disk being
unmounted.

I can give you precise details of what happens here, if you're
interested.

OTOH, I think that I may be seeing some other problems with a pen drive
being connected to a port of this machine I'm typing this message on. I
will try to compile a newer kernel, now that -rc4 is released, and I
would appreciate it if you could help me with the issues that I'm seeing.

> This is clearly a hardware bug, and the software retries were intended
> to work around it. In practice only a couple of software retries are
> needed; if the transfer hasn't succeeded by that point then it's never
> going to succeed. I set the upper limit to 32 retries just to be
> conservative.

OK. Thanks for the nice and clear explanation of the problem. I only
wonder why I am not seeing these errors on some machines while I *do* see
them on others (this one is an Intel ICH5).
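
To make the two retry layers discussed above concrete, here is a small
Python model of them. It is only a sketch under stated assumptions, not
the driver's code: the hardware limit of 3 is modeled as three attempts
per round, the driver resubmits after each failed round, and it gives up
after 32 software retries. The names run_transfer and attempt_ok are
hypothetical.

```python
HW_RETRIES = 3       # hardware retry setting mentioned in the thread
SW_RETRY_MAX = 32    # upper limit on software retries mentioned in the thread


def run_transfer(attempt_ok, hw_retries=HW_RETRIES, sw_max=SW_RETRY_MAX):
    """Model one transfer; attempt_ok() stands for a single bus transaction.

    Returns (succeeded, software_retries_used).
    """
    sw_retries = 0
    while True:
        # One round: the controller tries up to hw_retries times on its own.
        for _ in range(hw_retries):
            if attempt_ok():
                return True, sw_retries
        # The round failed; the driver resubmits unless its limit is reached.
        if sw_retries >= sw_max:
            return False, sw_retries
        sw_retries += 1
```

A device that fails four transactions and then recovers succeeds after a
single software retry, while a device that never answers makes the model
give up once the 32 software retries are used.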

> If transaction errors aren't caused by noise in the cable then they
> are almost always caused by bugs or failures in the device.

I will try again with a shorter and newer cable. Let's see how that
works. BTW, is there any way to check the quality of a cable? I have a
multimeter here and I would be willing to do some extensive tests.
Testing the USB enclosure is also pretty feasible.

> Once a device's firmware has crashed, it doesn't magically fix itself.

Oh, what a pity that it doesn't recover by itself with a watchdog-like
mechanism.


Thanks for all your help, Rogério.

--
Rogério Brito : rbrito@{mackenzie,ime.usp}.br : GPG key 1024D/7C2CAEB8
http://www.ime.usp.br/~rbrito : http://meusite.mackenzie.com.br/rbrito
Projects: algorithms.berlios.de : lame.sf.net : vrms.alioth.debian.org