staging: usbip: all I/O dies, how to debug?

From: Alexander Thomas
Date: Mon Aug 29 2011 - 11:23:25 EST


Hello,

I am experimenting with the usbip project that is currently in the
staging drivers tree. I have a particularly nasty problem with it.

Everything works but at random moments all I/O will die system-wide.
In most cases this manifests itself as X11 freezing entirely, with a
hardware reset as the only way out. Sometimes certain programs like xclock
or top in a terminal will still show activity, but the machine will
not react to anything: no keyboard input, no ping, no ACPI shutdown.
The logs contain nothing useful except, sometimes, a message about
(S)ATA being reset. If I am at the console when it happens, it shows
the same (S)ATA reset, then attempts to remount the filesystem
read-only, and eventually it keeps repeating messages like "Buffer I/O
error", "lost page write" and "unhandled error code
Result=DID_BAD_TARGET driverbyte=DRIVER_OK" every few dozen seconds.
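
To check a saved log or a console capture for these messages after a
freeze, something like the following could be used. This is only a
minimal sketch: the function name find_io_error_lines is made up for
illustration, and it matches only the three strings quoted above. The
exact wording of the (S)ATA reset message is not quoted here, so it is
left out rather than guessed.

```python
import re

# Substrings quoted in the report above. The (S)ATA reset message is
# deliberately not included because its exact wording is not known here.
PATTERNS = ("Buffer I/O error", "lost page write", "DID_BAD_TARGET")
_MATCHER = re.compile("|".join(re.escape(p) for p in PATTERNS))


def find_io_error_lines(lines):
    """Return (line_number, text) pairs for lines containing a known message.

    `lines` is any iterable of strings, e.g. an open log file.
    Line numbers start at 1.
    """
    hits = []
    for number, text in enumerate(lines, start=1):
        if _MATCHER.search(text):
            hits.append((number, text.rstrip("\n")))
    return hits
```

It can be fed an open file object, for example a capture of the serial
or network console taken on another machine, since the local logs are
usually not written once the disk I/O has died.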

After many experiments I have found that there is one condition that
must be met to trigger the crash: there must be simultaneous USB
traffic from the remote device and a local USB device (moving the
mouse will do). There are also a few other conditions that increase
the probability of it happening:
1. There is other heavy I/O traffic on the client, e.g. disk activity.
Compiling something is a good way to trigger the freeze.
2. The traffic from the remote device is incoming. Although I did
manage to get a crash while playing sound to a remote USB sound card,
it took way longer than when recording sound.
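
For condition 1, a compile job is what I actually used, but a small
script can produce comparable disk activity on demand. The following is
a rough sketch under my own assumptions: the name generate_disk_load and
its parameters are invented for illustration, and it only creates disk
writes. It does not generate any USB traffic, so the remote device and
a local USB device still have to be active at the same time.

```python
import os
import tempfile
import time


def generate_disk_load(directory, seconds=1.0, chunk_size=1 << 20,
                       max_file_bytes=64 << 20):
    """Write and fsync chunks to a scratch file for roughly `seconds`.

    The scratch file lives in `directory`, is rewound once it reaches
    `max_file_bytes`, and is deleted on exit. Returns the total number
    of bytes written.
    """
    chunk = os.urandom(chunk_size)
    written = 0
    in_file = 0
    deadline = time.monotonic() + seconds
    with tempfile.NamedTemporaryFile(dir=directory) as scratch:
        while True:
            scratch.write(chunk)
            scratch.flush()
            os.fsync(scratch.fileno())  # push the data out to the disk
            written += len(chunk)
            in_file += len(chunk)
            if in_file >= max_file_bytes:
                # Rewind so the scratch file stays bounded in size.
                scratch.seek(0)
                scratch.truncate()
                in_file = 0
            if time.monotonic() >= deadline:
                break
    return written
```

Calling generate_disk_load("/tmp", seconds=60) would keep the disk busy
for about a minute; the directory should be on a real disk rather than
a tmpfs, otherwise no block I/O is generated.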

I have tested this with kernels ranging from 2.6.30 to 2.6.38 on two
different physical machines and inside a virtual machine. A possibly
important note is that I first had this exact same problem with a
commercial USB/IP product. I contacted the vendor but they say they
are unable to reproduce the problem. My disappointment was great when
I finally got the open source usbip working on older kernels only to
discover that it kills my system in the same way.

I have tried to debug in a virtual machine, but this is problematic
because:
a. It involves using a serial connection, which dies together with
everything else.
b. There is no way to predict when the crash will happen. This is one
of those annoying, completely random bugs.
c. I have no experience in debugging at kernel/module level.

Is there anyone who can give pointers as to how to debug a problem
like this, and/or where to look for the cause?

Alexander