DAC960 and latest kernels

From: ccantwel
Date: Wed Nov 17 2004 - 14:27:40 EST


It seems that I have found a very serious bug relating to the DAC960
driver, it is present in 2.4.27 and 2.6.8.1 but I haven't tried any
development kernels newer than that. This message is meant to inform
the people who may be able to figure out what the problem is and
repair the bug. If anyone wants to respond to me or has further
information on this problem, please cc me directly, as I am not a
member of this mailing list.

There is a bug in latest released kernels for both 2.4 and 2.6
trees that seems to relate to the DAC960 driver. I have tried
2.4.27 and 2.6.8.1 and both have the following problem. After
replacing and testing every piece of hardware I've determined
the problem is in the kernel and not the machine. Also, going
back to kernel 2.4.17 makes everything work perfectly.

When copying a large number of files over the network using nc
and tar and also stress testing on the receiving side with
frequent du commands where the data is being written to (not
required for the problem but it seems to make it more likely
to happen) the machine randomly hangs. Sometimes it crashes
completely with an Unable to handle kernel paging request at
virtual address, with various processes, and sometimes it just
completely halts and responds to nothing even though it seems
to sort of be running (answers on sockets but transfers no data,
no more communication of any kind on other open sockets, entering
a login name and enter on the console also hangs before the
password prompt.

Upon rebooting the filesystem that was being written to is corrupt.
When this file system is XFS I can still inspect it. If it is
reiser the machine will often hang trying to run fsck.reiserfs.

In addition, after enough network tar copying I can stop the process,
and even if it has not hung the machine often the filesystem has
become corrupt. Since this happens with both xfs and reiser and
there are paging problems too I believe it has to relate to the
dac960 driver itself.

I've also tried swapping DAC960 cards, and that doesn't matter,
they both work perfectly in 2.4.17 and the problems occur in
both cases in 2.4.27 and 2.6.8.1. I'm planning to try some
more kernels in between to narrow down when the problem was
introduced but I'm not a kernel developer and that is about the
most I will be able to do.

I think the DAC960 driver isn't used very often and also without this
specific type of stress testing it is hard to realize the problem is
occuring before the filesystem is irreperably corrupt weeks down the
road, which is why this problem may have not been found yet. Without
stress testing using tar the machine can go for weeks and act like it
is fine.

Again, if anyone has any information, please cc me when you reply as
I am not a member of this list.

Thank you
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/