[BUG] Onboard Ethernet Pro 100 on a SMP box: a very strange errors

From: Denis Zaitsev
Date: Fri Jan 21 2005 - 15:54:49 EST


The long story is:

There is a Dual-processor Intel Server Board STL2 with two P-III/800
and an onboard Intel 82557-based ethernet card. The box has all the
/usr and nearly all of the /var filesystems mounted over NFS. And the
box works for months without any problems around the NFS. So, I think
that the ethernet card just works fine.

But I have some enigmatic problems when I copying _some_ files from an
NFS to the local fs: the process is freezes on the middle.

1) Only _some_ files can't be copied. There are:

gcc-testsuite-3.4-20041217.tar.bz2

krb5-1.3.6-signed.tar

X430src-1.tgz

They are the well-known sources from the well-known ftp and web
places. And I don't think that it's the full list, just the files
for which I have met the problem.

2) Only _these_ files can't be copied. Any other is copied plainly.

3) These files _never_ can be copied.

4) The copy process always freezes at the same place (per file - the
each file has its own place).

In short: it's a list of files, on which the copying is always freezes
and always freezes exactly the same way. And there are no any
exception - I have freezeng each time.

The freezing is forever. The freezed process is in D state, its
/proc/PID/wchan contains page_sync. Each such process eats 1.0 from
/proc/loadavg. And the process can't be killed by any signal.

Then, copying by dd bs=1024 ... just succeeds. After that cp succeeds
too - I think it's because of caching.

Then, there is no visual correlation with the size of the file. So,
it seems that the content of the file is involved... But it is
enigmatic.

The NIC works fine all the other time, so there no suspicions about
hardware problems.

The other NIC - 3C905 PCI external card doesn't show the problem - all
the files are just copied.

An either driver for Ethernet Pro 100 - e100 or eepro100 - show the
same result, but eepro100 logs periodicaly:

eth0: wait_for_cmd_done timeout!

e100 logs nothing.

So, it doesn't ever look like a driver bug...

The kernels tested: 2.6.8.1, 2.6.9, 2.6.10.

GLIBC used: 2.3.2.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/