Excessive network collision & receiver lock-up problem

Solitude (solitude@solitude.flinthills.com)
Tue, 6 Jan 1998 17:37:04 -0600 (CST)


Hello!

I have run into a rather curious situation with linux on a mixed 10-mbit /
100-mbit lan. This is going to be a long email and I apologize in
advance. I'll just cut right to the chase...

I have a LAN with two 10-mbit ethernet 18-port switches and one 8-port
100-mbit (fast ethernet) switch. The two 18-port switches each have two
100-mbit uplinks which are used to link the switches together. For those of
you not familiar with the term switch, you may be familiar with the term
intelligent hub. In any case, these three switches are the heart of the
office lan. Most of the machines in this network are win95 client boxes,
there are a few winnt-workstations, a few sun sparcs, and three linux
boxes. One of the linux machines is a firewall/router and has one
fast-ethernet connection to the switches mentioned above and another
connection to yet another smaller set of non-intelligent 10-mbit hubs (the
outside network). Another linux machine, used for web, ftp, email, and
such, is on this outside network. The third linux machine is on the
internal network (the switches mentioned above) and has a fast-ethernet
card. It is running samba and has a secondary 6gb drive and is acting as
the company file-server.

Well, this fileserver was working just fine for about three or four months
when it began to experience what I'll call an ethernet-receiver lock-up.
It had (at the time) a 100-mbit 3com 3c59x card, aka the vortex card.
When a network client requested a large file (roughly 20 mbytes or more)
the server would disappear from the network. When I went to the
server and attempted to initiate an outbound ping to the firewall it would
fail. When I went to any other random machine on the network and
attempted to ping the linux server it would fail. The macintosh clients
would all immediately complain that their server had disappeared.
Whenever this happened the only way I could correct the problem was by
rebooting the server. Before rebooting, someone pointed out to me that
'ifconfig eth0' showed an EXCESSIVE number of collisions. The number of
collisions is usually about a quarter of the number of sent packets. More
on this later.
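
For anyone who wants to check the same ratio on their own box, here is a minimal Python sketch. It assumes ifconfig prints its counters in the form `TX packets:N` and `collisions:N`; that layout is an assumption and may differ between net-tools versions, so adjust the patterns if needed. The function name `collision_ratio` and the sample text are made up for illustration and are not real output from the fileserver.

```python
import re


def collision_ratio(ifconfig_text):
    """Return collisions / transmitted packets, or None if it cannot be computed."""
    # Assumed counter layout: "TX packets:N" and "collisions:N".
    tx = re.search(r"TX packets:(\d+)", ifconfig_text)
    coll = re.search(r"collisions:(\d+)", ifconfig_text)
    if not tx or not coll:
        return None
    sent = int(tx.group(1))
    if sent == 0:
        return None  # nothing sent yet, so the ratio is undefined
    return int(coll.group(1)) / sent


# Illustrative sample only; the numbers are invented to match "about a quarter".
SAMPLE = """\
eth0      Link encap:Ethernet
          RX packets:350000 errors:0 dropped:0 overruns:0
          TX packets:400000 errors:0 dropped:0 overruns:0
          collisions:100000
"""

ratio = collision_ratio(SAMPLE)
```

Feeding it the saved output of 'ifconfig eth0' before and after a large transfer would show whether the ratio really stays near one quarter.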

Now, at this point most people are going to tell me that the vortex card
is known to be buggy, I should try a different card, etc, etc, etc. In
anticipation of this type of response I tried everything under the sun. I
tried 4 different PCI fast-ethernet cards, all of which had the same
problem. I grabbed a new hub from a different vendor (the switches are
made by bay-networks; the test-hub was made by asante). And I even took
the hard drives out of the failing server and put them in a new box. The
new box had a different scsi controller, yet another new ethernet card,
and literally all-new hardware, except for the two hard drives. The
problem persisted. My test was done as follows: I hooked this all-new
fileserver up to the asante (test) hub with a win95 client. These were
the only two connections on the hub; this was not hooked into the office
network or anything else. This test was also done with new network
cabling that was not used in the previous network setup. I attempted to
copy a 50 meg file from the linux machine to the win95 machine (the xfer
was initiated from the win95 machine over smb) and the problem persisted.
I then tried ftp and the problem persisted. I then did this again from a
different win95 machine, and the problem persisted. Odd, eh? I then
connected the hub to the office network and copied a large (50 mbyte) file
from the firewall mentioned above (a completely different linux machine)
and STILL had the friggin' collision problem. Now, this is the point at
which I am stuck and am asking the linux-kernel community for help. I
think you will need some more detailed specifics however...

"The problem" is this. With the vortex card and some others, such as an
intel etherexpress pro 100 (I don't remember which other cards I tried),
the machine would simply cease to transmit or receive data from the
network. I went to the linux machine and started pinging the win95
client. I then transferred the large file (served from the linux machine
to the 95 machine, requested by the 95 machine) and the ping immediately
stopped. At this point no more data will transfer over the ethernet
device on the linux server. No outbound or inbound pings, or anything
else. Plus, no error messages are recorded in syslog. All of these hubs
and switches have a separate collision light for each ethernet port.
During the transfer the collision light on the linux box's ethernet port
was going crazy. As a matter of fact, any time data is being
*successfully* sent to or from a linux machine the collision light blinks
away very steadily. Once the ethernet transceiver locks up (OK, I don't
actually know for sure that this is what is happening, but it sounds
accurate to me) the collision light goes out and no more data can be sent
until the server is rebooted. Note that the hub has NOT partitioned off
the linux machine's port (i.e. the network connection looks just fine, as
far as the hub is concerned). Plus, I did try a different vendor's hub and
had the exact same problem occur.
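
The ping-while-transferring test described above could be scripted roughly as follows, so that the moment the replies stop gets recorded. This is only a sketch: the helper names `ping_once` and `first_failure` are hypothetical, the address in the usage comment is a placeholder, and it assumes a system `ping` that accepts `-c 1` to send a single echo request.

```python
import subprocess
import time


def ping_once(host):
    """Send one echo request with the system ping; True if a reply came back."""
    try:
        result = subprocess.run(
            ["ping", "-c", "1", host],  # assumes ping supports "-c <count>"
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=5,
        )
    except (subprocess.TimeoutExpired, OSError):
        return False  # no answer in time, or ping could not be started
    return result.returncode == 0


def first_failure(probe, attempts, interval=1.0, sleep=time.sleep):
    """Call probe() up to `attempts` times, pausing `interval` seconds
    after each success. Return the index of the first failed probe, or
    None if every probe succeeded."""
    for i in range(attempts):
        if not probe():
            return i
        sleep(interval)
    return None


# Hypothetical usage (placeholder address), started just before the transfer:
#   idx = first_failure(lambda: ping_once("192.168.1.10"), attempts=600)
#   print("replies stopped at probe", idx)
```

Passing the probe in as a function keeps the loop independent of how the ping is actually done, so the same loop could watch an outbound ping from the server or an inbound ping from a client.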

I have found one way to avoid this supposed ethernet transceiver lockup
phenomenon. I used an ISA 3com 3c509 card. This is a 10-mbit card. While
using this card I have not been able to cause the linux server to
disappear from the network and everything seems to work fine, except that
the collisions persist. To give you an idea of the magnitude of this,
during a transfer the collision light is on as solid as a rock. It
doesn't waver or hint at shutting off until the transfer is completed.
'ifconfig eth0' reports collisions in the tens or hundreds of thousands.
As I stated before, the number of collisions is usually about a quarter of
the total number of packets sent. This leaves the network server stable,
yet very very slow. I suppose slow is preferable to non-existent, as far
as network connections go. Anyway, any assistance would be greatly
appreciated!

The three linux boxes (and the fourth newly built box for testing
purposes) are all pentiums. They are all 133mhz or greater, and all
(obviously) have a pci bus. I tested every single linux machine and
found I could cause this collision problem to occur on all of them, yet it
doesn't affect any non-linux machine on the network (e.g. a 50-meg win95
to win95 xfer works just fine). The linux machines all have at least 16 mb
of ram, and the main fileserver has 128 megs of ram. I believe all the
simms are 60ns. The fileserver has an aha2940 controller, and its newly
built box was using an aha2920 controller. The ethernet cards I have
tried (that I can remember) are: the 3c590, 3c950, 3c509 (that's the isa
card that worked), the intel ether-express pro 100, and an ne2000 isa
clone. I was going to try a pci linksys 100mbit card and found to my
dismay that linux does not have driver support for this card. All the
boxes are slackware systems running kernel 2.0.32 and all the ethernet
drivers have been built as kernel modules. I do want to make it clear
that these machines have never locked up (crashed) or suffered any other
anomalous problems which, to my knowledge, might affect this situation.
When I said lockup above I was referring to the ethernet transceiver
ceasing to transfer data.

----

Thanks for a great OS!

- John Frear
<solitude@solitude.flinthills.com>
<jsf8471@ksu.edu>