eepro100 lockup after months of no problems (v1.06, 2.2.10ac5)

Simon Kirby (sim@stormix.com)
Mon, 29 Nov 1999 00:40:33 -0500


Hello,

I work for a WPP with so many Linux servers that I've lost count now.
All of the x86 servers have Intel eepro100 cards in them that have never
shown problems, until just now. Something odd happened on our
routing/firewall box that has 4 eepro100 cards and routes traffic between
many networks.

The machine is running 2.2.10-ac5 with the stock eepro100 driver that's
in this kernel. It appears to be v1.06. This is a (very) production
machine, so we don't change things or experiment on it very often.

This problem has never happened before, but I think it might be similar
to problems that other people have noticed (eepro100 getting stuck). The
box stopped sending and receiving packets on eth1, and so I logged in via
eth0 (the Internet) to see what was up.

Syslog showed:

Nov 28 20:31:15 worf kernel: eth1: Transmit timed out: status 0050 0080 at -816260625/-816260610 command 000c0000.
Nov 28 20:31:15 worf kernel: eth1: Trying to restart the transmitter...
Nov 28 20:31:20 worf kernel: eth1: Transmit timed out: status 0050 0090 at -816260625/-816260610 command 000c0000.
Nov 28 20:31:20 worf kernel: eth1: Trying to restart the transmitter...
Nov 28 20:31:25 worf kernel: eth1: Transmit timed out: status 0050 0090 at -816260625/-816260610 command 000c0000.
...

Miquel, I remember seeing you say that you had problems with your
eepro100s getting stuck sometimes on a high traffic server. Did you also
see messages like this?

"ifconfig eth0" showed normal statistics, and the interface was working
(didn't record it, sorry).

"ifconfig eth1" showed really weird statistics (this was the broken
interface):

[sroot@worf:/root]# ifconfig eth1
eth1 Link encap:Ethernet HWaddr 00:90:27:0F:95:B3
inet addr:204.174.223.254 Bcast:204.174.223.255 Mask:255.255.255.0
UP BROADCAST RUNNING MTU:1500 Metric:1
RX packets:15 errors:54 dropped:0 overruns:0 frame:0
TX packets:2147483647 errors:391 dropped:0 overruns:1 carrier:0
collisions:0 txqueuelen:100
Interrupt:12 Base address:0xa400

I recorded this and "ipchains -vnL" output, but the ipchains output
doesn't seem to contain anything odd.

The "TX" packets is very odd (INT_MAX?). The box has been up and running
for 39 days, and it's quite possible that the TX packet count could have
reached this high (and even wrapped many times), but it seems very odd
that it got stuck at (or got overwritten with) that number.

Removing and reinserting the eepro100 module (remotely) seemed to have
fixed it. This is the output I get when inserting the module:

eepro100.c:v1.06 10/16/98 Donald Becker http://cesdis.gsfc.nasa.gov/linux/drivers/eepro100.html
eth0: OEM i82557/i82558 10/100 Ethernet at 0xa800, 00:90:27:0F:96:8B, IRQ 3.
Board assembly 697680-002, Physical connectors present: RJ45
Primary interface chip i82555 PHY #1.
Forcing 100Mbs full-duplex operation.
General self-test: passed.
Serial sub-system self-test: passed.
Internal registers self-test: passed.
ROM checksum self-test: passed (0x24c9f043).
Receiver lock-up workaround activated.
eth1: OEM i82557/i82558 10/100 Ethernet at 0xa400, 00:90:27:0F:95:B3, IRQ 12.
Board assembly 697680-002, Physical connectors present: RJ45
Primary interface chip i82555 PHY #1.
Forcing 100Mbs full-duplex operation.
General self-test: passed.
Serial sub-system self-test: passed.
Internal registers self-test: passed.
ROM checksum self-test: passed (0x24c9f043).
Receiver lock-up workaround activated.
eth2: OEM i82557/i82558 10/100 Ethernet at 0xa000, 00:90:27:0F:95:7F, IRQ 10.
Board assembly 697680-002, Physical connectors present: RJ45
Primary interface chip i82555 PHY #1.
Forcing 100Mbs full-duplex operation.
General self-test: passed.
Serial sub-system self-test: passed.
Internal registers self-test: passed.
ROM checksum self-test: passed (0x24c9f043).
Receiver lock-up workaround activated.
eth3: OEM i82557/i82558 10/100 Ethernet at 0x9800, 00:90:27:0F:78:58, IRQ 11.
Board assembly 697680-002, Physical connectors present: RJ45
Primary interface chip i82555 PHY #1.
Forcing 100Mbs full-duplex operation.
General self-test: passed.
Serial sub-system self-test: passed.
Internal registers self-test: passed.
ROM checksum self-test: passed (0x24c9f043).
Receiver lock-up workaround activated.

If this is a known problem, please, let me know! I've never seen it
before on any servers, which is odd. Nothing has changed for months that
would have caused this, except perhaps gradually increasing load.

Simon-

[ Stormix Technologies Inc. ][ NetNation Communcations Inc. ]
[ sim@stormix.com ][ sim@netnation.com ]
[ Opinions expressed are not necessarily those of my employers. ]

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.rutgers.edu
Please read the FAQ at http://www.tux.org/lkml/