2.2.14 SMP 3com905: transmit timed out: Odd lost irq and ip-stack lockup

From: Dr. Michael Weller (eowmob@exp-math.uni-essen.de)
Date: Wed Oct 11 2000 - 15:42:46 EST


Dear list,

I run a Compaq Proliant 1500 (dual Pentium 75.200) with hardware raid
(Smart2) with two ethernet cards 3com905 (b or c, I can't tell you right
now) as a firewall and web/mail virus scanner which (needless to say)
needs to be up 7d/24h.

Recently, during a pretty fast download the machine (ethernet technically,
you could login on the console, even ping the ethernet ip address) locked
up with the following error log:

Oct 9 17:29:02 fwintern kernel: eth0: transmit timed out, tx_status
00 status e681.
Oct 9 17:29:02 fwintern kernel: eth0: Interrupt posted but not
delivered -- IRQ blocked by another device?
Oct 9 17:29:02 fwintern kernel: Flags; bus-master 1, full 0; dirty
6543269 current 6543269.
Oct 9 17:29:02 fwintern kernel: Transmit list 00000000 vs.
cbef4250.
Oct 9 17:29:02 fwintern kernel: 0: @cbef4200 length 800005ea
status 000105ea
Oct 9 17:29:02 fwintern kernel: 1: @cbef4210 length 800005ea
status 000105ea
Oct 9 17:29:02 fwintern kernel: 2: @cbef4220 length 800005ea
status 000105ea
Oct 9 17:29:02 fwintern kernel: 3: @cbef4230 length 800005ea
status 800105ea
Oct 9 17:29:02 fwintern kernel: 4: @cbef4240 length 800005ea
status 800105ea
Oct 9 17:29:02 fwintern kernel: 5: @cbef4250 length 800005ea
status 000105ea
Oct 9 17:29:02 fwintern kernel: 6: @cbef4260 length 800005ea
status 000105ea
Oct 9 17:29:02 fwintern kernel: 7: @cbef4270 length 800005ea
status 000105ea
Oct 9 17:29:02 fwintern kernel: 8: @cbef4280 length 800005ea
status 000105ea
Oct 9 17:29:02 fwintern kernel: 9: @cbef4290 length 800005ea
status 000105ea
Oct 9 17:29:02 fwintern kernel: 10: @cbef42a0 length 800005ea
status 000105ea
Oct 9 17:29:02 fwintern kernel: 11: @cbef42b0 length 800005ea
status 000105ea
Oct 9 17:29:02 fwintern kernel: 12: @cbef42c0 length 800005ea
status 000105ea
Oct 9 17:29:02 fwintern kernel: 13: @cbef42d0 length 800005ea
status 000105ea
Oct 9 17:29:02 fwintern kernel: 14: @cbef42e0 length 800005ea
status 000105ea
Oct 9 17:29:02 fwintern kernel: 15: @cbef42f0 length 800005ea
status 000105ea
Oct 9 17:29:02 fwintern kernel: eth0: Resetting the Tx ring pointer.
Oct 9 17:29:02 fwintern kernel: Packet log: input DENY eth1 PROTO=17
10.1.1.200:138 10.1.2.2:138 L=257 S=0x00 I=34892 F=0x0000 T=126 (#3)
Oct 9 17:29:12 fwintern kernel: eth0: transmit timed out, tx_status
00 status e601.
Oct 9 17:29:12 fwintern kernel: eth0: Interrupt posted but not
delivered -- IRQ blocked by another device?
Oct 9 17:29:12 fwintern kernel: Flags; bus-master 1, full 0; dirty
6543285 current 6543285.
Oct 9 17:29:12 fwintern kernel: Transmit list 00000000 vs.
cbef4250.
... repeating ...

The problem was reproducible (several times) with the same download (a
300MB file) after a reboot.

There should be no irq collision on irq 9, /proc/interrupts:

           CPU0 CPU1
  0: 4444438 4611949 IO-APIC-edge timer
  1: 2077 2163 IO-APIC-edge keyboard
  2: 0 0 XT-PIC cascade
  5: 390236 391327 IO-APIC-level eth1
  8: 2 0 IO-APIC-edge rtc
  9: 435924 436646 IO-APIC-level eth0
 10: 17 17 IO-APIC-level ncr53c8xx
 11: 24525 24737 IO-APIC-level ida0
 13: 0 0 XT-PIC fpu
NMI: 0
ERR: 0

No other hardware is installed.

Another observation: Normally the free (non buffer) memory of that machine
is 3MB (of 192MB). At the point of the crash it was down to 1.7MB. I don't
know if this is the cause or just a symptom of the locked up ethernet
card.

Is this a known problem? If it's not a hardware/driver bug, I suspect an
SMP related race condition. Maybe the bios did set up the cpu's / io-apics
wrong? I should mention that now, two days later, I can no longer
reproduce the crash.

This might be due to 2 reasons: 1) The bandwidth when downloading from the
web server was measurably slower (about 1/2) this time (about 600kbit (of
a 2Mb link)) today. 2) the machine was power cycled. Unfortunately the
personnel cannot remember if the problem was only reproducible after a
warm reboot or also after a power cycle.

Finally: /proc/version:
Linux version 2.2.14 (root@fwintern) (gcc version 2.95.2 19991024
(release)) #4 SMP Thu Aug 3 20:32:32 CEST 2000

The recent discussion on linux-kernel seems to indicate that gcc 2.95
cannot compile a kernel at all. Could this be the cause here (the system
was up since that Aug 3 flawlessly). Unfortunately there is no other
version available on that system and the distributor (Suse) seems to deny
any gcc 2.95 - kernel incompatibilies.

Please advise,
Michael.

--

Michael Weller: eowmob@exp-math.uni-essen.de, eowmob@ms.exp-math.uni-essen.de, or even mat42b@spi.power.uni-essen.de. If you encounter an eowmob account on any machine in the net, it's very likely it's me.

- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org Please read the FAQ at http://www.tux.org/lkml/



This archive was generated by hypermail 2b29 : Sun Oct 15 2000 - 21:00:19 EST