NETDEV WATCHDOG on U60/SMP

From: BERTRAND Joël
Date: Fri Jun 20 2008 - 04:05:59 EST


Hello,

This mail comes from the sparclinux mailing list. I am reposting it to the general linux-kernel mailing list because I am not sure that this bug is sparc-specific. Nevertheless, I can only reproduce it on sparc64/SMP.

My U60 runs Debian Linux with the official 2.6.25 kernel (I am
currently trying 2.6.25.7). Sometimes, when eth2 is under heavy load, it
hangs with a NETDEV WATCHDOG message:

NETDEV WATCHDOG: eth2: transmit timed out
eth2: transmit timed out, tx_status 00 status 8601.
diagnostics: net 0ccc media 8880 dma 0000003a fifo 0000
eth2: Interrupt posted but not delivered -- IRQ blocked by another device?
Flags; bus-master 1, dirty 2283344(0) current 2283344(0)
Transmit list 00000000 vs. fffff800af098200.
0: @fffff800af098200 length 00000042 status 0c01059a
1: @fffff800af098260 length 00000042 status 0c01059a
2: @fffff800af0982c0 length 00000042 status 0c01059a
3: @fffff800af098320 length 00000042 status 0c01059a
4: @fffff800af098380 length 00000042 status 0c01059a
5: @fffff800af0983e0 length 00000042 status 0c01059a
6: @fffff800af098440 length 00000042 status 0c01059a
7: @fffff800af0984a0 length 00000042 status 0c01059a
8: @fffff800af098500 length 8000002a status 0001002a
9: @fffff800af098560 length 8000002a status 0001002a
10: @fffff800af0985c0 length 8000002a status 0001002a
11: @fffff800af098620 length 8000002a status 0001002a
12: @fffff800af098680 length 8000002a status 0001002a
13: @fffff800af0986e0 length 8000002a status 0001002a
14: @fffff800af098740 length 8000002a status 8001002a
15: @fffff800af0987a0 length 8000002a status 8001002a
eth2: Resetting the Tx ring pointer.
eth2: setting full-duplex.
NETDEV WATCHDOG: eth2: transmit timed out
eth2: transmit timed out, tx_status 00 status 8601.
diagnostics: net 0ccc media 8880 dma 0000003a fifo 0000
eth2: Interrupt posted but not delivered -- IRQ blocked by another device?
Flags; bus-master 1, dirty 16(0) current 16(0)
Transmit list 00000000 vs. fffff800af098200.
0: @fffff800af098200 length 8000002a status 0001002a
1: @fffff800af098260 length 8000002a status 0001002a
2: @fffff800af0982c0 length 8000002a status 0001002a
3: @fffff800af098320 length 8000002a status 0001002a
4: @fffff800af098380 length 8000002a status 0001002a
5: @fffff800af0983e0 length 8000002a status 0001002a
6: @fffff800af098440 length 8000002a status 0001002a
7: @fffff800af0984a0 length 8000002a status 0001002a
8: @fffff800af098500 length 8000002a status 0001002a
9: @fffff800af098560 length 8000002a status 0001002a
10: @fffff800af0985c0 length 8000002a status 0001002a
11: @fffff800af098620 length 8000002a status 0001002a
12: @fffff800af098680 length 8000002a status 0001002a
13: @fffff800af0986e0 length 8000002a status 0001002a
14: @fffff800af098740 length 8000002a status 8001002a
15: @fffff800af0987a0 length 8000002a status 8001002a
eth2: Resetting the Tx ring pointer.
eth2: setting full-duplex.
...
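
A minimal Python sketch for comparing dumps like the two above, assuming the kernel log has been saved as plain text. It only tallies how many Tx ring entries share each (length, status) pair and does not try to interpret the individual bits. The names `DESC_RE` and `summarize_tx_ring` are hypothetical helpers, not part of any driver or tool.

```python
import re
from collections import Counter

# Matches descriptor lines such as
#   "0: @fffff800af098200 length 00000042 status 0c01059a"
DESC_RE = re.compile(
    r"^\s*\d+:\s+@[0-9a-f]+\s+length\s+([0-9a-f]{8})\s+status\s+([0-9a-f]{8})\s*$"
)

def summarize_tx_ring(log_text):
    """Count how many ring entries share each (length, status) pair."""
    counts = Counter()
    for line in log_text.splitlines():
        match = DESC_RE.match(line)
        if match:
            counts[(match.group(1), match.group(2))] += 1
    return counts

# A few lines copied from the first dump; non-descriptor lines are ignored.
sample = """\
eth2: transmit timed out, tx_status 00 status 8601.
6: @fffff800af098440 length 00000042 status 0c01059a
7: @fffff800af0984a0 length 00000042 status 0c01059a
8: @fffff800af098500 length 8000002a status 0001002a
14: @fffff800af098740 length 8000002a status 8001002a
eth2: Resetting the Tx ring pointer.
"""

for (length, status), n in sorted(summarize_tx_ring(sample).items()):
    print(f"length {length} status {status}: {n} entries")
```

Running it on each dump separately makes it easier to see which descriptor patterns change between the first timeout and the following ones.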

I have to reboot the server to restore eth2.
This adapter is a 3Com NIC (3C905). I have tried several different
3Com adapters with the same result. If I replace this NIC (for example with an HME or any PCI 2.1 adapter), I cannot reproduce the bug.

It only occurs when Ethernet traffic on eth2 is high.

I have seen this bug since 2.6.20, even on amd64. I have not seen it on amd64 since 2.6.24, so it may be fixed there, but I cannot confirm this because I do not have an amd64 workstation available to test.

lspci returns:
0000:00:00.0 Host bridge: Sun Microsystems Computer Corp. Psycho PCI Bus Module
0000:00:01.0 Bridge: Sun Microsystems Computer Corp. EBUS (rev 01)
0000:00:01.1 Ethernet controller: Sun Microsystems Computer Corp. Happy Meal 10/100 Ethernet [hme] (rev 01)
0000:00:02.0 Ethernet controller: 3Com Corporation 3c905C-TX/TX-M [Tornado] (rev 78)
0000:00:03.0 SCSI storage controller: LSI Logic / Symbios Logic 53c875 (rev 14)
0000:00:03.1 SCSI storage controller: LSI Logic / Symbios Logic 53c875 (rev 14)
0000:00:04.0 SCSI storage controller: Adaptec AIC-7892A U160/m (rev 02)
0000:00:05.0 USB Controller: NEC Corporation USB (rev 43)
0000:00:05.1 USB Controller: NEC Corporation USB (rev 43)
0000:00:05.2 USB Controller: NEC Corporation USB 2.0 (rev 04)
0001:00:00.0 Host bridge: Sun Microsystems Computer Corp. Psycho PCI Bus Module
0001:80:01.0 Bridge: Sun Microsystems Computer Corp. EBUS (rev 01)
0001:80:01.1 Ethernet controller: Sun Microsystems Computer Corp. Happy Meal 10/100 Ethernet [hme] (rev 01)

ifconfig:
eth0 Link encap:Ethernet HWaddr 08:00:20:a1:4b:33
inet addr:192.168.0.128 Bcast:192.168.0.255 Mask:255.255.255.0
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:16709366 errors:0 dropped:0 overruns:0 frame:1
TX packets:21355942 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:2391901923 (2.2 GiB) TX bytes:21605391421 (20.1 GiB)
Interrupt:14 Base address:0x3000

eth1 Link encap:Ethernet HWaddr 08:00:20:a1:4b:33
inet addr:192.168.254.1 Bcast:192.168.254.255 Mask:255.255.255.0
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:20207169 errors:0 dropped:0 overruns:0 frame:0
TX packets:17280402 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:19068335140 (17.7 GiB) TX bytes:8246313479 (7.6 GiB)
Interrupt:24 Base address:0x1800

eth2 Link encap:Ethernet HWaddr 00:04:75:df:1c:6d
inet addr:192.168.253.1 Bcast:192.168.253.255 Mask:255.255.255.0
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:1843643 errors:0 dropped:0 overruns:0 frame:0
TX packets:2416959 errors:13 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:157416047 (150.1 MiB) TX bytes:2313298605 (2.1 GiB)
Interrupt:17 Base address:0x8000

lo Link encap:Local Loopback
inet addr:127.0.0.1 Mask:255.0.0.0
UP LOOPBACK RUNNING MTU:16436 Metric:1
RX packets:7839862 errors:0 dropped:0 overruns:0 frame:0
TX packets:7839862 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:0
RX bytes:3713209874 (3.4 GiB) TX bytes:3713209874 (3.4 GiB)

Interrupts:
CPU0 CPU2
0: 1253580857 1253580260 <NULL> timer
1: 0 0 sun4u PSYCHO_PCIERR
2: 0 0 sun4u PSYCHO_UE
3: 0 0 sun4u PSYCHO_CE
8: 733411 0 sun4u su(kbd)
9: 0 4396224 sun4u su(mouse)
10: 0 0 sun4u parport0
11: 4 0 sun4u floppy
12: 0 0 sun4u cs4231(capture)
13: 0 0 sun4u cs4231(play)
14: 0 37976886 sun4u eth0
15: 0 218660455 sun4u sym53c8xx
16: 30 0 sun4u sym53c8xx
17: 2042976 2011664 sun4u eth2
18: 137883796 0 sun4u aic7xxx
19: 0 1208028 sun4u ohci_hcd:usb2
20: 0 650947 sun4u ohci_hcd:usb3
21: 1 4 sun4u ehci_hcd:usb1
22: 0 0 sun4u PSYCHO_PCIERR
24: 4957716 33460983 sun4u eth1
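
A minimal Python sketch for reading a table like the one above, assuming it is saved as text with the CPU header on the first line. It lists the IRQs that have been counted on more than one CPU, which may be a useful data point next to the "Interrupt posted but not delivered" message. It makes no claim about the cause. The function names `parse_interrupts` and `shared_irqs` are hypothetical.

```python
def parse_interrupts(text):
    """Return (cpu_names, {irq: (per_cpu_counts, handler_name)})."""
    lines = [line for line in text.splitlines() if line.strip()]
    cpus = lines[0].split()
    rows = {}
    for line in lines[1:]:
        # Split on the first colon only, so names like "ohci_hcd:usb2" survive.
        irq, rest = line.split(":", 1)
        fields = rest.split()
        counts = [int(value) for value in fields[:len(cpus)]]
        rows[irq.strip()] = (counts, fields[-1])
    return cpus, rows

def shared_irqs(rows):
    """Keep only IRQs with a nonzero count on more than one CPU."""
    return {
        irq: entry
        for irq, entry in rows.items()
        if sum(1 for count in entry[0] if count > 0) > 1
    }

# A few rows copied from the table above.
sample = """\
CPU0 CPU2
0: 1253580857 1253580260 <NULL> timer
14: 0 37976886 sun4u eth0
17: 2042976 2011664 sun4u eth2
19: 0 1208028 sun4u ohci_hcd:usb2
24: 4957716 33460983 sun4u eth1
"""

cpus, rows = parse_interrupts(sample)
for irq, (counts, name) in shared_irqs(rows).items():
    print(irq, name, dict(zip(cpus, counts)))
```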

Any ideas?

Regards,

JKB