[BUG] tg3 crash during RAID5 hot-add (2.4.20 and 2.4.21-rc3)

From: Richard M. van Hees (R.M.van.Hees@sron.nl)
Date: Mon May 26 2003 - 08:05:29 EST


Hi,

My 3COM996 NIC went repeatedly down during a hot-add of a disk to a software
RAID 5 configuration (4 HD on a RocketRAID 404, chipset: hpt374). I have tried
to recover my RAID system using the standard kernel source 2.4.20 & 2.4.21-rc2
(both with the buildin tg3 driver), and 2.4.21-rc3 (bcm5700-5.0.5 driver as
module). The NIC goes down after about 10 - 15 min, when someone tries to scp a
few large files (300 MB). My system is a P4 (2GHz), i850, Mem: 2 GB. System
software is installed on separate disk (not RAID system).

Thanks for any help...

Best regards, Richard van Hees

All I find in the kernel-log is:
May 26 12:14:56 sron0030 kernel: Broadcom Gigabit Ethernet Driver bcm5700 with
Broadcom NIC Extension (NICE) ver. 5.0.5 (09/24/02)
May 26 12:14:56 sron0030 kernel: PCI: Found IRQ 7 for device 02:0a.0
May 26 12:14:56 sron0030 kernel: eth0: 3Com 3C996 10/100/1000 Server NIC found
at mem feae0000, IRQ 7, node addr 0004763b5a20
May 26 12:14:56 sron0030 kernel: eth0: Broadcom BCM5401 Copper transceiver found
May 26 12:14:56 sron0030 kernel: eth0: Scatter-gather ON, 64-bit DMA ON, Tx
Checksum ON, Rx Checksum ON, 802.1Q VLAN ON
May 26 12:14:56 sron0030 kernel: bcm5700: eth0 NIC Link is UP, 1000 Mbps full
duplex
May 26 12:23:15 sron0030 kernel: md: trying to hot-add hdi1 to md0 ...
May 26 12:23:15 sron0030 kernel: md: bind<hdi1,4>
May 26 12:23:15 sron0030 kernel: RAID5 conf printout:
May 26 12:23:15 sron0030 kernel: --- rd:4 wd:3 fd:1
May 26 12:23:15 sron0030 kernel: disk 0, s:0, o:1, n:0 rd:0 us:1 dev:hde1
May 26 12:23:15 sron0030 kernel: disk 1, s:0, o:1, n:1 rd:1 us:1 dev:hdg1
May 26 12:23:15 sron0030 kernel: disk 2, s:0, o:0, n:2 rd:2 us:1 dev:[dev
00:00]
May 26 12:23:15 sron0030 kernel: disk 3, s:0, o:1, n:3 rd:3 us:1 dev:hdk1
May 26 12:23:15 sron0030 kernel: RAID5 conf printout:
May 26 12:23:15 sron0030 kernel: --- rd:4 wd:3 fd:1
May 26 12:23:15 sron0030 kernel: disk 0, s:0, o:1, n:0 rd:0 us:1 dev:hde1
May 26 12:23:15 sron0030 kernel: disk 1, s:0, o:1, n:1 rd:1 us:1 dev:hdg1
May 26 12:23:15 sron0030 kernel: disk 2, s:0, o:0, n:2 rd:2 us:1 dev:[dev
00:00]
May 26 12:23:15 sron0030 kernel: disk 3, s:0, o:1, n:3 rd:3 us:1 dev:hdk1
May 26 12:23:15 sron0030 kernel: md: updating md0 RAID superblock on device
May 26 12:23:15 sron0030 kernel: md: hdi1 [events: 00000026]<6>(write) hdi1's
sb offset: 160079552
May 26 12:23:15 sron0030 kernel: md: hdk1 [events: 00000026]<6>(write) hdk1's
sb offset: 160079552
May 26 12:23:15 sron0030 kernel: md: hdg1 [events: 00000026]<6>(write) hdg1's
sb offset: 160079552
May 26 12:23:15 sron0030 kernel: md: hde1 [events: 00000026]<6>(write) hde1's
sb offset: 160079552
May 26 12:23:15 sron0030 kernel: md: recovery thread got woken up ...
May 26 12:23:15 sron0030 kernel: md0: resyncing spare disk hdi1 to replace
failed disk
May 26 12:23:15 sron0030 kernel: RAID5 conf printout:
May 26 12:23:15 sron0030 kernel: --- rd:4 wd:3 fd:1
May 26 12:23:15 sron0030 kernel: disk 0, s:0, o:1, n:0 rd:0 us:1 dev:hde1
May 26 12:23:15 sron0030 kernel: disk 1, s:0, o:1, n:1 rd:1 us:1 dev:hdg1
May 26 12:23:15 sron0030 kernel: disk 2, s:0, o:0, n:2 rd:2 us:1 dev:[dev
00:00]
May 26 12:23:15 sron0030 kernel: disk 3, s:0, o:1, n:3 rd:3 us:1 dev:hdk1
May 26 12:23:15 sron0030 kernel: RAID5 conf printout:
May 26 12:23:15 sron0030 kernel: --- rd:4 wd:3 fd:1
May 26 12:23:15 sron0030 kernel: disk 0, s:0, o:1, n:0 rd:0 us:1 dev:hde1
May 26 12:23:15 sron0030 kernel: disk 1, s:0, o:1, n:1 rd:1 us:1 dev:hdg1
May 26 12:23:15 sron0030 kernel: disk 2, s:0, o:0, n:2 rd:2 us:1 dev:[dev
00:00]
May 26 12:23:15 sron0030 kernel: disk 3, s:0, o:1, n:3 rd:3 us:1 dev:hdk1
May 26 12:23:15 sron0030 kernel: md: syncing RAID array md0
May 26 12:23:15 sron0030 kernel: md: minimum _guaranteed_ reconstruction speed:
100 KB/sec/disc.
May 26 12:23:15 sron0030 kernel: md: using maximum available idle IO bandwith
(but not more than 100000 KB/sec) for reconstruction.
May 26 12:23:15 sron0030 kernel: md: using 124k window, over a total of
160079552 blocks.
May 26 12:36:39 sron0030 kernel: nfs: server mars not responding, still trying
May 26 12:36:46 sron0030 kernel: NETDEV WATCHDOG: eth0: transmit timed out
May 26 12:37:26 sron0030 kernel: NETDEV WATCHDOG: eth0: transmit timed out
May 26 12:40:22 sron0030 kernel: NETDEV WATCHDOG: eth0: transmit timed out
May 26 12:42:37 sron0030 kernel: RPC: sendmsg returned error 101
May 26 12:42:37 sron0030 kernel: nfs: RPC call returned error 101
May 26 12:42:37 sron0030 kernel: nfs_statfs: statfs error = 512



--
e-mail: Richard van Hees <R.M.van.Hees@xxxxxxx>
date : 26-May-2003 15:02:19
phone : +31-302535715
fax : +31-302540860
www : http://sron9864.sron.nl/~hees
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/