2.6.20.4: NETDEV WATCHDOG and lockups

From: Christian Kujau
Date: Mon Apr 02 2007 - 16:11:43 EST



Hi there,

we have serious problems with 2 of our servers: both are shiny new amd64 dual-core boxes with 2GB RAM each, running a 32-bit kernel+userland (Debian/testing).
Both servers have 2 NICs, RTL8139 (eth0, irq10) and RTL8169s
(eth1, irq11).

Both boxes run fine, but after "a while" they lock up and eventually restart without warning. The last messages in the logfile are:

14:15:11 db2 kernel: NETDEV WATCHDOG: eth0: transmit timed out
14:15:14 db2 kernel: eth0: link up, 100Mbps, full-duplex, lpa 0x45E1

Then the box reboots, nothing else in the log.
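
To see how often the watchdog fires before a lockup, the kernel log can be scanned for the message above. The following is only a minimal sketch: the function name count_watchdog_timeouts is made up for illustration, and the regular expression assumes the message format is exactly as in the two log lines quoted above.

```python
import re
from collections import Counter

# Matches the watchdog message as it appears in the log excerpt above;
# group 1 captures the interface name (e.g. "eth0").
WATCHDOG_RE = re.compile(r"NETDEV WATCHDOG: (\S+): transmit timed out")


def count_watchdog_timeouts(log_lines):
    """Count 'transmit timed out' watchdog messages per interface."""
    counts = Counter()
    for line in log_lines:
        match = WATCHDOG_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# The two log lines quoted in this message.
LOG = [
    "14:15:11 db2 kernel: NETDEV WATCHDOG: eth0: transmit timed out",
    "14:15:14 db2 kernel: eth0: link up, 100Mbps, full-duplex, lpa 0x45E1",
]

print(dict(count_watchdog_timeouts(LOG)))
```

Feeding it the lines of a real logfile instead of LOG would show whether eth1 ever hits the watchdog as well, or only eth0.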

As the servers were set up only recently, all we know is that it happened with Debian's 2.6.17-? kernel. When we upgraded the installation, we went to 2.6.18-4-k7 and the problem persisted. We're now using vanilla 2.6.20.4 and while the problem persists, it takes longer to lock up (~20h as opposed to 4-5h). While this is a good thing for us, it's now harder to reproduce (we have to wait longer).

Searching the archives turned up quite a few results, but no real fix, and many of the postings are old. We then disabled ACPI completely and booted with 'noapic'. Both boxes have now been running for > 20h and we're curious how long they will last. However, booting with 'noapic' slowed down both servers *a lot*.

From /proc/interrupts we can see that only CPU0 (core 0) is handling interrupts, while CPU1 handles none. We compiled with CONFIG_IRQBALANCE=n so that irqbalance(1) would work - but to no avail.
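
The per-CPU distribution can be checked by summing the count columns of /proc/interrupts. This is a rough sketch, not a definitive tool: the function name per_cpu_interrupt_totals is hypothetical, and the SAMPLE text is made-up illustrative data in the usual /proc/interrupts layout, not output from the affected hosts.

```python
def per_cpu_interrupt_totals(text):
    """Sum interrupt counts per CPU from /proc/interrupts-style text."""
    lines = text.strip().splitlines()
    cpus = lines[0].split()  # header row, e.g. ['CPU0', 'CPU1']
    totals = {cpu: 0 for cpu in cpus}
    for line in lines[1:]:
        fields = line.split()
        # Data rows start with an IRQ label such as "10:" or "NMI:".
        if not fields or not fields[0].endswith(":"):
            continue
        # The next len(cpus) fields are the per-CPU counters; some rows
        # (e.g. ERR) carry fewer, and zip() simply stops early for those.
        for cpu, value in zip(cpus, fields[1:1 + len(cpus)]):
            if value.isdigit():
                totals[cpu] += int(value)
    return totals


# Hypothetical sample data for illustration only.
SAMPLE = """\
           CPU0       CPU1
  0:     500000          0    XT-PIC  timer
 10:     120000          0    XT-PIC  eth0
 11:     340000          0    XT-PIC  eth1
NMI:          0          0
"""

print(per_cpu_interrupt_totals(SAMPLE))
```

On a live system the same function can be applied to open("/proc/interrupts").read(); a total of 0 for CPU1 would match what is described above.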

Please see http://nerdbynature.de/bits/2.6.20.4/ for details on both hosts and feel free to ask for more. Although both boxes are in production, we'll be happy to test more boot options/patches and the like.

TIA,
Christian.
--
BOFH excuse #266:

All of the packets are empty.