Re: Intel e1000e: eth0: Detected Tx Unit Hang

From: Soeren Sonnenburg
Date: Tue Jan 06 2009 - 02:35:06 EST


On Mon, 2009-01-05 at 16:38 -0800, Jesse Brandeburg wrote:
> On Thu, Jan 1, 2009 at 11:56 AM, Soeren Sonnenburg <kernel@xxxxxx> wrote:
> > Dear list,

Dear Jesse,

> > I just recently observed a strange problem with an onboard 82567LF-2
> > Intel ethernet controller. It completely stopped working and this
> > desktop machine required a powerdown to get it to work again. This
> > happened with 2.6.27.10. However, the machine was working for weeks
> > before using an older kernel version (2.6.27.*, no binary modules, intel
> > skyburg mainboard, e8400 c2d cpu).
> >
> > Relevant details follow, can someone make sense of this?
>
> what does cat /proc/interrupts say?

$ uptime
08:10:46 up 4 days, 11:34, 6 users, load average: 0.83, 0.87, 0.89

$ cat /proc/interrupts
CPU0 CPU1
0: 659 653 IO-APIC-edge timer
1: 1 1 IO-APIC-edge i8042
8: 0 1 IO-APIC-edge rtc0
9: 0 0 IO-APIC-fasteoi acpi
12: 2 1 IO-APIC-edge i8042
16: 5058866 5065629 IO-APIC-fasteoi uhci_hcd:usb3
17: 14 14 IO-APIC-fasteoi HDA Intel
18: 372 390 IO-APIC-fasteoi ehci_hcd:usb1,
uhci_hcd:usb5, uhci_hcd:usb8
19: 0 0 IO-APIC-fasteoi uhci_hcd:usb7
20: 7922691 7927065 IO-APIC-fasteoi saa7146 (0)
21: 33168354 33168257 IO-APIC-fasteoi uhci_hcd:usb4, saa7146
(1)
22: 244360273 244341840 IO-APIC-fasteoi HDA Intel, saa7146 (2)
23: 0 0 IO-APIC-fasteoi ehci_hcd:usb2,
uhci_hcd:usb6
509: 23282093 23290682 PCI-MSI-edge eth0
510: 23874175 23873081 PCI-MSI-edge ahci
NMI: 0 0 Non-maskable interrupts
LOC: 147852851 146009672 Local timer interrupts
RES: 758105 701724 Rescheduling interrupts
CAL: 1181574 1176862 function call interrupts
TLB: 230633 153469 TLB shootdowns
TRM: 0 0 Thermal event interrupts
THR: 0 0 Threshold APIC interrupts
SPU: 0 0 Spurious interrupts
ERR: 0

> please also include at least ethtool -e eth0 length 256,

# ethtool -e eth0 length 256

Offset Values
------ ------
0x0000 XX XX XX XX XX XX 00 08 ff ff 85 10 ff ff ff ff
0x0010 ff ff ff ff c3 10 01 50 86 80 cd 10 86 80 00 00
0x0020 01 0d 00 00 00 00 05 86 20 30 00 0a 00 00 07 8d
0x0030 84 06 40 2b 43 00 00 00 cc 10 ad ba cc 10 cd 10
0x0040 ad ba ce 10 ad ba ad ba 00 00 00 00 00 00 00 00
0x0050 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0x0060 00 01 00 40 16 13 07 40 ff ff ff ff ff ff ff ff
0x0070 ff ff ff ff ff ff ff ff ff ff 00 01 ff ff 8b 2c
0x0080 20 60 1f 00 02 00 13 00 00 80 1d 00 ff 00 16 00
0x0090 22 99 18 00 11 20 17 00 dd dd 18 00 12 20 17 00
0x00a0 00 80 1d 00 00 00 1f 00 ff ff ff ff ff ff ff ff
0x00b0 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
0x00c0 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
0x00d0 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
0x00e0 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
0x00f0 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff

> do you have
> the IOMMU enabled? your full dmesg would let me know.

$ zgrep -i iommu /proc/config.gz
CONFIG_GART_IOMMU=y
CONFIG_CALGARY_IOMMU=y
CONFIG_CALGARY_IOMMU_ENABLED_BY_DEFAULT=y
CONFIG_AMD_IOMMU=y
CONFIG_IOMMU_HELPER=y
# CONFIG_IOMMU_DEBUG is not set

> also, please double check you're running the latest BIOS for your motherboard.

indeed there seems to be an update to version 1.06 for this mainboard
(DP45SG), so hopefully this goes away

$ dmesg | grep -i calg

Calgary: detecting Calgary via BIOS EBDA area
Calgary: Unable to locate Rio Grande table in EBDA - bailing!

$ dmesg | grep -i bug
pci 0000:00:1a.7: EHCI: BIOS handoff failed (BIOS bug?) 01010001
pci 0000:00:1d.7: EHCI: BIOS handoff failed (BIOS bug?) 01010001

even though I don't find any insight in the changelog:

http://downloadmirror.intel.com/17218/eng/SG_0106_ReleaseNotes.pdf

> > 00:19.0 Ethernet controller: Intel Corporation 82567LF-2 Gigabit Network Connection
> >
> > $ dmesg | grep relevant parts
> >
> > Intel(R) PRO/1000 Network Driver - version 7.3.20-k3-NAPI
> > e1000e: Intel(R) PRO/1000 Network Driver - 0.3.3.3-k6
> > 0000:00:19.0: eth0: Intel(R) PRO/1000 Network Connection
> > Intel(R) Gigabit Ethernet Network Driver - version 1.2.45-k2
> >
> > 0000:00:19.0: eth0: (PCI Express:2.5GB/s:Width x1) XX:XX:XX:XX:XX:XX
> > 0000:00:19.0: eth0: Intel(R) PRO/1000 Network Connection
> > 0000:00:19.0: eth0: MAC: 5, PHY: 8, PBA No: ffffff-0ff
> > ADDRCONF(NETDEV_UP): eth0: link is not ready
> > 0000:00:19.0: eth0: Link is Up 1000 Mbps Full Duplex, Flow Control: RX/TX
> > ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
> > [...]
> > [uptime of a couple of days, with lots of network i/o]
> > [...]
> > saa7146 (1) saa7146_i2c_writeout [irq]: timed out waiting for end of xfer
> > saa7146 (1) saa7146_i2c_writeout [irq]: timed out waiting for end of xfer
> > 0000:00:19.0: eth0: Detected Tx Unit Hang:
> > TDH <ff>
> > TDT <1>
> > next_to_use <1>
> > next_to_clean <ff>
> > buffer_info[next_to_clean]:
> > time_stamp <1104ec2c3>
> > next_to_watch <ff>
> > jiffies <1104ec8f0>
> > next_to_watch.status <0>
>
> <snip> lots of hangs...
>
> what is the frequency of the hangs? What kind of traffic are you using?

Very low frequency (I posted all there was after about 4hrs), traffic
with about 10MBit/s downloading data from the net.

As I said, this so far only happened once and the machine is up since
the middle/late of October (IIRC with 2.6.27.3 at that time and recently
since end of december .10).

Soeren
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/