Re: Strange errors from e1000 driver (2.6.18)

From: Jesse Brandeburg
Date: Sun Oct 22 2006 - 18:15:48 EST


Analysis follows, but I wanted to ask you to bisect back if you can to
find the apparent patch to make the difference. Basically at this
point I'd say its not likely to be an e1000 issue, but I'd like to
follow up and make sure.

On 10/22/06, Martin J. Bligh <mbligh@xxxxxxxxxx> wrote:
0000:00:0a.0 Ethernet controller: Intel Corporation 82546EB Gigabit
Ethernet Con
troller (Copper) (rev 01)
Latency: 32 (63750ns min), Cache Line Size: 0x08 (32 bytes)
Interrupt: pin A routed to IRQ 5
Region 0: Memory at ef020000 (64-bit, non-prefetchable) [size=128K]
Region 4: I/O ports at a000 [size=64]
Capabilities: [dc] Power Management version 2
Flags: PMEClk- DSI+ D1- D2- AuxCurrent=0mA
PME(D0+,D1-,D2-,D3hot
+,D3cold+)
Status: D0 PME-Enable- DSel=0 DScale=1 PME-
Capabilities: [e4] Capabilities: [f0] Message Signalled
Interrupts:
64bit+ Queue=0/0 Enable-
Address: 0000000000000000 Data: 0000

0000:00:0a.1 Ethernet controller: Intel Corporation 82546EB Gigabit
Ethernet Con
troller (Copper) (rev 01)
Subsystem: Intel Corporation PRO/1000 MT Dual Port Server Adapter
Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop-
ParErr- Step
ping- SERR- FastB2B-
Status: Cap+ 66MHz+ UDF- FastB2B- ParErr- DEVSEL=medium
>TAbort- <TAbort
- <MAbort- >SERR- <PERR-
Latency: 32 (63750ns min), Cache Line Size: 0x08 (32 bytes)
Interrupt: pin B routed to IRQ 11
Region 0: Memory at ef000000 (64-bit, non-prefetchable) [size=128K]
Region 4: I/O ports at a400 [size=64]
Capabilities: [dc] Power Management version 2
Flags: PMEClk- DSI+ D1- D2- AuxCurrent=0mA
PME(D0+,D1-,D2-,D3hot
+,D3cold+)
Status: D0 PME-Enable- DSel=0 DScale=1 PME-
Capabilities: [e4] Capabilities: [f0] Message Signalled
Interrupts:
64bit+ Queue=0/0 Enable-
Address: 0000000000000000 Data: 0000

Nothing seems out of order, but the latency may be low, I'd be curious
what these looked like before with the old kernel. Some of the other
things to compare would have been the lspci -vv output from your
chipset with old/new kernel, in case the bridge/system configuration
changed. There are no known problems right now with this chipset
82546EB

> cat /proc/interrupts,

CPU0
5: 1975991 XT-PIC ehci_hcd:usb2, VIA8237, eth0
NMI: 0
LOC: 146264664
ERR: 52805

shared int, fine, but whats with the ERR: ?

> dmesg

Did that bit already.

except you didn't include any of the e1000 load information nor the
system's boot information as it came up.

> ethtool -e eth0,

root@titus:/usr/local/autotest/bin # ethtool -e eth0
Offset Values
------ ------
0x0000 00 07 e9 09 0b 08 30 05 ff ff ff ff ff ff ff ff
0x0010 44 a9 03 98 0b 46 11 10 86 80 10 10 86 80 68 34
0x0020 0c 00 10 10 00 00 02 21 c8 18 ff ff ff ff ff ff
0x0030 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
0x0040 0c c3 61 78 04 50 02 21 c8 08 ff ff ff ff ff ff
0x0050 ff ff ff ff ff ff ff ff ff ff ff ff ff ff 02 06
0x0060 2c 00 00 40 07 11 00 00 2c 00 00 40 ff ff ff ff
nothing out of order here that I can see immediately.

> and maybe output of
> dmidecode, etc.

Attached.

> only a little. There are so many different pieces of e1000 hardware
> and so few specifics in this report that I'll be able to tell you lots
> more when you get us the info requested.

Thanks. Not sure if the bug wasn't there in earlier kernels, or if we
just weren't printing anything.

I think it may not have been in earlier kernels, but I also don't
think this is an e1000 problem, at least initially.

<snip>
BIOS Information
Vendor: Phoenix Technologies, LTD
Version: 6.00 PG
Release Date: 09/13/2004
Address: 0xE0000
Runtime Size: 128 kB
ROM Size: 256 kB
<snip>

Handle 0x0001, DMI type 1, 25 bytes.
System Information
Manufacturer: VIA Technologies, Inc.
Product Name: KT600-8237

Handle 0x0002, DMI type 2, 8 bytes.
Base Board Information
Manufacturer:
Product Name: KT600-8237
Version:
Serial Number:

This chipset is one of the most frequent common elements in problem
reports of TX hangs for e1000. My current theory (we've bought a
bunch of these systems and never reproduced the issue) is that there
is something either design specific or BIOS specific that causes this
chipset to interact very badly with e1000 hardware. Some systems have
the issue and some don't. If you could bisect back to a working point
it would be interesting to see where that pointed.

Handle 0x0004, DMI type 4, 32 bytes.
Processor Information
Socket Designation: Socket A
Type: Central Processor
Family: Athlon XP
Manufacturer: AMD
ID: A0 06 00 00 FF FB 83 03
Signature: Family 6, Model A, Stepping 0
Version: AMD Athlon(tm) XP
Voltage: 1.5 V
External Clock: 100 MHz
Max Speed: 3000 MHz
Current Speed: 1100 MHz

doesn't seem you're overclocked. Good.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/