Re: sky2 rx length errors

From: Grozdan
Date: Sun Sep 20 2009 - 14:47:08 EST


2009/9/20 Willy Tarreau <w@xxxxxx>:
> Hi guys,
>
> On Sun, Sep 20, 2009 at 08:16:02PM +0200, Grozdan wrote:
>> 2009/9/20 Stephen Hemminger <shemminger@xxxxxxxxxx>:
>>
>> >
>> > This error status occurs if the length reported by the PHY does not
>> > match the len reported by the DMA engine.  The error status is:
>> >   0x4420100 = length 1090 + broadcast packet...
>> >
>> > No idea what is on your network, but perhaps there is some MTU confusion?
>> > Since martian destination seems related, knowing more about that packet
>> > might help.
>> >
>>
>> Hi,
>>
>> Thanks for the reply. There's nothing on my home network here. It is
>> just a direct connection from my PC to my cable modem and there's
>> nothing in between. I've googled a bit and it seems others also
>> encounter this problem.
>
> I've encountered similar issues on early 8053 chips too. Those were
> soldered onto the motherboards of network servers bought about 4 years
> ago. No matter what trick I tried (changing drivers, enabling/disabling
> flow control, changing the negotiation speed, etc.), the PHY would
> occasionally and randomly get mad and start shifting received frames by
> a few bytes, thus causing loss of network connectivity. The logs would
> also display martians, depending on which bytes of the frame ended up
> in the IP header once shifted.
>
> Sometimes it would recover by itself after a chip reset, sometimes
> not. Disabling flow control seemed to help a bit, but not by much.
> It would randomly hang every 1-30 days, which made the issue rather
> hard to debug.
>
> I don't precisely remember the rev. of the chip, but I remember that
> it was pretty old and that more recent machines had a much higher
> revision number and never exhibited the issue. Also, my desktop right
> here runs off a 88E8056 (~= two 8053s) and has never failed yet.
>
> So I really think that there was a horrible batch of chips in its
> early days.
>
>> I've read a few posts on the Ubuntu bugzilla
>> where people change the MTU from 1500 to 1492 and this fixes the
>> problem. However, even with this, some report that the problem is
>> still there. I did the same and it didn't change anything for me.
>
> Did not help for me either.
>
>> So I
>> disabled my onboard NIC and added a 3Com one which has been working
>> perfectly so far and I think I'll just keep using it instead of the
>> Marvell one.
>
> That's the best you can do if you happen to have one of those buggy
> chips. We had to put Intel NICs in the servers that were causing
> trouble at the customer's site, and that solved the issue too.
>
> Regards,
> Willy
>
>

Thanks Willy :)

What I'm still wondering about, though, is that I've never seen it
behave like that in the 3 years I've been using it. Only recently,
after upgrading my kernel to 2.6.30 and later to 2.6.31 (self-compiled,
sources taken from the openSUSE build service), did it start to behave
like that. I also used older kernels such as 2.6.27.x and 2.6.29 and
never encountered this. So I'm not sure whether something in the kernel
makes it behave like that, or whether a HW problem suddenly occurred or
got exposed...
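
For reference, the arithmetic behind the status word Stephen quoted
above can be checked with a small sketch. It assumes, based only on the
"length 1090" reading in his mail, that the frame length sits in the
upper 16 bits of the 32-bit status word. The function names are
hypothetical and the low bits are left undecoded; none of this is taken
from the sky2 driver itself:

```python
def decode_rx_status(status):
    """Split a 32-bit rx status word into (length, low_bits).

    Assumption: the length field is the upper 16 bits, which matches
    the "length 1090" reading of 0x4420100 quoted in this thread.
    """
    length = (status >> 16) & 0xFFFF
    low_bits = status & 0xFFFF  # flag bits, left undecoded here
    return length, low_bits


def length_mismatch(status, dma_length):
    """Hypothetical check: True when the length in the status word and
    the length reported by the DMA engine disagree, which is the
    condition described as raising the rx length error."""
    length, _ = decode_rx_status(status)
    return length != dma_length


length, low_bits = decode_rx_status(0x4420100)
print(length, hex(low_bits))  # prints: 1090 0x100
```

So the status word does carry a length of 1090 bytes, and the error
would fire whenever the DMA side reports anything else for that frame.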