Re: sky2 panic in 2.6.32.1 under load (new oops)

From: Michael Breuer
Date: Sat Dec 26 2009 - 15:37:35 EST


On 12/26/2009 12:57 PM, Stephen Hemminger wrote:
On Fri, 25 Dec 2009 22:23:51 -0500
Michael Breuer <mbreuer@xxxxxxxxxx> wrote:

On 12/25/2009 6:22 PM, Stephen Hemminger wrote:
On Fri, 25 Dec 2009 11:28:55 -0500
Michael Breuer <mbreuer@xxxxxxxxxx> wrote:


More data points: I can now reproduce this reliably.
I had assumed it was coincidence, but every time I hit this issue
a DHCP renew event occurs immediately before the first error.
The crash occurs under load; in my case the traffic turns out to be
IPv6 (I hadn't noticed that before).
I ran nethogs on a remote display: the reported rx rate on the IPv6 SMB
connection at the time of the lockup was 33889.688 KB/sec on a 1 Gbit
NIC. I have two events like this. I don't recall whether the earlier one
showed exactly the same figure, but it was in the ballpark.
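
For scale, the quoted rx rate can be converted to a fraction of the link's nominal rate. The sketch below is only arithmetic on the figure above. The function names are made up for illustration, and whether the tool's "KB" means 1000 or 1024 bytes is an assumption either way, so the caller chooses.

```c
#include <assert.h>
#include <stdio.h>

/* Convert a rate in KB/s to Mbit/s.
 * bytes_per_kb must be 1000 or 1024; which one the reporting tool
 * uses is an assumption, so the caller has to pick. */
double kb_per_sec_to_mbit(double kb_per_sec, double bytes_per_kb)
{
    assert(bytes_per_kb == 1000.0 || bytes_per_kb == 1024.0);
    return kb_per_sec * bytes_per_kb * 8.0 / 1e6;
}

/* Fraction of the link's nominal rate (both arguments in Mbit/s). */
double link_utilization(double mbit, double link_mbit)
{
    return mbit / link_mbit;
}

/* Print both interpretations of "KB" for one measured rate. */
void report_rate(double kb_per_sec, double link_mbit)
{
    double dec = kb_per_sec_to_mbit(kb_per_sec, 1000.0);
    double bin = kb_per_sec_to_mbit(kb_per_sec, 1024.0);

    printf("KB=1000 bytes: %.1f Mbit/s (%.0f%% of link)\n",
           dec, 100.0 * link_utilization(dec, link_mbit));
    printf("KB=1024 bytes: %.1f Mbit/s (%.0f%% of link)\n",
           bin, 100.0 * link_utilization(bin, link_mbit));
}
```

Calling report_rate(33889.688, 1000.0) gives roughly 271 to 278 Mbit/s, so the lockup happened at a bit over a quarter of the nominal 1 Gbit rate, not at saturation.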

On 12/24/2009 2:01 AM, Andrew Morton wrote:

cc's added again.

On Wed, 23 Dec 2009 17:54:27 -0500 Michael Breuer <mbreuer@xxxxxxxxxx> wrote:



OK, it's not the firmware. I ran another Windows backup and sky2 went down.

Nothing in dmesg.old, but I have the oops in syslog. The system became
unresponsive and the watchdog kicked in after a minute.

Also note that I had a similar oops with VT-d disabled (posted here on
12/5). I'm attaching the oops from that below this oops for comparison.
That one also happened under similar load.

Assuming I can recreate this (although it takes a while),
please let me know how I can help.

What's in my log, starting with an smbd error about 2 minutes before the
oops (note: the dhcpd is not the system doing the backup):


This (nastily wordwrapped) oops appears to be quite different from
Berck's one.



What is the MTU?

1500

It looks like the problem only shows up for packets generated by DHCP,
which come through AF_PACKET. The problem may be related to how these
packets are split into header and page: they are laid out differently
from other packets, which may confuse the driver or the DMA engine.

Does this help?
-----

--- a/drivers/net/sky2.c 2009-12-26 09:50:20.869565022 -0800
+++ b/drivers/net/sky2.c 2009-12-26 09:55:54.620645355 -0800
@@ -1616,6 +1616,13 @@ static netdev_tx_t sky2_xmit_frame(struc
         if (unlikely(tx_avail(sky2) < tx_le_req(skb)))
                 return NETDEV_TX_BUSY;

+        if (!pskb_may_pull(skb, ETH_HLEN)) {
+                if (net_ratelimit())
+                        pr_info(PFX "%s: packet missing ether header (%d)?",
+                                dev->name, skb->len);
+                goto drop;
+        }
+
         len = skb_headlen(skb);
         mapping = pci_map_single(hw->pdev, skb->data, len, PCI_DMA_TODEVICE);

@@ -1761,6 +1768,7 @@ mapping_unwind:
 mapping_error:
         if (net_ratelimit())
                 dev_warn(&hw->pdev->dev, "%s: tx mapping error\n", dev->name);
+drop:
         dev_kfree_skb(skb);
         return NETDEV_TX_OK;
 }
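
For readers unfamiliar with the helper, here is a userspace toy model of what the added check does. It is a sketch, not kernel code: struct toy_skb and the toy_* functions are hypothetical stand-ins that track only byte counts, while the real pskb_may_pull() also copies bytes out of the page fragments and can fail on allocation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_ETH_HLEN 14 /* same value as the kernel's ETH_HLEN */

/* Hypothetical stand-in for the two sk_buff counters that matter here. */
struct toy_skb {
    size_t len;      /* total bytes: linear area plus page fragments */
    size_t data_len; /* bytes that live only in page fragments */
};

/* Bytes in the linear area, which is what the driver maps first. */
size_t toy_headlen(const struct toy_skb *skb)
{
    assert(skb->data_len <= skb->len);
    return skb->len - skb->data_len;
}

/* Ensure at least 'want' bytes sit in the linear area, moving bytes
 * over from the fragments if needed. Returns false when the packet
 * is shorter than 'want'. Only the counters are adjusted here. */
bool toy_may_pull(struct toy_skb *skb, size_t want)
{
    if (want <= toy_headlen(skb))
        return true;              /* already linear, nothing to do */
    if (want > skb->len)
        return false;             /* packet too short to contain it */
    skb->data_len = skb->len - want; /* head now holds 'want' bytes */
    return true;
}

/* Decision taken in the transmit path by the patch: true means drop. */
bool toy_should_drop(struct toy_skb *skb)
{
    return !toy_may_pull(skb, TOY_ETH_HLEN);
}
```

In this model a packet whose bytes all sit in fragments (head length 0) is not dropped; it ends up with the 14-byte Ethernet header in the linear area before the mapping step. Only a packet shorter than 14 bytes is dropped.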




That seems to have done the trick!

Still one odd message sequence, but no hangs or crashes.

The first time I forced a DHCP renew while running at high throughput, I got the same SMB errors I saw in my original error log (pre-crash). This only happened once:
Dec 26 15:24:18 mail dhcpd: DHCPACK on 10.0.0.56 to 00:1c:cc:f3:9f:f6 (BLACKBERRY-9542) via eth0
Dec 26 15:24:25 mail smbd[8937]: [2009/12/26 15:24:25, 0] lib/util_sock.c:1564(matchname)
Dec 26 15:24:25 mail smbd[8937]: matchname: host name/address mismatch: ::ffff:10.0.0.11 != potter.majjas.com
Dec 26 15:24:25 mail smbd[8937]: [2009/12/26 15:24:25, 0] lib/util_sock.c:1685(get_peer_name)
Dec 26 15:24:25 mail smbd[8937]: Matchname failed on potter.majjas.com ::ffff:10.0.0.11
Dec 26 15:24:25 mail smbd[8937]: [2009/12/26 15:24:25, 0] smbd/nttrans.c:2076(call_nt_transact_ioctl)
Dec 26 15:24:25 mail smbd[8937]: call_nt_transact_ioctl(0x900eb): Currently not implemented.

I would discount this, but the same sequence was present in the logs pre-crash as well. I do not see this at all absent the preceding DHCP renew sequence. I also don't see this unless the adapter is under load.


