Re: sky2 panic in 2.6.32.1 under load (new oops)

From: Michael Breuer
Date: Sun Dec 27 2009 - 12:04:55 EST


On 12/26/2009 3:37 PM, Michael Breuer wrote:
On 12/26/2009 12:57 PM, Stephen Hemminger wrote:
On Fri, 25 Dec 2009 22:23:51 -0500
Michael Breuer<mbreuer@xxxxxxxxxx> wrote:

On 12/25/2009 6:22 PM, Stephen Hemminger wrote:
On Fri, 25 Dec 2009 11:28:55 -0500
Michael Breuer<mbreuer@xxxxxxxxxx> wrote:


More data points - I'm able to reliably recreate this now.
While I thought it was coincidence, each and every time I hit this issue
there is a DHCP renew event immediately before the first error.
The crash occurs while under load - in my case seems that the traffic is
actually IPV6 (hadn't noticed that before).
I ran nethogs on a remote display - the reported rx rate on the IPV6 smb
connection at the time of the lockup was 33889.688 KB/sec on a 1gbit
nic. I've got two events like this - don't recall if the earlier one was
the exact same # - but it was in the ballpark.

On 12/24/2009 2:01 AM, Andrew Morton wrote:

cc's added again.

On Wed, 23 Dec 2009 17:54:27 -0500 Michael Breuer<mbreuer@xxxxxxxxxx> wrote:



Ok - not the firmware. Ran another Windows backup and sky2 went down.

Nothing in dmesg.old - have oops in syslog. System became unresponsive
and watchdog kicked in after a minute.

Also note that I have a similar oops with VT-D disabled (posted here on
12/5). I'm attaching the oops from that below this oops for comparison.
That also happened under similar load.

On the assumption that I can recreate this (although it takes a while)
please let me know how I can help.

What's in my log (starting with an smbd error about 2 min before the
oops (note: the dchpd is not the system doing the backup).


This (nastily wordwrapped) oops appers to be quite different from
Berck's one.



What is the MTU?

1500

It looks like the problem only shows up for packets generated by DHCP,
and these come through AF_PACKET. The problem maybe related to how this
packets are fragmented into header and page, in a different way than other
packets confusing the driver or DMA engine.

Does this help?
-----

--- a/drivers/net/sky2.c 2009-12-26 09:50:20.869565022 -0800
+++ b/drivers/net/sky2.c 2009-12-26 09:55:54.620645355 -0800
@@ -1616,6 +1616,13 @@ static netdev_tx_t sky2_xmit_frame(struc
if (unlikely(tx_avail(sky2)< tx_le_req(skb)))
return NETDEV_TX_BUSY;

+ if (!pskb_may_pull(skb, ETH_HLEN)) {
+ if (net_ratelimit())
+ pr_info(PFX "%s: packet missing ether header (%d)?",
+ dev->name, skb->len);
+ goto drop;
+ }
+
len = skb_headlen(skb);
mapping = pci_map_single(hw->pdev, skb->data, len, PCI_DMA_TODEVICE);

@@ -1761,6 +1768,7 @@ mapping_unwind:
mapping_error:
if (net_ratelimit())
dev_warn(&hw->pdev->dev, "%s: tx mapping error\n", dev->name);
+drop:
dev_kfree_skb(skb);
return NETDEV_TX_OK;
}




That seems to have done the trick!

Still one odd message sequence, but no hangs or crashes.

The first time I forced a DHCP renew while running at high throughput, I got the same SMB errors I saw in my original error log (pre-crash). This only happened once:
Dec 26 15:24:18 mail dhcpd: DHCPACK on 10.0.0.56 to 00:1c:cc:f3:9f:f6 (BLACKBERRY-9542) via eth0
Dec 26 15:24:25 mail smbd[8937]: [2009/12/26 15:24:25, 0] lib/util_sock.c:1564(matchname)
Dec 26 15:24:25 mail smbd[8937]: matchname: host name/address mismatch: ::ffff:10.0.0.11 != potter.majjas.com
Dec 26 15:24:25 mail smbd[8937]: [2009/12/26 15:24:25, 0] lib/util_sock.c:1685(get_peer_name)
Dec 26 15:24:25 mail smbd[8937]: Matchname failed on potter.majjas.com ::ffff:10.0.0.11
Dec 26 15:24:25 mail smbd[8937]: [2009/12/26 15:24:25, 0] smbd/nttrans.c:2076(call_nt_transact_ioctl)
Dec 26 15:24:25 mail smbd[8937]: call_nt_transact_ioctl(0x900eb): Currently not implemented.

I would discount this, but the same sequence was present in the logs pre-crash as well. I do not see this at all absent the preceding DHCP renew sequence. I also don't see this unless the adapter is under load.



Just an FYI - I left stuff running overnight and saw lots of this this AM... ans one from last night:
Dec 26 21:39:54 mail kernel: sky2 0000:06:00.0: error interrupt status=0x40000008
Dec 27 05:09:39 mail kernel: sky2 0000:06:00.0: error interrupt status=0x40000008
Dec 27 05:09:39 mail kernel: sky2 0000:06:00.0: error interrupt status=0x40000008
Dec 27 05:09:39 mail kernel: sky2 0000:06:00.0: error interrupt status=0x40000008
Dec 27 05:09:39 mail kernel: sky2 0000:06:00.0: error interrupt status=0x40000008
Dec 27 05:09:41 mail kernel: sky2 0000:06:00.0: error interrupt status=0x8
Dec 27 05:09:41 mail kernel: sky2 0000:06:00.0: error interrupt status=0x40000008
Dec 27 05:09:41 mail kernel: sky2 0000:06:00.0: error interrupt status=0x40000008
Dec 27 05:09:41 mail kernel: sky2 0000:06:00.0: error interrupt status=0x40000008
Dec 27 05:09:42 mail kernel: sky2 0000:06:00.0: error interrupt status=0x40000008
Dec 27 05:09:44 mail kernel: __ratelimit: 7 callbacks suppressed

... repeats for a couple of minutes. Never saw this before, but then again, the system didn't stay up before.

I've also got this in dmesg (command, not /var/log/dmesg):

sky2 eth0: rx error, status 0x10660010 length 1518
sky2 eth0: rx error, status 0x10c60010 length 1518
sky2 eth0: rx error, status 0x14860010 length 1518
sky2 eth0: rx error, status 0x19e60010 length 1518
sky2 eth0: rx error, status 0x23360010 length 1518
sky2 eth0: rx error, status 0x23360010 length 1518
sky2 eth0: rx error, status 0x23360010 length 1518
sky2 eth0: rx error, status 0x10660010 length 1518
sky2 eth0: rx error, status 0x10a60010 length 1518
sky2 eth0: rx error, status 0x10a60010 length 1518
sky2 eth0: rx error, status 0x11060010 length 1518
sky2 eth0: rx error, status 0x11060010 length 1518
DMA-API: debugging out of memory - disabling
... then later ...
sky2 eth0: rx error, status 0x23360010 length 1518
sky2 eth0: rx error, status 0x23360010 length 1518
sky2 eth0: rx error, status 0x23360010 length 1518
sky2 eth0: rx error, status 0x23360010 length 1518
sky2 eth0: rx error, status 0x23360010 length 1518
sky2 eth0: rx error, status 0x23360010 length 1518
sky2 eth0: rx error, status 0x23360010 length 1518
sky2 eth0: rx error, status 0x23360010 length 1518
DRHD: handling fault status reg 2
DMAR:[DMA Write] Request device [06:00.0] fault addr fff82da28000
DMAR:[fault reason 05] PTE Write access is not set
DRHD: handling fault status reg 102
DMAR:[DMA Write] Request device [06:00.0] fault addr fff82da03000
DMAR:[fault reason 05] PTE Write access is not set
DRHD: handling fault status reg 202
DMAR:[DMA Write] Request device [06:00.0] fault addr fff82d9ec000
DMAR:[fault reason 05] PTE Write access is not set
DRHD: handling fault status reg 302
DMAR:[DMA Write] Request device [06:00.0] fault addr fff82d9a0000
DMAR:[fault reason 05] PTE Write access is not set
DRHD: handling fault status reg 402
DMAR:[DMA Write] Request device [06:00.0] fault addr fff82d968000
DMAR:[fault reason 05] PTE Write access is not set
------------[ cut here ]------------
WARNING: at net/sched/sch_generic.c:261 dev_watchdog+0xf3/0x164()
Hardware name: System Product Name
NETDEV WATCHDOG: eth0 (sky2): transmit queue 0 timed out
Modules linked in: ip6table_mangle ip6table_filter ip6_tables iptable_raw iptable_mangle ipt_MASQUERADE iptable_nat nf_nat bridge stp appletalk psnap llc fbcon tileblit font bitblit softcursor nfsd lockd nfs_acl auth_rpcgss exportfs sunrpc hwmon_vid coretemp acpi_cpufreq sit tunnel4 ipt_LOG nf_conntrack_netbios_ns nf_conntrack_ftp nf_conntrack_ipv6 xt_multiport xt_DSCP xt_dscp xt_MARK ipv6 dm_multipath kvm_intel kvm snd_hda_codec_analog snd_ens1371 gameport snd_rawmidi gspca_spca505 snd_hda_intel snd_ac97_codec gspca_main snd_hda_codec videodev v4l1_compat snd_hwdep ac97_bus snd_seq v4l2_compat_ioctl32 snd_seq_device snd_pcm firewire_ohci pcspkr firewire_core snd_timer snd soundcore snd_page_alloc wmi sky2 asus_atk0110 i2c_i801 crc_itu_t hwmon iTCO_wdt iTCO_vendor_support raid456 async_raid6_recov async_pq raid6_pq async_xor xor async_memcpy async_tx raid1 ata_generic pata_acpi pata_marvell nouveau ttm drm_kms_helper drm agpgart fb i2c_algo_bit cfbcopyarea i2c_core cfbimgblt cfbfillrect [last unloaded: ip6_tables]
Pid: 25, comm: ksoftirqd/7 Tainted: G W 2.6.32-00830-gca45ac4-dirty #3
Call Trace:
<IRQ> [<ffffffff8105365a>] warn_slowpath_common+0x7c/0x94
[<ffffffff810536c9>] warn_slowpath_fmt+0x41/0x43
[<ffffffff813e2dd7>] ? netif_tx_lock+0x44/0x6c
[<ffffffff813e2f3f>] dev_watchdog+0xf3/0x164
[<ffffffff8146b76a>] ? _spin_unlock_irqrestore+0x29/0x41
[<ffffffff8103c62e>] ? check_preempt_curr_idle+0x15/0x17
[<ffffffff8104357d>] ? task_rq_unlock+0x11/0x13
[<ffffffff8104e050>] ? try_to_wake_up+0x2a2/0x2b4
[<ffffffff8106316b>] run_timer_softirq+0x1c8/0x270
[<ffffffff8105ae3b>] __do_softirq+0xf8/0x1cd
[<ffffffff81012e1c>] call_softirq+0x1c/0x30
<EOI> [<ffffffff810143a3>] do_softirq+0x4b/0xa6
[<ffffffff8105a730>] ksoftirqd+0x65/0xdc
[<ffffffff8105a6cb>] ? ksoftirqd+0x0/0xdc
[<ffffffff8107275c>] kthread+0x7f/0x87
[<ffffffff81012d1a>] child_rip+0xa/0x20
[<ffffffff8101267d>] ? restore_args+0x0/0x30
[<ffffffff810726dd>] ? kthread+0x0/0x87
[<ffffffff81012d10>] ? child_rip+0x0/0x20
---[ end trace 57f7151f6a5def07 ]---
sky2 eth0: tx timeout
sky2 eth0: transmit ring 27 .. 114 report=27 done=27
sky2 eth0: disabling interface
sky2 eth0: enabling interface
sky2 eth0: Link is up at 1000 Mbps, full duplex, flow control both
sky2 0000:06:00.0: error interrupt status=0x8
sky2 0000:06:00.0: error interrupt status=0x8
sky2 0000:06:00.0: error interrupt status=0x40000008
sky2 0000:06:00.0: error interrupt status=0x8
sky2 0000:06:00.0: error interrupt status=0x40000008
sky2 0000:06:00.0: error interrupt status=0x40000008
sky2 0000:06:00.0: error interrupt status=0x40000008
sky2 0000:06:00.0: error interrupt status=0x40000008
sky2 0000:06:00.0: error interrupt status=0x8
sky2 0000:06:00.0: error interrupt status=0x40000008
sky2 0000:06:00.0: error interrupt status=0x8
sky2 0000:06:00.0: error interrupt status=0x8
sky2 0000:06:00.0: error interrupt status=0x8
iint_free: readcount: 1
iint_free: opencount: 1
iint_free: readcount: 1
iint_free: opencount: 1
sky2 0000:06:00.0: error interrupt status=0x8
sky2 0000:06:00.0: error interrupt status=0x40000008
sky2 0000:06:00.0: error interrupt status=0x40000008
sky2 0000:06:00.0: error interrupt status=0x8
sky2 0000:06:00.0: error interrupt status=0x8
sky2 0000:06:00.0: error interrupt status=0x8
sky2 0000:06:00.0: error interrupt status=0x8
... this repeats for a while

I'm guessing that the dmesg and kernel messages correspond. System stayed up and everything seems ok now. There were connections open at the time of these events and they appear to have remained up.


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/