Re: [PATCH] sky2: receive dma mapping error handling

From: Michael Breuer
Date: Sat Jan 30 2010 - 11:30:20 EST


On 1/28/2010 6:36 PM, Stephen Hemminger wrote:
> Please try this patch (and only this patch), on 2.6.33-rc5[*];
> none of the other patches that did not make it upstream because that
> confuses things too much.
>
> The code that checks for DMA mapping errors on receive buffers would
> not handle errors correctly. I doubt you have these errors, but if you
> did then it would explain the problems. The code has to be a little
> tricky and build mapping for new rx buffer before releasing old one,
> that way if new mapping fails, the old one can be reused.
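For readers following along, the ordering described in the paragraph above can be sketched in userspace C. This is not the actual patch: fake_dma_map(), fake_dma_unmap(), struct rx_slot, rx_swap_buffer() and the fail_next_map switch are all hypothetical stand-ins, and malloc() stands in for skb allocation. It only shows the "map the new buffer first, release the old one afterwards" order.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical receive-ring slot: owns exactly one mapped buffer. */
struct rx_slot {
    void *buf;
};

/* Test hook (assumption): force the next simulated mapping to fail. */
bool fail_next_map;
/* Number of simulated mappings currently live. */
int live_mappings;

/* Stand-in for a DMA map call; returns false on mapping failure. */
bool fake_dma_map(void *buf)
{
    if (buf == NULL || fail_next_map)
        return false;
    live_mappings++;
    return true;
}

/* Stand-in for a DMA unmap call. */
void fake_dma_unmap(void *buf)
{
    (void)buf;
    live_mappings--;
}

/*
 * Try to replace the buffer in a slot.
 * On success: the slot holds a freshly mapped buffer and the old,
 * now unmapped, buffer is returned to the caller.
 * On failure: the slot is left untouched, so the old buffer and its
 * mapping can be reused; NULL is returned.
 */
void *rx_swap_buffer(struct rx_slot *slot, size_t len)
{
    void *fresh = malloc(len);
    void *old;

    if (fresh == NULL)
        return NULL;

    /* Build the new mapping BEFORE touching the old one. */
    if (!fake_dma_map(fresh)) {
        free(fresh);
        return NULL;
    }

    /* Only now is it safe to release the old mapping. */
    old = slot->buf;
    fake_dma_unmap(old);
    slot->buf = fresh;
    return old;
}
```

When rx_swap_buffer() returns NULL, the caller would drop the received frame and hand the same slot back to the hardware, so the ring never ends up with an unmapped entry.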

> If it works for you, I will resubmit with signed-off.
Nope - tx crash again. This time the system stayed up (but hosed) for a few hours. When I tried to recover eth0, the system crashed.
Brief summary of events (log extract below):

System start Jan 28 19:29
Everything seemed good (load and all) until 17:13:11 the following day when I got rx errors:

Jan 29 17:13:11 mail kernel: sky2 eth0: rx error, status 0x6230010 length 1518
Jan 29 17:13:11 mail kernel: sky2 eth0: rx error, status 0x7f40010 length 1518
Jan 29 17:13:11 mail kernel: sky2 eth0: rx error, status 0x8180010 length 1518
Jan 29 17:13:12 mail kernel: sky2 eth0: rx error, status 0x7f40010 length 1518
Jan 29 17:13:12 mail kernel: sky2 eth0: rx error, status 0x6230010 length 1518
Jan 29 17:13:12 mail kernel: sky2 eth0: rx error, status 0x8180010 length 1518
Jan 29 17:13:12 mail kernel: sky2 eth0: rx error, status 0x6230010 length 1518
Jan 29 17:13:12 mail kernel: sky2 eth0: rx error, status 0x7f40010 length 1518
Jan 29 17:13:12 mail kernel: sky2 eth0: rx error, status 0x8180010 length 1518
Jan 29 17:13:14 mail kernel: sky2 eth0: rx error, status 0x5f60010 length 1518
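The status words above can be picked apart with a few lines of C. The bit layout used here is an assumption taken from memory of the sky2 driver header (frame length in bits 30..16, "receive OK" in bit 8, "too long" in bit 4) and should be checked against drivers/net/sky2.h; the RX_FS_* macros and function names are made up for this sketch.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* ASSUMED layout of the GMAC receive status word; verify against sky2.h. */
#define RX_FS_LEN_MASK   0x7fff0000u  /* bits 30..16: frame length seen by the MAC */
#define RX_FS_LEN_SHIFT  16
#define RX_FS_RX_OK      (1u << 8)    /* good packet */
#define RX_FS_LONG_ERR   (1u << 4)    /* packet too long */

/* Extract the frame length field from the status word. */
unsigned rx_status_len(uint32_t status)
{
    return (status & RX_FS_LEN_MASK) >> RX_FS_LEN_SHIFT;
}

/* True if the "too long" error bit is set. */
bool rx_status_too_long(uint32_t status)
{
    return (status & RX_FS_LONG_ERR) != 0;
}

/* True if the MAC flagged the frame as received OK. */
bool rx_status_ok(uint32_t status)
{
    return (status & RX_FS_RX_OK) != 0;
}

/* Print one decoded line per logged status word. */
void rx_status_describe(uint32_t status, unsigned reported_len)
{
    printf("status 0x%07x: mac_len=%u reported_len=%u ok=%d too_long=%d\n",
           (unsigned)status, rx_status_len(status), reported_len,
           rx_status_ok(status), rx_status_too_long(status));
}
```

If that layout is right, calling rx_status_describe(0x6230010, 1518) would show a MAC-side length of 1571 with the too-long bit set and the OK bit clear, and the other three status values decode to lengths of 2036, 2072 and 1526, none of which match the reported 1518.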

The system continued running normally after this until this morning (Jan 30) at 05:44:55:
Jan 30 05:44:55 mail kernel: DRHD: handling fault status reg 2
Jan 30 05:44:55 mail kernel: DMAR:[DMA Read] Request device [06:00.0] fault addr ffc4331ff000
Jan 30 05:44:55 mail kernel: DMAR:[fault reason 06] PTE Read access is not set
Jan 30 05:44:55 mail kernel: net_ratelimit: 2 callbacks suppressed
Jan 30 05:44:55 mail kernel: sky2 0000:06:00.0: error interrupt status=0xc0000000
Jan 30 05:44:55 mail kernel: sky2 0000:06:00.0: PCI hardware error (0x2010)
Jan 30 05:45:01 mail kernel: ------------[ cut here ]------------
Jan 30 05:45:01 mail kernel: WARNING: at net/sched/sch_generic.c:255 dev_watchdog+0xf3/0x161()
Jan 30 05:45:01 mail kernel: Hardware name: System Product Name
Jan 30 05:45:01 mail kernel: NETDEV WATCHDOG: eth0 (sky2): transmit queue 0 timed out
Jan 30 05:45:01 mail kernel: Modules linked in: iptable_raw iptable_mangle ipt_MASQUERADE iptable_nat nf_nat cpufreq_stats ip6table_filter ip6table_mangle ip6_tables bridge stp appletalk psnap llc nfsd lockd nfs_acl auth_rpcgss exportfs hwmon_vid coretemp sunrpc acpi_cpufreq sit tunnel4 ipt_LOG nf_conntrack_netbios_ns nf_conntrack_ftp xt_DSCP xt_dscp xt_MARK nf_conntrack_ipv6 xt_multiport ipv6 dm_multipath kvm_intel kvm snd_hda_codec_analog snd_hda_intel snd_ens1371 gameport snd_hda_codec snd_rawmidi snd_ac97_codec gspca_spca505 ac97_bus gspca_main snd_hwdep videodev snd_seq snd_seq_device v4l1_compat snd_pcm v4l2_compat_ioctl32 snd_timer snd soundcore snd_page_alloc firewire_ohci pcspkr i2c_i801 firewire_core wmi asus_atk0110 crc_itu_t sky2 hwmon iTCO_wdt iTCO_vendor_support fbcon tileblit font bitblit softcursor raid456 async_raid6_recov async_pq raid6_pq async_xor xor async_memcpy async_tx raid1 ata_generic pata_acpi pata_marvell nouveau ttm drm_kms_helper drm agpgart fb i2c_algo_bit cfbcopyarea i2c_core cf
Jan 30 05:45:01 mail kernel: bimgblt cfbfillrect [last unloaded: nf_nat]
Jan 30 05:45:01 mail kernel: Pid: 0, comm: swapper Tainted: G W 2.6.33-rc5WITHMMAPNODMARFORKTIPSKY2DMAMAP-00283-gd4d37bd-dirty #1
Jan 30 05:45:01 mail kernel: Call Trace:
Jan 30 05:45:01 mail kernel: <IRQ> [<ffffffff8104a03d>] warn_slowpath_common+0x7c/0x94
Jan 30 05:45:01 mail kernel: [<ffffffff8104a0ac>] warn_slowpath_fmt+0x41/0x43
Jan 30 05:45:01 mail kernel: [<ffffffff813d2f43>] ? netif_tx_lock+0x44/0x6c
Jan 30 05:45:01 mail kernel: [<ffffffff813d30ab>] dev_watchdog+0xf3/0x161
Jan 30 05:45:01 mail kernel: [<ffffffff8106a31f>] ? sched_clock_cpu+0x44/0xce
Jan 30 05:45:01 mail kernel: [<ffffffff8105761a>] run_timer_softirq+0x1c3/0x26b
Jan 30 05:45:01 mail kernel: [<ffffffff8105060c>] __do_softirq+0xf8/0x1cd
Jan 30 05:45:01 mail kernel: [<ffffffff8107192b>] ? tick_program_event+0x2a/0x2c
Jan 30 05:45:01 mail kernel: [<ffffffff8100ab1c>] call_softirq+0x1c/0x30
Jan 30 05:45:01 mail kernel: [<ffffffff8100c2b3>] do_softirq+0x4b/0xa3
Jan 30 05:45:01 mail kernel: [<ffffffff810501f8>] irq_exit+0x4a/0x8c
Jan 30 05:45:01 mail kernel: [<ffffffff81461859>] smp_apic_timer_interrupt+0x86/0x94
Jan 30 05:45:01 mail kernel: [<ffffffff8100a5d3>] apic_timer_interrupt+0x13/0x20
Jan 30 05:45:01 mail kernel: <EOI> [<ffffffff812afbd4>] ? acpi_idle_enter_bm+0x256/0x28a
Jan 30 05:45:01 mail kernel: [<ffffffff812afbcd>] ? acpi_idle_enter_bm+0x24f/0x28a
Jan 30 05:45:01 mail kernel: [<ffffffff8139574c>] cpuidle_idle_call+0x9e/0xfa
Jan 30 05:45:01 mail kernel: [<ffffffff81008c05>] cpu_idle+0xb4/0xf6
Jan 30 05:45:01 mail kernel: [<ffffffff81455d48>] start_secondary+0x201/0x242
Jan 30 05:45:01 mail kernel: ---[ end trace 57f7151f6a5def07 ]---
Jan 30 05:45:01 mail kernel: sky2 eth0: tx timeout
Jan 30 05:45:01 mail kernel: sky2 eth0: transmit ring 14 .. 102 report=14 done=14
Jan 30 05:45:01 mail kernel: sky2 eth0: disabling interface
Jan 30 05:45:01 mail kernel: sky2 eth0: enabling interface

This down/up continued for several hours until I intervened at about 10:05.
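As a side note on the "transmit ring 14 .. 102 report=14 done=14" line in the extract: as far as I understand the driver's tx timeout message, the first two numbers are the consumer and producer indices of the transmit ring, and the last two are what the hardware reports as completed. Under that reading, the number of descriptors queued but never completed can be computed as below. The ring size of 128 in the usage note is an assumption (any power of two larger than 102 gives the same answer for these values), and ring_pending() is a name made up for this sketch.

```c
/*
 * Number of descriptors between the consumer and producer index of a
 * circular ring. ring_size must be a power of two so that masking
 * handles wrap-around of the producer index.
 */
unsigned ring_pending(unsigned cons, unsigned prod, unsigned ring_size)
{
    return (prod - cons) & (ring_size - 1);
}
```

With the logged values, ring_pending(14, 102, 128) gives 88, i.e. 88 descriptors were handed to the hardware while its report index stayed at 14, which is consistent with the transmitter having stopped entirely rather than just running slowly.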

I saw that there was no eth0 connectivity; eth1 was ok. It appeared that eth0 was receiving traffic but unable to send. arpwatch was reporting bogons, DHCP showed many DISCOVER/OFFER pairs, no REQUEST/ACK. Pings to any system failed; arp showed incomplete for anything hanging off of eth0. arping also failed.
I manually stopped and started eth0 (ifconfig) and reset iptables (although eth0 has no filters).

As I started looking at logs, the system hung and rebooted. I'm up now with dma debug enabled; however, as with 2.6.32.4, num_entries is dropping, and I don't think that dma debug will remain enabled long enough to catch a crash.

So, as I see things, there are two issues here: 1) the TX hang following the DMAR error and 2) the inability to recover the interface and the subsequent system instability.



