Re: Regression: sky2 kernel between 3.1 and 3.2.1 (last known good 3.0.9)

From: Michael Breuer
Date: Fri Jan 20 2012 - 11:18:10 EST


On 1/20/2012 11:10 AM, Stephen Hemminger wrote:
On Fri, 20 Jan 2012 09:24:38 -0500
Michael Breuer <mbreuer@xxxxxxxxxx> wrote:

On 1/16/2012 11:39 AM, Michael Breuer wrote:
Synopsis:

Receiving DMAR and other errors after approximately three days of
uptime. The symptoms exactly match errors seen and then fixed around
2.6.32.4.

The system runs for too long before failing to make a bisect practical,
but I was able to confirm that the problem exists in the 3.1 stable
branch (I jumped from 3.0 to 3.2 when 3.2 was released).

For now I reverted to the sky2.c from 3.0.9 and am running the rest of
the kernel from 3.1.2, but won't be certain that this works until
later in the week.

Note that 20 seconds before the log extract below there were DHCP
renewal attempts on eth1; the issue below was on eth0. I'm not sure
that's relevant, but back in 2010 a preceding DHCP event did turn out
to be relevant to the manifestation of the bug.

The 3.2.1-dirty I'm running is from git with a single local patch -
for sidewinder force-feedback support (shouldn't be relevant to the
sky2 issue).

Log extract:

Jan 16 05:49:46 mail kernel: [198230.628919] DRHD: handling fault
status reg 2
Jan 16 05:49:46 mail kernel: [198230.628925] sky2 0000:06:00.0: error
interrupt status=0x80000000
Jan 16 05:49:46 mail kernel: [198230.628929] DMAR:[DMA Read] Request
device [06:00.0] fault addr fff78000
Jan 16 05:49:46 mail kernel: [198230.628931] DMAR:[fault reason 06]
PTE Read access is not set
Jan 16 05:49:46 mail kernel: [198230.628939] sky2 0000:06:00.0: PCI
hardware error (0x2010)
Jan 16 05:49:53 mail dhclient[1616]: DHCPREQUEST on eth1 to
10.240.184.29 port 67
Jan 16 05:50:01 mail kernel: [198246.288400] ------------[ cut here
]------------
Jan 16 05:50:01 mail kernel: [198246.288408] WARNING: at
net/sched/sch_generic.c:255 dev_watchdog+0x247/0x250()
Jan 16 05:50:01 mail kernel: [198246.288411] Hardware name: System
Product Name
Jan 16 05:50:01 mail kernel: [198246.288413] NETDEV WATCHDOG: eth0
(sky2): transmit queue 0 timed out
Jan 16 05:50:01 mail kernel: [198246.288415] Modules linked in: tcp_lp
cpufreq_stats ebtable_nat ebtables nf_conntrack_netbios_ns
nf_conntrack_broadcast ip6table_mangle ip6table_filter ip6_tables
iptable_mangle ipt_MASQUERADE iptable_nat nf_nat iptable_raw tun
bridge stp llc lockd sit tunnel4 ipt_LOG nf_conntrack_ftp
nf_conntrack_ipv6 nf_defrag_ipv6 xt_CHECKSUM xt_multiport xt_DSCP
w83627ehf xt_mark xt_dscp hwmon_vid binfmt_misc raid1 btrfs sunrpc
zlib_deflate libcrc32c snd_hda_codec_analog snd_ens1371 gameport
snd_hda_intel snd_rawmidi snd_ac97_codec snd_hda_codec snd_hwdep
ac97_bus snd_seq snd_seq_device snd_pcm gspca_spca505 snd_timer
gspca_main snd videodev media soundcore i2c_i801 iTCO_wdt microcode
v4l2_compat_ioctl32 snd_page_alloc i7core_edac sky2 edac_core pcspkr
iTCO_vendor_support virtio_net virtio virtio_ring kvm_intel kvm uinput
ipv6 raid456 async_raid6_recov async_pq raid6_pq async_xor
firewire_ohci firewire_core pata_acpi ata_generic xor async_memcpy
async_tx crc_itu_t pata_marvell nouveau ttm d
Jan 16 05:50:01 mail kernel: rm_kms_helper drm i2c_algo_bit i2c_core
mxm_wmi video [last unloaded: nf_conntrack_broadcast]
Jan 16 05:50:01 mail kernel: [198246.288487] Pid: 0, comm: swapper/0
Tainted: G W 3.2.1-dirty #1
Jan 16 05:50:01 mail kernel: [198246.288489] Call Trace:
Jan 16 05:50:01 mail kernel: [198246.288491]<IRQ>
[<ffffffff81050a4f>] warn_slowpath_common+0x7f/0xc0
Jan 16 05:50:01 mail kernel: [198246.288501] [<ffffffff8101f0bd>] ?
lapic_next_event+0x1d/0x30
Jan 16 05:50:01 mail kernel: [198246.288504] [<ffffffff81050b46>]
warn_slowpath_fmt+0x46/0x50
Jan 16 05:50:01 mail kernel: [198246.288509] [<ffffffff81009319>] ?
read_tsc+0x9/0x20
Jan 16 05:50:01 mail kernel: [198246.288513] [<ffffffff814a81e7>]
dev_watchdog+0x247/0x250
Jan 16 05:50:01 mail kernel: [198246.288518] [<ffffffff8105fbbb>]
run_timer_softirq+0x12b/0x3b0
Jan 16 05:50:01 mail kernel: [198246.288521] [<ffffffff814a7fa0>] ?
qdisc_reset+0x50/0x50
Jan 16 05:50:01 mail kernel: [198246.288525] [<ffffffff81057d18>]
__do_softirq+0xa8/0x210
Jan 16 05:50:01 mail kernel: [198246.288529] [<ffffffff8157496c>]
call_softirq+0x1c/0x30
Jan 16 05:50:01 mail kernel: [198246.288533] [<ffffffff810041e5>]
do_softirq+0x65/0xa0
Jan 16 05:50:01 mail kernel: [198246.288536] [<ffffffff810580fe>]
irq_exit+0x8e/0xb0
Jan 16 05:50:01 mail kernel: [198246.288539] [<ffffffff815750a3>]
do_IRQ+0x63/0xe0
Jan 16 05:50:01 mail kernel: [198246.288543] [<ffffffff8156ad2e>]
common_interrupt+0x6e/0x6e
Jan 16 05:50:01 mail kernel: [198246.288545]<EOI>
[<ffffffff81307b6d>] ? intel_idle+0xed/0x150
Jan 16 05:50:01 mail kernel: [198246.288551] [<ffffffff81307b4f>] ?
intel_idle+0xcf/0x150
Jan 16 05:50:01 mail kernel: [198246.288555] [<ffffffff8144d331>]
cpuidle_idle_call+0xc1/0x280
Jan 16 05:50:01 mail kernel: [198246.288559] [<ffffffff8100122a>]
cpu_idle+0xca/0x120
Jan 16 05:50:01 mail kernel: [198246.288563] [<ffffffff8154741e>]
rest_init+0x72/0x74
Jan 16 05:50:01 mail kernel: [198246.288568] [<ffffffff81b6abdd>]
start_kernel+0x3b5/0x3c0
Jan 16 05:50:01 mail kernel: [198246.288572] [<ffffffff81b6a32b>]
x86_64_start_reservations+0x132/0x136
Jan 16 05:50:01 mail kernel: [198246.288576] [<ffffffff81b6a140>] ?
early_idt_handlers+0x140/0x140
Jan 16 05:50:01 mail kernel: [198246.288580] [<ffffffff81b6a431>]
x86_64_start_kernel+0x102/0x111
Jan 16 05:50:01 mail kernel: [198246.288583] ---[ end trace
bb26011d21a2b1d7 ]---
Jan 16 05:50:01 mail kernel: [198246.288586] sky2 0000:06:00.0: eth0:
tx timeout
Jan 16 05:50:01 mail kernel: [198246.288593] sky2 0000:06:00.0: eth0:
transmit ring 115 .. 10 report=115 done=115
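
For anyone triaging a similar extract, here is a minimal Python sketch that pulls the DMAR fault and the NETDEV watchdog timeout out of syslog text and reports the gap between them. It is only an illustration: the regular expressions are assumptions fitted to the lines quoted above, the sample lines are copied from the extract with the mail client's line wrapping undone, and the names scan, FAULT and WATCHDOG are made up for this example.

```python
import re

# Sample lines copied from the extract above, unwrapped to one message per line.
LOG = """\
Jan 16 05:49:46 mail kernel: [198230.628925] sky2 0000:06:00.0: error interrupt status=0x80000000
Jan 16 05:49:46 mail kernel: [198230.628929] DMAR:[DMA Read] Request device [06:00.0] fault addr fff78000
Jan 16 05:49:46 mail kernel: [198230.628931] DMAR:[fault reason 06] PTE Read access is not set
Jan 16 05:50:01 mail kernel: [198246.288413] NETDEV WATCHDOG: eth0 (sky2): transmit queue 0 timed out
"""

# Kernel timestamp (seconds since boot) followed by the message text.
STAMP = re.compile(r"\[\s*(\d+\.\d+)\]\s*(.*)")
# Patterns are assumptions based on the message wording seen in this log only.
FAULT = re.compile(
    r"DMAR:\[DMA (Read|Write)\] Request device \[([0-9a-f:.]+)\] fault addr ([0-9a-f]+)"
)
WATCHDOG = re.compile(r"NETDEV WATCHDOG: (\S+) \((\S+)\): transmit queue (\d+) timed out")


def scan(log_text):
    """Return (timestamp, kind, subject, detail) tuples in log order."""
    events = []
    for line in log_text.splitlines():
        stamp = STAMP.search(line)
        if not stamp:
            continue  # e.g. dhclient lines carry no kernel timestamp
        when, msg = float(stamp.group(1)), stamp.group(2)
        fault = FAULT.search(msg)
        if fault:
            events.append((when, "dmar_fault", fault.group(2), fault.group(3)))
        timeout = WATCHDOG.search(msg)
        if timeout:
            events.append((when, "tx_timeout", timeout.group(1), timeout.group(3)))
    return events


events = scan(LOG)
fault_time = next(t for t, kind, _, _ in events if kind == "dmar_fault")
timeout_time = next(t for t, kind, _, _ in events if kind == "tx_timeout")
delay = timeout_time - fault_time
print(f"tx timeout followed the DMAR fault by {delay:.1f} s")
```

On a real log the wrapped lines would have to be rejoined first, or the text read straight from the syslog file, before the patterns can match.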



FYI - I've been up for four days now without issues running on 3.2.1 +
sky2.c from 3.0.9. Looks like the issue is in fact in one of the
modifications made in sky2.c between those two releases.

Since only you seem to be able to reproduce it, most likely the
bisect burden will be on you. If you know it is only one file,
then bisecting that file is fairly quick.

As of now, I have no reliable way to reproduce... so this is likely to take about 3-4 days per bisect run... more if it doesn't fail.
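
To put a rough number on that, here is a small Python sketch of the arithmetic. It assumes a plain binary search over the commits that touch the one file, so the number of test runs is about the ceiling of log2 of the candidate count, and it uses the 3-4 days per run quoted above. The commit count of 16 is a hypothetical placeholder, not the real number of sky2.c changes between the two releases, and bisect_estimate is a name made up for this example.

```python
def bisect_estimate(candidate_commits, days_low=3, days_high=4):
    """Estimate test runs and wall-clock days for a binary search.

    candidate_commits: commits that could be the first bad one (hypothetical input).
    days_low/days_high: uptime needed per run before calling a kernel good.
    """
    if candidate_commits < 1:
        raise ValueError("need at least one candidate commit")
    # (n - 1).bit_length() equals ceil(log2(n)) for n >= 1, using integers only.
    runs = (candidate_commits - 1).bit_length()
    return runs, runs * days_low, runs * days_high


runs, low, high = bisect_estimate(16)  # 16 is an assumed count, for illustration
print(f"{runs} runs, roughly {low}-{high} days")
```

A run that never fails has to stay up longer than a run that does, so the upper figure is an optimistic bound for the good kernels.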

Suggestions for diagnostic code to add, or a specific bias toward one version or another, could reduce the time significantly.

I've also got some windows where I have to leave a stable version up.
