Re: HDD problem, software bug, bios bug, or hardware ?

From: Adko Branil
Date: Mon Sep 03 2012 - 17:46:28 EST






> But now an interrupt occurs during the hard
>reset, and pdc_interrupt also has to take the host lock.

I have 4 sata connectors on the motherboard - http://www.hardwareheaven.com/reviews/asusA8V/
If i understand right, just 2 of them are linked to promise controller. Well i changed the connector where my HDDÂ is connected,
and dmesg warning message is a bit different now - "ata_bmdma_interrupt" was replaced by "pdc_interrupt".




[ÂÂÂ 1.973049] ata1: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
[ÂÂÂ 1.984574]
[ÂÂÂ 1.984575] =================================
[ÂÂÂ 1.984576] [ INFO: inconsistent lock state ]
[ÂÂÂ 1.984578] 3.5.2 #4 Not tainted
[ÂÂÂ 1.984578] ---------------------------------
[ÂÂÂ 1.984579] inconsistent {HARDIRQ-ON-W} -> {IN-HARDIRQ-W} usage.
[ÂÂÂ 1.984581] swapper/1/0 [HC1[1]:SC0[0]:HE0:SE1] takes:
[ÂÂÂ 1.984582]Â (&(&host->lock)->rlock){?.+...}, at: [<ffffffff8190abd8>] pdc_interrupt+0x48/0x420
[ÂÂÂ 1.984590] {HARDIRQ-ON-W} state was registered at:
[ÂÂÂ 1.984591]ÂÂ [<ffffffff810998fb>] __lock_acquire+0x61b/0x1af0
[ÂÂÂ 1.984595]ÂÂ [<ffffffff8109b31a>] lock_acquire+0x8a/0x110
[ÂÂÂ 1.984597]ÂÂ [<ffffffff81b4d051>] _raw_spin_lock+0x31/0x40
[ÂÂÂ 1.984602]ÂÂ [<ffffffff8190b3c5>] pdc_sata_hardreset+0x85/0x100
[ÂÂÂ 1.984604]ÂÂ [<ffffffff818eabba>] ata_do_reset+0x3a/0x90
[ÂÂÂ 1.984607]ÂÂ [<ffffffff818edd72>] ata_eh_reset+0x372/0xe00
[ÂÂÂ 1.984610]ÂÂ [<ffffffff818eec25>] ata_eh_recover+0x2a5/0x13d0
[ÂÂÂ 1.984612]ÂÂ [<ffffffff818f073d>] ata_do_eh+0x4d/0xb0
[ÂÂÂ 1.984614]ÂÂ [<ffffffff818f33ba>] ata_sff_error_handler+0xca/0x120
[ÂÂÂ 1.984617]ÂÂ [<ffffffff8190a9e4>] pdc_error_handler+0x24/0x30
[ÂÂÂ 1.984619]ÂÂ [<ffffffff818f029c>] ata_scsi_port_error_handler+0x47c/0x800
[ÂÂÂ 1.984622]ÂÂ [<ffffffff818f06be>] ata_scsi_error+0x9e/0xd0
[ÂÂÂ 1.984624]ÂÂ [<ffffffff816732e8>] scsi_error_handler+0xf8/0x500
[ÂÂÂ 1.984628]ÂÂ [<ffffffff810654fe>] kthread+0xae/0xc0
[ÂÂÂ 1.984631]ÂÂ [<ffffffff81b4f5f4>] kernel_thread_helper+0x4/0x10
[ÂÂÂ 1.984634] irq event stamp: 49968
[ 1.984635] hardirqs last enabled at (49965): [<ffffffff8100ab7d>] default_idle+0x5d/0x1b0
[ÂÂÂ 1.984639] hardirqs last disabled at (49966): [<ffffffff81b4da67>] common_interrupt+0x67/0x6c
[ 1.984641] softirqs last enabled at (49968): [<ffffffff810498d3>] _local_bh_enable+0x13/0x20
[ÂÂÂ 1.984645] softirqs last disabled at (49967): [<ffffffff8104a355>] irq_enter+0x75/0x90
[ÂÂÂ 1.984648]
[ÂÂÂ 1.984648] other info that might help us debug this:
[ÂÂÂ 1.984648]Â Possible unsafe locking scenario:
[ÂÂÂ 1.984648]
[ÂÂÂ 1.984649]ÂÂÂÂÂÂÂ CPU0
[ÂÂÂ 1.984649]ÂÂÂÂÂÂÂ ----
[ÂÂÂ 1.984650]ÂÂ lock(&(&host->lock)->rlock);
[ÂÂÂ 1.984651]ÂÂ <Interrupt>
[ÂÂÂ 1.984651]ÂÂÂÂ lock(&(&host->lock)->rlock);
[ÂÂÂ 1.984653]
[ÂÂÂ 1.984653]Â *** DEADLOCK ***
[ÂÂÂ 1.984653]
[ÂÂÂ 1.984654] no locks held by swapper/1/0.
[ÂÂÂ 1.984654]
[ÂÂÂ 1.984654] stack backtrace:
[ÂÂÂ 1.984656] Pid: 0, comm: swapper/1 Not tainted 3.5.2 #4
[ÂÂÂ 1.984656] Call Trace:
[ÂÂÂ 1.984657]Â <IRQ>Â [<ffffffff81b35961>] print_usage_bug+0x1f7/0x208
[ÂÂÂ 1.984662]Â [<ffffffff8101001f>] ? save_stack_trace+0x2f/0x50
[ÂÂÂ 1.984665]Â [<ffffffff81098730>] ? print_shortest_lock_dependencies+0x1d0/0x1d0
[ÂÂÂ 1.984668]Â [<ffffffff810992a2>] mark_lock+0x262/0x2a0
[ÂÂÂ 1.984670]Â [<ffffffff81099af3>] __lock_acquire+0x813/0x1af0
[ÂÂÂ 1.984672]Â [<ffffffff8107da11>] ? load_balance+0xd1/0x730
[ÂÂÂ 1.984676]Â [<ffffffff8107baa0>] ? update_cfs_shares+0x100/0x100
[ÂÂÂ 1.984678]Â [<ffffffff8109b31a>] lock_acquire+0x8a/0x110
[ÂÂÂ 1.984680]Â [<ffffffff8190abd8>] ? pdc_interrupt+0x48/0x420
[ÂÂÂ 1.984682]Â [<ffffffff81b4d051>] _raw_spin_lock+0x31/0x40
[ÂÂÂ 1.984685]Â [<ffffffff8190abd8>] ? pdc_interrupt+0x48/0x420
[ÂÂÂ 1.984686]Â [<ffffffff8107e228>] ? rebalance_domains+0x1b8/0x1e0
[ÂÂÂ 1.984689]Â [<ffffffff8190abd8>] pdc_interrupt+0x48/0x420
[ÂÂÂ 1.984691]Â [<ffffffff810c2d5d>] handle_irq_event_percpu+0x5d/0x1f0
[ÂÂÂ 1.984694]Â [<ffffffff810c2f38>] handle_irq_event+0x48/0x70
[ÂÂÂ 1.984695]Â [<ffffffff81b4d059>] ? _raw_spin_lock+0x39/0x40
[ÂÂÂ 1.984698]Â [<ffffffff810c5f0e>] ? handle_fasteoi_irq+0x1e/0xf0
[ÂÂÂ 1.984700]Â [<ffffffff810c5f4a>] handle_fasteoi_irq+0x5a/0xf0
[ÂÂÂ 1.984702]Â [<ffffffff81003ce2>] handle_irq+0x22/0x40
[ÂÂÂ 1.984706]Â [<ffffffff81b4fdaa>] do_IRQ+0x5a/0xe0
[ÂÂÂ 1.984708]Â [<ffffffff81b4da6c>] common_interrupt+0x6c/0x6c
[ÂÂÂ 1.984711]Â <EOI>Â [<ffffffff8100ab7f>] ? default_idle+0x5f/0x1b0
[ÂÂÂ 1.984714]Â [<ffffffff8100ab7d>] ? default_idle+0x5d/0x1b0
[ÂÂÂ 1.984717]Â [<ffffffff8100b816>] cpu_idle+0x106/0x120
[ÂÂÂ 1.984719]Â [<ffffffff81b2c7db>] start_secondary+0x1dd/0x1e4
[ÂÂÂ 1.984761] ata1.00: ATA-6: ST3200822AS, 3.01, max UDMA/133
[ÂÂÂ 1.984763] ata1.00: 390721968 sectors, multi 0: LBA48
[ÂÂÂ 2.006548] ata1.00: configured for UDMA/133




Here is the full dmesg - http://pastebin.com/1xLETNd5
And the crashes are continuing - heavy disk operations like "md5sum -c Ð.md5", where a.md5 contains md5 sums of more than 300 000 files,

speed up the crash between 10 and 100 times, compared to the idle case, and it crashes with oops like:




[10432.283017] BUG: unable to handle kernel NULL pointer dereference atÂÂÂÂÂÂÂÂÂÂ (null)
[10432.293110] IP: [<ffffffff8109530c>] update_ts_time_stats+0x6c/0x90
[10432.293110] PGD 7a223067 PUD 7958a067 PMD 0
[10432.293110] Oops: 0002 [#1] SMP
[10432.293110] CPU 1
[10432.293110] Modules linked in:[10432.293110]Â netconsole snd_seq_dummy snd_seq_oss snd_seq_midi_event snd_seq snd_pcm_oss snd_mixer_oss ipt_REJECT xt_tcpudp iptable_raw iptable_mangle iptable_nat nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_conntrack iptable_filter ip_tables x_tables lp fuse nouveau mxm_wmi wmi video ttm drm_kms_helper drm snd_via82xx snd_ac97_codec snd_mpu401_uart snd_rawmidi powernow_k8 snd_seq_device snd_pcm mperf processor snd_page_alloc snd_timer i2c_algo_bit snd soundcore ac97_bus gameport ppdev k8temp shpchp parport_pc freq_table amd64_agp i2c_viapro thermal_sys button skge parport evdev i2c_core hwmon agpgart loop

[10432.293110] Pid: 0, comm: swapper/1 Not tainted 3.5.2 #4 To Be Filled By O.E.M. To Be Filled By O.E.M./A8V Deluxe
[10432.293110] RIP: 0010:[<ffffffff8109530c>]Â [<ffffffff8109530c>] update_ts_time_stats+0x6c/0x90
[10432.293110] RSP: 0018:ffff88007fd03ed8Â EFLAGS: 00010046
[10432.293110] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
[10432.293110] RDX: 0000097cf485c143 RSI: ffff88007fd0da40 RDI: 0000000000000001
[10432.293110] RBP: ffff88007fd03f18 R08: 0000000000000000 R09: 0000000000000000
[10432.293110] R10: 0000000000000000 R11: 0000000000000001 R12: 00000000000e1360
[10432.293110] R13: 000000000000da40 R14: 0000097cf485c143 R15: 0000000000000000
[10432.293110] FS:Â 00007f8de7906740(0000) GS:ffff88007fd00000(0000) knlGS:00000000f6f66700
[10432.293110] CS:Â 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[10432.293110] CR2: 0000000000000000 CR3: 000000007a9d7000 CR4: 00000000000007e0
[10432.293110] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[10432.293110] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[10432.293110] Process swapper/1 (pid: 0, threadinfo ffff88007c524000, task ffff88007d07dd60)
[10432.293110] Stack:
[10432.293110]Â 0000097cf485c143 ffff88007fd0da40 ffff88007c525fd8 ffff88007c525fd8
[10432.293110]Â 0000010000000009 0000000000000012 ffff88007fd0da40 0000000000000001
[10432.293110]Â ffff88007fd03f38 ffffffff8109535d 000000000000da40 ffff88007fd0da40
[10432.293110] Call Trace:
[10432.293110]Â <IRQ>
[10432.293110]Â [<ffffffff8109535d>] tick_nohz_stop_idle+0x2d/0x50
[10432.293110]Â [<ffffffff81095873>] tick_check_idle+0xc3/0xe0
[10432.293110]Â [<ffffffff8104a35c>] irq_enter+0x7c/0x90
[10432.293110]Â [<ffffffff81b4fe6e>] smp_apic_timer_interrupt+0x3e/0x99
[10432.293110]Â [<ffffffff81b4eedc>] apic_timer_interrupt+0x6c/0x80
[10432.293110]Â <EOI>
[10432.293110]Â [<ffffffff8100ab7d>] ? default_idle+0x5d/0x1b0
[10432.293110]Â [<ffffffff8100ab7f>] ? default_idle+0x5f/0x1b0
[10432.293110]Â [<ffffffff8100ab7d>] ? default_idle+0x5d/0x1b0
[10432.293110]Â [<ffffffff8100b816>] cpu_idle+0x106/0x120
[10432.293110]Â [<ffffffff81b2c7db>] start_secondary+0x1dd/0x1e4
[10432.293110] Code: f0 4c 8b 65 f8 c9 c3 49 89 d4 4c 2b a6 a0 00 00 00 48 89 55 c0 48 89 75 c8 e8 11 fb fd ff 48 85 c0 48 8b 55 c0 48 8b 75 c8 75 14 <4c> 01 a6 b8 00 00 00 48 89 96 a0 00 00 00 eb a8 0f 1f 40 00 4c
[10432.293110] RIPÂ [<ffffffff8109530c>] update_ts_time_stats+0x6c/0x90
[10432.293110]Â RSP <ffff88007fd03ed8>
[10432.293110] CR2: 0000000000000000
[10432.293110] [drm] nouveau 0000:01:00.0: Setting dpms mode 0 on vga encoder (output 0)
[10432.293110] ---[ end trace 9b7353d10a6013b6 ]---
[10432.293110] Kernel panic - not syncing: Fatal exception in interrupt
[10432.293110] panic occurred, switching back to text console
[10432.293110] Rebooting in 10 seconds..




>I don't know why the interrupt occurs, hotplug events are supposed to have been masked
>by ->freeze before ->hardreset. It might come from a different device

If this is the case, does it mean hardware bug ? Broken PIC, or just particular device, like integrated sound or lan ?
Is this interrupt, with unknown origin, the reason of both - dmesg warning and the crash caused this oops?

Any cure about that ? Except that i will try to disable what i can in BIOS.


>I think I'll have to introduce a new private lock just for serializing
>pdc_hard_reset_port. Expect a patch next weekend (I'll be away from
>my Promise test equipment until then.)


Could this future patch solve the "crash problem", as well as the "warning problem",
or it will not bearing on the "crash problem"?




Thanks.

Best regards.
Adko.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/