Re: [PATCH] Remove softlockup from invalidate_mapping_pages. (mightbe dm related)
From: Andrew Morton
Date: Sat May 13 2006 - 11:27:04 EST
"Steinar H. Gunderson" <sgunderson@xxxxxxxxxxx> wrote:
>
> On Sat, May 13, 2006 at 07:51:47AM -0700, Andrew Morton wrote:
> > ho-hum. Please see if there's anything else you can do to rule out a
> > hardware failure, then copy dm-devel@xxxxxxxxxx on the next oops report.
>
> It's not swap related. It crashes even without swap enabled; still with lots of
> dm stuff in the backtrace, though:
(for dm-devel: 2.6.15.4 runs OK on this machine with the same config) (yes?)
> [ 3192.568880] general protection fault: 0000 [1] SMP
> [ 3192.573779] CPU 1
> [ 3192.575804] Modules linked in: w83627hf_wdt eeprom ide_generic ide_disk ide_cd cdrom ipv6 psmouse i2c_nforce2 serio_raw pcspkr i2c_core parport_pc parport rtc ext3 jbd mbcache raid6 raid5 xor raid10 raid1 raid0 linear md_mod dm_mod sd_mod sata_nv tg3 sata_sil libata scsi_mod forcedeth generic amd74xx ehci_hcd ide_core ohc i_hcd thermal processor fan unix
> [ 3192.607472] Pid: 3432, comm: md1_raid5 Not tainted 2.6.17-rc4 #1
> [ 3192.613471] RIP: 0010:[<ffffffff803a1ae8>] <ffffffff803a1ae8>{__lock_text_start+0}
> [ 3192.620870] RSP: 0018:ffff81000245bd70 EFLAGS: 00210086
> [ 3192.626374] RAX: 000000000000fa40 RBX: aaaa8b5ad269b80f RCX: ffff81000151ff50
> [ 3192.633501] RDX: ffff81007febd600 RSI: ffff81000151fef0 RDI: aaaa8b5ad269b81f
> [ 3192.640628] RBP: 000000000000fa40 R08: ffff81007db6d2c0 R09: ffff81007db6d2c0
> [ 3192.647756] R10: 0000000000000007 R11: ffffffff8024c868 R12: ffff81007febf040
> [ 3192.654882] R13: 0000000000200296 R14: ffff810004a97fb0 R15: 0000000000000000
> [ 3192.662009] FS: 0000000000000000(0000) GS:ffff81007f827840(0000) knlGS:00000000f7ad9ae0
> [ 3192.670101] CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
> [ 3192.675843] CR2: 00000000080850e0 CR3: 000000007c11e000 CR4: 00000000000006e0
> [ 3192.682970] Process md1_raid5 (pid: 3432, threadinfo ffff81007ddb6000, task ffff81007f997080)
> [ 3192.691495] Stack: ffffffff802668b8 ffff81007f27c880 ffff810049bb23c0 ffff810049bb23c0
> [ 3192.699353] 0000000000000000 ffff81007d0788b8 ffff810049bb23c0 0000000000000000
> [ 3192.707404] ffffffff880d3b67 ffff81007ed0cba8
> [ 3192.712476] Call Trace: <IRQ> <ffffffff802668b8>{kmem_cache_free+186}
> [ 3192.718950] <ffffffff880d3b67>{:dm_mod:clone_endio+135} <ffffffff802c9372>{__end_that_request_first+420}
> [ 3192.729081] <ffffffff802c7d1b>{blk_run_queue+62} <ffffffff8806f8a6>{:scsi_mod:scsi_end_request+40}
> [ 3192.738700] <ffffffff8806fb51>{:scsi_mod:scsi_io_completion+522}
> [ 3192.745334] <ffffffff880cc4a1>{:sd_mod:sd_rw_intr+623} <ffffffff880705d6>{:scsi_mod:scsi_device_unbusy+85}
> [ 3192.755641] <ffffffff802c86cb>{blk_done_softirq+113} <ffffffff8022c41b>{__do_softirq+86}
> [ 3192.764377] <ffffffff8020a742>{call_softirq+30} <ffffffff8020b902>{do_softirq+44}
> [ 3192.772509] <ffffffff8020b947>{do_IRQ+65} <ffffffff80209aa0>{ret_from_intr+0} <EOI>
> [ 3192.780822] <ffffffff881129a7>{:raid5:compute_parity+880} <ffffffff802d5e2f>{memcmp+11}
> [ 3192.789476] <ffffffff8811467a>{:raid5:handle_stripe+3022} <ffffffff80238a7c>{keventd_create_kthread+0}
> [ 3192.799424] <ffffffff80238a7c>{keventd_create_kthread+0} <ffffffff881153e9>{:raid5:raid5d+333}
> [ 3192.808682] <ffffffff880e464f>{:md_mod:md_thread+0} <ffffffff880e4751>{:md_mod:md_thread+258}
> [ 3192.817864] <ffffffff80238e78>{autoremove_wake_function+0} <ffffffff880e464f>{:md_mod:md_thread+0}
> [ 3192.827471] <ffffffff80238cc4>{kthread+203} <ffffffff8020a3f2>{child_rip+8}
> [ 3192.835081] <ffffffff80238a7c>{keventd_create_kthread+0} <ffffffff80238bf9>{kthread+0}
> [ 3192.843646] <ffffffff8020a3ea>{child_rip+0}
> [ 3192.848646]
> [ 3192.848647] Code: f0 ff 0f 0f 88 c8 01 00 00 c3 f0 ff 0f 8b 07 ba 01 00 00 00
> [ 3192.857563] RIP <ffffffff803a1ae8>{__lock_text_start+0} RSP <ffff81000245bd70>
> [ 3192.865039] <0>Kernel panic - not syncing: Aiee, killing interrupt handler!
> [ 3192.872164] <0>Rebooting in 60 seconds..
>
> (Thank goodness for serial console; I couldn't possibly write all these oopses
> by hand. :-) )
>
> > The stack backtrace you have there is a little surprising. Enabling
> > CONFIG_FRAME_POINTER might help clear it up. Also it'd be worth seeing if
> > CONFIG_DEBUG_SLAB turns up anything.
>
> I'm recompiling 2.6.17-rc4 now with those two added in. I'll let you know in a
> few hours when it crashes again, I'd guess :-)
>
> Would it be a good idea to revert your mm patch and test again, just in case?
Which patch? remove-softlockup-from-invalidate_mapping_pages.patch? No,
that won't have caused this. But then, it's not really obvious what this
crash is.
Now your earlier trace had this important info:
[ 1127.842645] Unable to handle kernel NULL pointer dereference at 0000000000000040 RIP:
[ 1127.848117] <ffffffff803a1ae8>{__lock_text_start+0}
[ 1127.855474] PGD 5e38a067 PUD 5e39d067 PMD 0
[ 1127.859770] Oops: 0002 [1] SMP
[ 1127.862931] CPU 1
Which kind of implies that we passed a null (or very small small) `struct
kmem_cache' pointer into kmem_cache_free(). But that doesn't seem like the
sort of thing which will take hours to reproduce.
Do you have CONFIG_NUMA set?
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/