Re: 2.6.18 BUG: unable to handle kernel NULL pointer dereference atvirtual address 000,0000a

From: Andrew Morton
Date: Sun Sep 24 2006 - 06:24:13 EST


On Sun, 24 Sep 2006 11:11:02 +0200
Christian Weiske <cweiske@xxxxxxxxxx> wrote:

> Andrew,
>

You keep on losing Cc:s. Please preserve them all with care when replying.

>
> >> I have a reproducible BUG on my server that occurs whenever disk usage
> >> gets too high / too much swapping occurs (at least I think that is). The
> >> box has one reiserfs filesystem of about 187GB size, the disk is on an
> >> Epia 5000 board, between them is a Promise Ultra 100 PCI IDE controller
> >> card.
> > Do you think this bug is due to the 2.6.18 upgrade?
>
> No. I already had it in 2.6.17.6.
>
> > Have you run fsck across the filesystem(s)?
> fsck at boot turns up
> > ReiserFS: hde3: checking transaction log (hde3)
> > ReiserFS: hde3: replayed 22 transactions in 0 seconds
> > ReiserFS: hde3: Using r5 hash to sort names
> nothing more
>
> > Does the oops always look the same as this one?
> No, not exactly the same. I attach three log files. If you diff them,
> there will be about 30% of the lines different.
>
> One thing I have to note is that the second Oops appears about 10
> seconds after the first one.
>
> > Please turn on the various CONFIG_DEBUG_* options, see if that turns up
> > anything.
> That indeed turns up something. The debug messages indicate that java
> wants to lock something and gets stuck. Note that the messages until
> "slab corruption" are printed first, and the others about a minute or
> two later.
>
> And I still can ping and do everything until the slab corruption occurs.
> (Thus the other messages some minute later)
>
>
> > It would be interesting to find out if enabling CONFIG_4KSTACKS makes this
> > go away (although I'm not sure why).
> Didn't try this yet, but will.
>
> I put the logs in a tar.bz2 because I didn't want to flood the list with
> a 200k message.
>

OK, you have crashes in the scheduler and one crash when accessing a
reiserfs structure.

You have tcp_v6 lockdep warnings. They're in
http://xml.cweiske.de/dojo%20kernelpanic%20+%20debug.tar.bz2 is anyone is
keen. (I've largely lost interest in lockdep warnings - many of them are
false positives and require make-lockdep-shut-up patches).

You have what claims to be a netfilter-related memory corruption:

Slab corruption: start=c608a42c, len=172
Redzone: 0x6b6b6b6b/0xc0411958.
Last user: [<170fc2a5>](0x170fc2a5)
0a0: 6b 6b 6b 6b 6b 6b 6b a5 71 f0 2c 5a
Prev obj: start=c608a2c1, len=172
Redzone: 0xec0410f/0x1170fc2.
Last user: [<30000000>](0x30000000)
000: 00 00 00 10 a2 08 c6 a8 a5 08 c6 46 3a 00 00 10
010: 10 41 c0 bc a2 08 c6 20 d3 60 c0 00 00 00 00 00
slab error in cache_alloc_debugcheck_after(): cache `ip_conntrack': double freen
[<c01034b9>] show_trace+0x19/0x20
[<c01035ba>] dump_stack+0x1a/0x20
[<c0160c11>] __slab_error+0x21/0x30
[<c0162ca1>] cache_alloc_debugcheck_after+0x121/0x1a0
[<c0162ffb>] kmem_cache_alloc+0x6b/0xc0
[<c041184c>] ip_conntrack_alloc+0x3c/0x130
[<c041198a>] init_conntrack+0x2a/0x110
[<c0411c4e>] ip_conntrack_in+0x1de/0x230
BUG: unable to handle kernel NULL pointer dereference at virtual address 0000008
printing eip:



And another in what appears to be core ipv4:

Slab corruption: start=c3aff608, len=240
Redzone: 0x6b6b6b6b/0x0.
Last user: [<170fc2a5>](0x170fc2a5)
0e0: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b a5 71 f0 2c 5a
Prev obj: start=c3aff48f, len=240
Redzone: 0x6b6b6b6b/0x6b6b6b6b.
Last user: [<6b6b6b6b>](0x6b6b6b6b)
000: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b
010: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b
slab error in cache_alloc_debugcheck_after(): cache `ip_dst_cache': double freen
[<c01034b9>] show_trace+0x19/0x20
[<c01035ba>] dump_stack+0x1a/0x20
[<c0160c11>] __slab_error+0x21/0x30
[<c0162ca1>] cache_alloc_debugcheck_after+0x121/0x1a0
[<c0162ffb>] kmem_cache_alloc+0x6b/0xc0
[<c03ca3c4>] dst_alloc+0x24/0x90
[<c03da865>] ip_route_input_slow+0x295/0x8c0
[<c03daf92>] ip_route_input+0x102/0x1d0
[<c03dd29a>] ip_rcv+0x27a/0x440
[<c03c6d41>] netif_receive_skb+0x1b1/0x1f0
[<c03c6e10>] process_backlog+0x90/0x120
[<c03c6f0d>] net_rx_action+0x6d/0x100
[<c011d4af>] __do_softirq+0x6f/0x100
[<c011d59f>] do_softirq+0x5f/0x70
[<c011d603>] irq_exit+0x53/0x60
[<c0104c28>] do_IRQ+0x38/0x70
[<c0103145>] common_interrupt+0x25/0x30
[<c028e19b>] memcpy+0x3b/0x50
[<c028e208>] memmove+0x38/0x50
[<c01bf85d>] leaf_paste_in_buffer+0x7d/0x320
[<c01a862c>] balance_leaf+0x24c/0x27d0
[<c01aaee0>] do_balance+0x60/0xf0
[<c01c56e4>] reiserfs_paste_into_item+0x164/0x190
[<c01b3ab5>] reiserfs_allocate_blocks_for_region+0x925/0x12e0
[<c01b5b2c>] reiserfs_file_write+0x72c/0x7c0
[<c0166768>] vfs_write+0x88/0x170
[<c01668fc>] sys_write+0x3c/0x70
[<c0102e77>] syscall_call+0x7/0xb


And another networking-related scribble:


Slab corruption: start=c64159ec, len=156
Redzone: 0x6b6b6b6b/0xc03c048a.
Last user: [<170fc2a5>](0x170fc2a5)
090: 6b 6b 6b 6b 6b 6b 6b a5 71 f0 2c 5a
Prev obj: start=c64158ec, len=156
Redzone: 0x6b6b6b6b/0x6b6b6b6b.
Last user: [<6b6b6b6b>](0x6b6b6b6b)
000: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b
010: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b
slab error in cache_alloc_debugcheck_after(): cache `skbuff_head_cache': doublen
BUG: unable to handle kernel paging request at virtual address b2724e87
printing eip:
c028bf49
*pde = 00000000
Oops: 0000 [#1]


A lot of your oopses seem to point at the hrtimer code:


BUG: unable to handle kernel paging request at virtual address b2724e87
printing eip:
c028bf49
*pde = 00000000
Oops: 0000 [#1]
PREEMPT
Modules linked in:
CPU: 0
EIP: 0060:[<c028bf49>] Not tainted VLI
EFLAGS: 00010086 (2.6.18 #2)
EIP is at __rb_erase_color+0x59/0x180
eax: b2724e87 ebx: c69c1f64 ecx: cdbcdf64 edx: 00000000
esi: c69c1f64 edi: c05137d8 ebp: c6405390 esp: c6405384
ds: 007b es: 007b ss: 0068
Process Øc.{.J.. (pid: 0, ti=c6404000 task=c0511b40 task.ti=c01174d0)
Stack: c69c1f64 00000000 c05137d8 c64053b4 c028c17b 00000000 c69c1f64 c0513804
00000001 cdbcdf64 c05137d8 c05137d8 c64053cc c012f4ba cdbcdf64 c0513804
c012f7f0 cdbcdf64 c64053f0 c012f777 cdbcdf64 c05137d8 c05137dc 00000001
Call Trace:
[<c010354e>] show_stack_log_lvl+0x8e/0xb0
[<c010370a>] show_registers+0x14a/0x1d0
[<c0103987>] die+0x167/0x210
[<c010ed13>] do_page_fault+0x173/0x580
[<c0103199>] error_code+0x39/0x40
[<c028c17b>] rb_erase+0x10b/0x140
[<c012f4ba>] __remove_hrtimer+0x1a/0x40
[<c012f777>] hrtimer_run_queues+0x77/0xf0
[<c0121f46>] run_timer_softirq+0x16/0x1a0
[<c011d4af>] __do_softirq+0x6f/0x100
[<c011d59f>] do_softirq+0x5f/0x70



What a mess.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/