Random crashes (was Re: Crashes and lockups in XFS filesystem (2.6.8-rc4).)

From: David Martinez Moreno
Date: Thu Aug 19 2004 - 12:42:19 EST

Next message: Horst von Brand: "Re: PATCH: cdrecord: avoiding scsi device numbering for ide devices"
Previous message: Martin Mares: "Re: PATCH: cdrecord: avoiding scsi device numbering for ide devices"
In reply to: David Martínez Moreno: "Re: Crashes and lockups in XFS filesystem (2.6.8-rc4)."
Next in thread: Nathan Scott: "Re: Crashes and lockups in XFS filesystem (2.6.8-rc4 and 2.6.8.1-mm2)."
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

El Jueves, 19 de Agosto de 2004 10:44, Nathan Scott escribió:
> On Wed, Aug 18, 2004 at 06:16:57PM +0200, David Martinez Moreno wrote:
> > Hello, I am getting persistent lockups that could be IMHO XFS-related. I
> > created a fresh XFS filesystem in a SCSI disk, with xfsprogs version
> > 2.6.18.
> >
> > Mounted /dev/sda1 under /mnt, after that, I have been copying lots of
> > files from /dev/md0, then run a find blabla -exec rm \{\{ \; over /mnt
> > and then voilà! the lockup:
>
> Did /mnt run out of space while doing that? Or nearly? There's
> a known issue with that area of the XFS code, in conjunction with
> 4K stacks at the moment - was that enabled in your .config?
>
> Looks like something stamped on parts of the xfs_mount structure
> for the filesystem mounted at /mnt, a stack overrun would explain
> that and your subsequent oopsen.

Hello, Nathan. I am here at the University now, and yes, the kernel were
compiled with 4K stacks. I rebuilt it with 8K and rebooted:

ulises:~# zcat /proc/config.gz |grep STACKS
# CONFIG_4KSTACKS is not set

I ran a disk test and waited.

After an hour, the system crashed. I forgot to plug in a console, so rebooted
and waited again. And here it is:

------------[ cut here ]------------
kernel BUG at mm/vmscan.c:565!
invalid operand: 0000 [#1]
CPU: 0
EIP: 0060:[<c01307f1>] Not tainted
EFLAGS: 00010046 (2.6.8.1)
EIP is at shrink_cache+0x7c/0x2c7
eax: 00000000 ebx: c0415d24 ecx: c12485f8 edx: c12485f8
esi: 0000001d edi: c0415d48 ebp: 0000001c esp: c15bfeb0
ds: 007b es: 007b ss: 0068
Process kswapd0 (pid: 10, threadinfo=c15be000 task=c15720b0)
Stack: c15ab080 00000002 c15be000 00000020 c10444f8 c10f8b18 00000000 00000001
00000286 c15c0130 c15bfee4 c012ddd8 00000000 00000286 00000000 00000000
00000000 00000000 c15bfef8 c15bfef8 00000000 00000000 00000000 dff28180
Call Trace:
[<c012ddd8>] drain_cpu_caches+0x40/0x42
[<c0130085>] shrink_slab+0x7b/0x186
[<c0130f19>] shrink_zone+0x9e/0xb8
[<c01312d6>] balance_pgdat+0x1c9/0x22d
[<c0131401>] kswapd+0xc7/0xd7
[<c0112dac>] autoremove_wake_function+0x0/0x57
[<c0112dac>] autoremove_wake_function+0x0/0x57
[<c013133a>] kswapd+0x0/0xd7
[<c0101fdd>] kernel_thread_helper+0x5/0xb
Code: 0f 0b 35 02 02 81 3c c0 8b 51 04 8b 01 89 50 04 89 02 c7 41
<1>Unable to handle kernel NULL pointer dereference at virtual address
00000004
printing eip:
c012e3c7
*pde = 00000000
Oops: 0000 [#2]
CPU: 0
EIP: 0060:[<c012e3c7>] Not tainted
EFLAGS: 00010006 (2.6.8.1)
EIP is at free_block+0x43/0xcb
eax: 00800000 ebx: 00000000 ecx: 00000000 edx: c1000000
esi: c0415d48 edi: 00000000 ebp: c0415d54 esp: c15bfb20
ds: 007b es: 007b ss: 0068
Process kswapd0 (pid: 10, threadinfo=c15be000 task=c15720b0)
Stack: c022b4c2 dffe0090 c0415d64 2002002c d242f800 00000286 c1401480 c012e499
c0415d48 c1161948 2002002c c1161948 c1161938 d242f800 00000286 0000062a
c012e7a6 c0415d48 c1161938 c1a4ce80 dcdab000 00000004 c0338237 d242f800
Call Trace:
[<c022b4c2>] memmove+0x4d/0x4f
[<c012e499>] cache_flusharray+0x4a/0xb6
[<c012e7a6>] kfree+0x5e/0x62
[<c0338237>] kfree_skbmem+0x13/0x2c
[<c03382b8>] __kfree_skb+0x68/0xdd
[<c03591aa>] tcp_clean_rtx_queue+0x12f/0x3a0
[<c0359a53>] tcp_ack+0xb4/0x560
[<c035bc4d>] __tcp_data_snd_check+0xd3/0xe2
[<c035c454>] tcp_rcv_established+0x419/0x84a
[<c036470d>] tcp_v4_do_rcv+0x117/0x11c
[<c0364d51>] tcp_v4_rcv+0x63f/0x884
[<c025b3a5>] scrup+0xe3/0xf7
[<c034badf>] ip_local_deliver+0xa3/0x1a2
[<c034bec4>] ip_rcv+0x2e6/0x3fe
[<c033d335>] netif_receive_skb+0x14b/0x17e
[<c033d3dd>] process_backlog+0x75/0xf6
[<c033d4c8>] net_rx_action+0x6a/0xe2
[<c0117d52>] __do_softirq+0x7e/0x80
[<c0117d7a>] do_softirq+0x26/0x28
[<c0105e79>] do_IRQ+0xc4/0xdf
[<c0104e82>] do_invalid_op+0x0/0xcb
[<c0104460>] common_interrupt+0x18/0x20
[<c0104e82>] do_invalid_op+0x0/0xcb
[<c0104b61>] die+0x7a/0xcd
[<c0104f4b>] do_invalid_op+0xc9/0xcb
[<c01307f1>] shrink_cache+0x7c/0x2c7
[<c02217da>] vn_purge+0x108/0x116
[<c010451d>] error_code+0x2d/0x38
[<c01307f1>] shrink_cache+0x7c/0x2c7
[<c012ddd8>] drain_cpu_caches+0x40/0x42
[<c0130085>] shrink_slab+0x7b/0x186
[<c0130f19>] shrink_zone+0x9e/0xb8
[<c01312d6>] balance_pgdat+0x1c9/0x22d
[<c0131401>] kswapd+0xc7/0xd7
[<c0112dac>] autoremove_wake_function+0x0/0x57
[<c0112dac>] autoremove_wake_function+0x0/0x57
[<c013133a>] kswapd+0x0/0xd7
[<c0101fdd>] kernel_thread_helper+0x5/0xb
Code: 8b 53 04 8b 03 89 50 04 89 02 c7 43 04 00 02 20 00 2b 4b 0c
<0>Kernel panic: Fatal exception in interrupt
In interrupt handler - not syncing

Last time I seem to remember some function related to ext3, so this is more a
data/stack corruption than others, what do you think?

I attach current dmesg and config, if it helps.

Is there anything else I could do? Patches, other trees, special
configurations? This machine seems impossible to stabilize.

The machine is using SiI libata driver, if it could have some problem
related...

Many thanks in advance,

Ender.
--
Why is a cow? Mu. (Ommmmmmmmmm)

Attachment: dmesg.gz
Description: GNU Zip compressed data

Attachment: config.gz
Description: GNU Zip compressed data

Next message: Horst von Brand: "Re: PATCH: cdrecord: avoiding scsi device numbering for ide devices"
Previous message: Martin Mares: "Re: PATCH: cdrecord: avoiding scsi device numbering for ide devices"
In reply to: David Martínez Moreno: "Re: Crashes and lockups in XFS filesystem (2.6.8-rc4)."
Next in thread: Nathan Scott: "Re: Crashes and lockups in XFS filesystem (2.6.8-rc4 and 2.6.8.1-mm2)."
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]