Re: Frequent oops in shrink_mmap

Rik van Riel (riel@nl.linux.org)
Thu, 18 Nov 1999 13:42:08 +0100 (CET)


On Wed, 17 Nov 1999, David desJardins wrote:

> Google runs on a cluster of 2000+ Linux/Intel machines.
> We have a problem on several of those machines, a symptom of which
> is kernel oops messages in try_to_free_buffers, called from
> shrink_mmap. (The machines which have these kernel oops messages
> tend to have poor performance in general, even at times when the
> oops aren't being generated.)

That sounds consistant with having an error in try_to_free_buffers.
When that function is in trouble (or very heavily loaded), that
means that the machine is already tight on memory and bound to
have poor performance.

What we need to find out now is why things are going wrong and
how we can avoid both the oops and the slow performance. The
bug sounds *ahum* interesting, so we really should do something
about it...

> We are currently running 2.2.7 on these machines, but I see from
> searching linux-kernel traffic that the problem of kernel oops in
> shrink_mmap is a long-standing one which has been reported in many
> different kernel versions, up to at least 2.2.12, and has also
> been reported in Linux/Alpha. I've looked at all of the past
> archives, and I didn't see any definitive response to these
> problems.

I think Alan will be able to tell us if any changes to the locking
infrastructure or buffer cache have been integrated in 2.2.x that
could help fix this bug. There's quite a bit of pointer magic going
on in that part of the kernel, so every little thing could help :)

> *** Linux s17 2.2.7 #6 Mon May 10 16:58:52 PDT 1999 i686 unknown
> Nov 15 15:50:42 s17 kernel: Unable to handle kernel paging request at virtual address 40000014
> Nov 15 15:50:42 s17 kernel: current->tss.cr3 = 00101000, %cr3 = 00101000
> Nov 15 15:50:42 s17 kernel: *pde = 00000000
> Nov 15 15:50:42 s17 kernel: Oops: 0000
> Nov 15 15:50:42 s17 kernel: CPU: 0
> Nov 15 15:50:42 s17 kernel: EIP: 0010:[try_to_free_buffers+18/136]
> Nov 15 15:50:42 s17 kernel: EFLAGS: 00010202
> Nov 15 15:50:42 s17 kernel: eax: 40000000 ebx: c029b320 ecx: 00000006 edx: 00020000
> Nov 15 15:50:42 s17 kernel: esi: 40000000 edi: 40000000 ebp: c029b320 esp: c000dfa8
> Nov 15 15:50:42 s17 kernel: ds: 0018 es: 0018 ss: 0018
> Nov 15 15:50:42 s17 kernel: Process kswapd (pid: 4, process nr: 4, stackpage=c000d000)
> Nov 15 15:50:42 s17 kernel: Stack: 00000030 c000c000 c011b386 c029b320 00000020 00000006 c01200eb 00000006
> Nov 15 15:50:42 s17 kernel: 00000030 00000000 c01a1eee c000c1c1 c0120197 00000030 00000f00 c0003fb4
> Nov 15 15:50:42 s17 kernel: c0106000 00009000 c01073eb 00000000 00000f00 c01cdfd8
> Nov 15 15:50:42 s17 kernel: Call Trace: [shrink_mmap+218/300] [do_try_to_free_pages+35/120] [tvecs+5218/20156] [kswapd+87/200] [get_options+0/116] [kernel_thread+35/48]
> Nov 15 15:50:42 s17 kernel: Code: 8b 76 14 83 78 20 00 75 06 f6 40 18 46 74 13 6a 00 e8 4c 01

Hrmmm, it would help a lot if you could drag this numeric
oops through the ksymoops program to give us symbolic output :)

Rik

--
The Internet is not a network of computers. It is a network
of people. That is its real strength.

- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.rutgers.edu Please read the FAQ at http://www.tux.org/lkml/