Overcommit problems with 2.6.12-rc4 (on AMD64)

From: Steinar H. Gunderson
Date: Thu Jun 02 2005 - 12:23:14 EST


(Please Cc me on answers, I don't follow LKML.)

Hi,

Suddenly one of our servers, a dual Opteron with 2GB of memory (running a
32-bit userland on a 64-bit kernel), started to behave oddly:

imapd[31528]: segfault at 00000000fff00000 rip 00000000556a1a6d rsp 00000000ffffd394 error 4
imapd[31527]: segfault at 00000000fff00000 rip 00000000556a1a6d rsp 00000000ffffcbe4 error 4
sh[31530]: segfault at 00000000ffff7ff4 rip 000000005555e556 rsp 00000000ffff7ff8 error 6
sh[31531]: segfault at 00000000ffff7e5c rip 00000000555dc575 rsp 00000000ffff7e60 error 6
Unable to load interpreter /lib/ld-linux.so.2
Unable to load interpreter /lib/ld-linux.so.2
(ad infinitum)
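
For reading the "error N" field in the segfault lines above: on x86 it is the
page-fault error code, a small bit mask. The sketch below decodes only the
three low bits; the function name is hypothetical and chosen for illustration.

```python
# Decode the low bits of an x86 page-fault error code, as printed in the
# "segfault at ... error N" kernel log lines.
# decode_page_fault_error is a hypothetical helper, not a kernel interface.

def decode_page_fault_error(code: int) -> dict:
    return {
        # bit 0: 0 = page not present, 1 = protection violation
        "protection_violation": bool(code & 0x1),
        # bit 1: 0 = read access, 1 = write access
        "write": bool(code & 0x2),
        # bit 2: 0 = fault in kernel mode, 1 = fault in user mode
        "user_mode": bool(code & 0x4),
    }

for code in (4, 6):
    print(code, decode_page_fault_error(code))
```

Read this way, "error 4" is a user-mode read of a page that is not present,
and "error 6" is a user-mode write to a page that is not present.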

It turned out it had some sort of memory problem:

Jun 2 11:56:02 cassarossa smbd[7171]: oplock_break: malloc fail for input buffer.
Jun 2 11:56:02 cassarossa smbd[7171]: open_mode_check: FAILED when breaking oplock (3) on file login.bat, dev = 900, inode = 110665

This wasn't a RAM problem, as the machine has ECC RAM and we received no
warnings from it. Also, we definitely had enough swap:

cassarossa:~# free
             total       used       free     shared    buffers     cached
Mem:       2058300    2041136      17164          0      39576    1601468
-/+ buffers/cache:     400092    1658208
Swap:      3903712          0    3903712
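
As a quick check of the numbers above, the "-/+ buffers/cache" row can be
rederived from the "Mem:" row. This is only a sketch of the arithmetic; the
helper name is made up for illustration.

```python
# Values in kB, copied from the "Mem:" row of the free output above.
used_kb, free_kb = 2041136, 17164
buffers_kb, cached_kb = 39576, 1601468

def without_cache(used, free, buffers, cached):
    """Return (used, free) with buffers and page cache counted as free."""
    reclaimable = buffers + cached
    return used - reclaimable, free + reclaimable

used_by_programs, effectively_free = without_cache(
    used_kb, free_kb, buffers_kb, cached_kb
)
print(used_by_programs, effectively_free)  # 400092 1658208
```

So only about 400 MB is held by programs; the rest of the "used" figure is
buffers and cache.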

It looks like somehow, the kernel couldn't really distinguish between memory
used as cache and just "used". It couldn't even swapoff:

cassarossa:~# swapoff -a
swapoff: /dev/sda5: Cannot allocate memory
swapoff: /dev/sdf5: Cannot allocate memory

However, we run with vm.overcommit_memory=2, so we figured it was worth a
shot to switch it back to 0:

cassarossa:~# echo 0 > /proc/sys/vm/overcommit_memory
cassarossa:~# swapoff -a
cassarossa:~# swapon -a
cassarossa:~# free -m
             total       used       free     shared    buffers     cached
Mem:          2010       1993         16          0         39       1595
-/+ buffers/cache:        358       1651
Swap:         3812          0       3812

Suddenly everything was back to normal (i.e. we could swapoff, and the
programs stopped running out of memory; the amount of cache in use did not
change, though), and after a quick restart of services the machine has been fine.
So to me, it looks like vm.overcommit_memory=2 is broken, at least on AMD64.
Any ideas why this would happen?
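
For context, with vm.overcommit_memory=2 the kernel is documented to cap
committed address space at roughly swap plus vm.overcommit_ratio percent of
RAM. The sketch below works that out for this machine. It assumes the default
overcommit_ratio of 50 (the actual setting on the box is not shown here) and
ignores huge pages and any reserve; the function name is illustrative only.

```python
# Rough commit limit under vm.overcommit_memory=2, in kB.
# Assumption: overcommit_ratio is the default of 50; the real value on the
# machine is not stated in this message.

def commit_limit_kb(ram_kb: int, swap_kb: int, overcommit_ratio: int = 50) -> int:
    return swap_kb + ram_kb * overcommit_ratio // 100

ram_kb, swap_kb = 2058300, 3903712  # totals from the first free output

print(commit_limit_kb(ram_kb, swap_kb))  # 4932862 (with swap enabled)
print(commit_limit_kb(ram_kb, 0))        # 1029150 (with all swap removed)
```

On a live system, the kernel's own figures are the CommitLimit and
Committed_AS lines in /proc/meminfo.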

For the record:

cassarossa:~# uname -a
Linux cassarossa 2.6.12-rc4 #1 SMP Fri May 13 18:49:40 CEST 2005 x86_64 unknown

No kernel patches except for a microscopic forward-port of the ELF fix from
2.6.11.9.

/* Steinar */
--
Homepage: http://www.sesse.net/