[PATCH][0/8] (Targeting 2.6.17) Posix memory locking and balanced mlock-LRU semantic

From: Stone Wang
Date: Mon Mar 20 2006 - 08:33:21 EST


Both a friend of mine (who is working on a DBMS derived from
PostgreSQL) and I have encountered unexpected OOMs with mlock/mlockall.

After careful code reading and testing, I found that the cause of the
OOM is that the VM's LRU algorithm treats mlocked pages as Active/Inactive,
even though mlocked pages can never be reclaimed.

Mlocking many pages easily causes an imbalance between the LRU lists
and the slab caches: the VM tends to reclaim from the Active/Inactive
lists, most of whose pages are mlocked, so an OOM may be triggered even
though there are in fact plenty of reclaimable pages in the slab caches.
( Setting a large "vfs_cache_pressure" may help to avoid the OOM
in this situation, but I think it is better to do things right than
to depend on the "vfs_cache_pressure" tunable.)

We think it is wrong semantics to treat mlocked pages as Active/Inactive.
Mlocked pages should not be counted by the page-reclaim algorithm,
since in fact they are never affected by page reclaim.

The following patch tries to fix this, with some additions.

The patch brings Linux:
1. Posix mlock/munlock/mlockall/munlockall.
It brings mlock/munlock/mlockall/munlockall up to the Posix definition:
transaction-like, just as described in the mlock(2)/munlock(2)/
mlockall(2)/munlockall(2) man pages. Thus users of the mlock system
call family always have a clear map of their mlocked areas.
2. More consistent LRU semantics in memory management.
Mlocked pages are placed on a separate LRU list: the Wired list.
These pages take no part in the LRU algorithms, since they can never
be swapped out until munlocked.
3. The Wired (mlocked) page count is exported through /proc/meminfo.
One line is added to /proc/meminfo, "Wired: N kB", so that Linux system
administrators/programmers can have a clearer map of physical memory usage.


Test of the patch:

Test environment:
RHEL4.
Total physical memory size: 256 MB, no swap.
One ext3 directory ("/mnt/test") with about 256 thousand small
files (each 2 kB in size).

Step 1. Run a task that mlocks 220 MB.
Step 2. Run: "find /mnt/test -size 100"


Case A. Standard kernel.org kernel 2.6.15

Linux soon ran into OOM; memory info at OOM time:

[root@Linux ~]# cat /proc/meminfo
MemTotal: 254248 kB
MemFree: 3144 kB
Buffers: 124 kB
Cached: 1584 kB
SwapCached: 0 kB
Active: 229308 kB
Inactive: 596 kB
HighTotal: 0 kB
HighFree: 0 kB
LowTotal: 254248 kB
LowFree: 3144 kB
SwapTotal: 0 kB
SwapFree: 0 kB
Dirty: 0 kB
Writeback: 0 kB
Mapped: 228556 kB
Slab: 20076 kB
CommitLimit: 127124 kB
Committed_AS: 238424 kB
PageTables: 584 kB
VmallocTotal: 770040 kB
VmallocUsed: 180 kB
VmallocChunk: 769844 kB
HugePages_Total: 0
HugePages_Free: 0
Hugepagesize: 4096 kB


Case B. Patched 2.6.15

No OOM occurred.

[root@Linux ~]# cat /proc/meminfo
MemTotal: 254344 kB
MemFree: 3508 kB
Buffers: 6352 kB
Cached: 2684 kB
SwapCached: 0 kB
Active: 7140 kB
Inactive: 4732 kB
Wired: 225284 kB
HighTotal: 0 kB
HighFree: 0 kB
LowTotal: 254344 kB
LowFree: 3508 kB
SwapTotal: 0 kB
SwapFree: 0 kB
Dirty: 72 kB
Writeback: 0 kB
Mapped: 229208 kB
Slab: 12552 kB
CommitLimit: 127172 kB
Committed_AS: 238168 kB
PageTables: 572 kB
VmallocTotal: 770040 kB
VmallocUsed: 180 kB
VmallocChunk: 769844 kB
HugePages_Total: 0
HugePages_Free: 0
Hugepagesize: 4096 kB


Many thanks to Mel Gorman for his book "Understanding the Linux Virtual
Memory Manager". Also, thanks to two other great Linux kernel books: ULK3
and LDD3.

FreeBSD's VM implementation enlightened me; thanks to the FreeBSD guys.

The attachment is the full patch; the following mails are the pieces it
splits into.

Shaoping Wang

Attachment: patch-2.6.15-memlock
Description: Binary data