[PATCH 0/8] (targeting 2.6.17) POSIX memory locking and balanced mlock/LRU semantics
From: Stone Wang
Date: Mon Mar 20 2006 - 08:33:21 EST
A friend of mine (who works on a DBMS derived from PostgreSQL) and I have
both run into unexpected OOMs with mlock/mlockall.
After careful code reading and testing, I found that the OOMs happen
because the VM's LRU algorithm treats mlocked pages as ordinary
Active/Inactive pages, even though mlocked pages can never be reclaimed.
Mlocking many pages can therefore easily unbalance the LRU lists against
the slab caches: the VM keeps trying to reclaim from the Active/Inactive
lists, most of which are mlocked pages, and may trigger the OOM killer,
while in fact there are plenty of reclaimable pages in the slab caches.
(Setting a large vfs_cache_pressure, as in the sketch below, may help to
avoid the OOM in this situation, but I think it is better to do things
right than to depend on the vfs_cache_pressure tunable.)
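
For completeness, a minimal sketch of that workaround (an illustration
only, not part of the patch): it writes a large value into the
vfs_cache_pressure sysctl, equivalent to
"sysctl -w vm.vfs_cache_pressure=10000"; the value 10000 is just an
example.

	#include <stdio.h>

	/* Raise vm.vfs_cache_pressure so reclaim leans harder on the
	 * dentry/inode slab caches. Needs root to write the sysctl. */
	int main(void)
	{
		FILE *f = fopen("/proc/sys/vm/vfs_cache_pressure", "w");

		if (!f) {
			perror("fopen");
			return 1;
		}
		fprintf(f, "10000\n");
		fclose(f);
		return 0;
	}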
We think treating mlocked pages as Active/Inactive is the wrong semantic.
Mlocked pages should not be counted by the page-reclaim algorithm at all,
since in fact they are never affected by page reclaim.
The following patch series tries to fix this, with some additions.
The patch series gives Linux:
1. POSIX mlock/munlock/mlockall/munlockall.
mlock/munlock/mlockall/munlockall now follow the POSIX definition: they
are transaction-like (all-or-nothing), just as described in the mlock(2)
family of man pages. Users of the mlock system-call family therefore
always have a clear map of their mlocked areas (see the first sketch
after this list).
2. More consistent LRU semantics in memory management.
Mlocked pages are placed on a separate LRU list, the wired list. These
pages take no part in the LRU algorithms, since they can never be swapped
out until they are munlocked (see the second sketch after this list).
3. The wired (mlocked) page count is exported through /proc/meminfo.
One line is added to /proc/meminfo, "Wired: N kB", so Linux system
administrators and programmers get a clearer picture of physical memory
usage.
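
To illustrate point 1, here is a minimal user-space sketch (not taken
from the patch) of the all-or-nothing semantic: if mlock() fails, no page
of the range is left locked, so the caller's map of locked areas stays
exact.

	#include <stdio.h>
	#include <stdlib.h>
	#include <string.h>
	#include <sys/mman.h>

	int main(void)
	{
		size_t len = 64UL << 20;	/* 64 MB */
		void *buf;

		/* POSIX requires a page-aligned address. */
		if (posix_memalign(&buf, 4096, len))
			return 1;
		memset(buf, 0, len);		/* fault the pages in */

		if (mlock(buf, len) != 0) {
			/* Transaction-like: the call failed as a whole,
			 * so none of the range is left mlocked. */
			perror("mlock");
			return 1;
		}
		printf("%zu bytes wired\n", len);

		munlock(buf, len);		/* likewise all-or-nothing */
		free(buf);
		return 0;
	}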
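
And for point 2, a hypothetical kernel-side sketch of the wired-list
idea, assuming a new zone->wired_list, a nr_wired counter and a PG_wired
page flag; these names are illustrative only and not necessarily those
used in the patch:

	#include <linux/mm.h>

	/* Move an mlocked page off the Active/Inactive lists so that
	 * reclaim never scans it again; munlock would do the reverse. */
	static void wire_page(struct zone *zone, struct page *page)
	{
		spin_lock_irq(&zone->lru_lock);
		if (PageLRU(page)) {
			if (PageActive(page)) {
				ClearPageActive(page);
				zone->nr_active--;
			} else
				zone->nr_inactive--;
			list_move(&page->lru, &zone->wired_list); /* assumed field */
			SetPageWired(page);			  /* assumed flag */
			zone->nr_wired++;			  /* assumed counter */
		}
		spin_unlock_irq(&zone->lru_lock);
	}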
Test of the patch:
Test environment:
RHEL4.
Total physical memory: 256 MB, no swap.
One ext3 directory ("/mnt/test") with about 256 thousand small
files (2 kB each).
Step 1. Run a task that mlocks 220 MB (a sketch of such a task follows
Step 2).
Step 2. Run: "find /mnt/test -size 100"
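
A minimal sketch of the task used in Step 1 (run it as root, or with
RLIMIT_MEMLOCK raised, since it wires 220 MB):

	#include <stdio.h>
	#include <stdlib.h>
	#include <string.h>
	#include <unistd.h>
	#include <sys/mman.h>

	int main(void)
	{
		size_t len = 220UL << 20;	/* 220 MB */
		char *buf = malloc(len);

		if (!buf)
			return 1;
		memset(buf, 0, len);	/* back the range with real pages */
		if (mlock(buf, len) != 0) {
			perror("mlock");
			return 1;
		}
		pause();	/* keep the pages wired while Step 2 runs */
		return 0;
	}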
Case A. Stock kernel.org kernel 2.6.15
Linux soon hits the OOM killer; memory info at OOM time:
[root@Linux ~]# cat /proc/meminfo
MemTotal: 254248 kB
MemFree: 3144 kB
Buffers: 124 kB
Cached: 1584 kB
SwapCached: 0 kB
Active: 229308 kB
Inactive: 596 kB
HighTotal: 0 kB
HighFree: 0 kB
LowTotal: 254248 kB
LowFree: 3144 kB
SwapTotal: 0 kB
SwapFree: 0 kB
Dirty: 0 kB
Writeback: 0 kB
Mapped: 228556 kB
Slab: 20076 kB
CommitLimit: 127124 kB
Committed_AS: 238424 kB
PageTables: 584 kB
VmallocTotal: 770040 kB
VmallocUsed: 180 kB
VmallocChunk: 769844 kB
HugePages_Total: 0
HugePages_Free: 0
Hugepagesize: 4096 kB
Case B. Patched 2.6.15
No OOM happens; memory info during the test:
[root@Linux ~]# cat /proc/meminfo
MemTotal: 254344 kB
MemFree: 3508 kB
Buffers: 6352 kB
Cached: 2684 kB
SwapCached: 0 kB
Active: 7140 kB
Inactive: 4732 kB
Wired: 225284 kB
HighTotal: 0 kB
HighFree: 0 kB
LowTotal: 254344 kB
LowFree: 3508 kB
SwapTotal: 0 kB
SwapFree: 0 kB
Dirty: 72 kB
Writeback: 0 kB
Mapped: 229208 kB
Slab: 12552 kB
CommitLimit: 127172 kB
Committed_AS: 238168 kB
PageTables: 572 kB
VmallocTotal: 770040 kB
VmallocUsed: 180 kB
VmallocChunk: 769844 kB
HugePages_Total: 0
HugePages_Free: 0
Hugepagesize: 4096 kB
Many thanks to Mel Gorman for his book "Understanding the Linux Virtual
Memory Manager". Thanks also to two other great Linux kernel books: ULK3
and LDD3.
FreeBSD's VM implementation enlightened me; thanks to the FreeBSD folks.
The attachment is the full patch; the mails that follow are its split-up
parts.
Shaoping Wang
Attachment:
patch-2.6.15-memlock
Description: Binary data