RSS calculation in the 3.X Kernel

From: David Barto
Date: Fri Mar 25 2016 - 14:49:13 EST


Hello,
First I want to point out that I am not a Linux kernel developer; however, I have done kernel development on Berkeley Unix (4.x) in the distant past.

What I'm trying to discover in the Linux kernel is how the RSS is calculated in the 3.x kernels. I know that the current release is in the 4.x series; however, I must work with what our customers want to use, not what I would prefer.

The kernel mailing list has excellent coverage of adding more reporting of HugeTLB page values to the 4.x kernel, and that is an interesting read. I was attempting to use it as a way to discover the RSS calculations in the 3.x kernel, but it didn't get me far.

The problem:
I have a program that uses lots of data, literally as much as physical RAM. I need to load this data in a way that lets me detect when I'm running out of RAM, so I know when to push the 'stop loading' button; likewise, when executing scans of this data, I need to know when I've allocated too much working memory and, again, push the 'stop' button.

The program only uses mmap/mprotect/munmap/madvise to manage memory. It preallocates a very large amount of virtual address space using mmap as unbacked memory and then backs the memory on an as-needed basis. The program traps all calls to malloc/calloc/realloc, as well as both forms of operator new, along with the associated free/delete routines; all memory allocation is redirected into mmap operations.
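
To make the scheme concrete, here is a minimal sketch of the style of interposition I mean; the helper names (reserve_arena, arena_back) are invented for this email, not my real code:

    #include <sys/mman.h>
    #include <stddef.h>

    /* Reserve a large range of address space with no backing store.
     * PROT_NONE + MAP_NORESERVE commits no physical pages. */
    static void *reserve_arena(size_t bytes)
    {
        void *p = mmap(NULL, bytes, PROT_NONE,
                       MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
        return (p == MAP_FAILED) ? NULL : p;
    }

    /* Back part of the reservation on demand; the trapped
     * malloc/calloc/realloc/operator new paths all funnel through here. */
    static int arena_back(void *addr, size_t bytes)
    {
        return mprotect(addr, bytes, PROT_READ | PROT_WRITE);
    }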

When running, I can't afford to spend time reading a file (/proc/pid/statm) to see if memory is full; I need to know at the time of allocation that I'm done. In other words, I need to know whether the Linux OOM killer will shoot me down for oversubscription on the next call to allocate more memory.
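
For reference, this is the sort of check that is too slow on the allocation path; a sketch reading the resident-page count (the second field of /proc/self/statm, reported in pages) and scaling by the page size:

    #include <stdio.h>
    #include <unistd.h>

    /* Return the resident set size in bytes, or -1 on error. */
    static long rss_bytes(void)
    {
        long size_pages, resident_pages;
        FILE *f = fopen("/proc/self/statm", "r");
        if (!f)
            return -1;
        if (fscanf(f, "%ld %ld", &size_pages, &resident_pages) != 2) {
            fclose(f);
            return -1;
        }
        fclose(f);
        return resident_pages * sysconf(_SC_PAGESIZE);
    }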

Since I'm trapping all calls to memory allocation, including allocation through the C and C++ libraries, I don't understand why the kernel is reporting a larger RSS than I think I should have. If I believe I've allocated 120GB, the kernel will report that my RSS is over 160GB, and this discrepancy grows larger as I load more data. I'm not getting an error from mprotect when I attempt to back more memory than the system supports, which would be an acceptable OOM signal to my program; I would expect ENOMEM if the required memory could not be mapped, but instead I get hit by the OOM killer. If anyone would like a program that demonstrates this, I have one.

An interesting data point from that program: after mapping 10GB of RAM and subsequently unmapping it, my RSS has increased from 1.5MB to 2.4MB. I need to understand this kind of 'behind the scenes' allocation charged to my program.
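
A stripped-down version of that demonstration looks roughly like this (a sketch of the idea, not the exact program); map and touch a large anonymous region, unmap it, and compare VmRSS from /proc/self/status before and after:

    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>

    /* Print the VmRSS line from /proc/self/status. */
    static void print_rss(const char *tag)
    {
        char line[128];
        FILE *f = fopen("/proc/self/status", "r");
        while (f && fgets(line, sizeof(line), f))
            if (!strncmp(line, "VmRSS:", 6))
                printf("%s %s", tag, line);
        if (f)
            fclose(f);
    }

    int main(void)
    {
        size_t len = 10UL << 30;   /* 10GB; assumes a 64-bit build */
        print_rss("before:");
        char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (p == MAP_FAILED)
            return 1;
        for (size_t i = 0; i < len; i += 4096)   /* touch every page */
            p[i] = 1;
        munmap(p, len);
        print_rss("after: ");
        return 0;
    }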

To this end I'm appealing to the Linux kernel developers for a helping hint (or three) toward understanding the accounting of RSS in the 3.x kernel. I don't need a complete walkthrough, just a 'look here' kind of thing. I've been through mm/mmap.c and mm/memory.c and I'm having no luck putting the pieces together.

I know that the reporting is held in the mm_rss_stat structure and is initialized in init_rss_vec and updated by inline functions in mm.h.
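
For anyone following along, the pieces I'm looking at are roughly these (paraphrased from my reading of the 3.x sources, include/linux/mm_types.h and include/linux/mm.h, so treat it as my reading rather than a verified quote):

    /* include/linux/mm_types.h (3.x): the per-mm RSS counters. */
    enum {
        MM_FILEPAGES,    /* resident file-backed pages */
        MM_ANONPAGES,    /* resident anonymous pages */
        MM_SWAPENTS,     /* anonymous pages swapped out */
        NR_MM_COUNTERS
    };

    struct mm_rss_stat {
        atomic_long_t count[NR_MM_COUNTERS];
    };

    /* include/linux/mm.h (3.x): one of the inline updaters. */
    static inline void add_mm_counter(struct mm_struct *mm,
                                      int member, long value)
    {
        atomic_long_add(value, &mm->rss_stat.count[member]);
    }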

When I walk through unmap_page_range I can see where zap_pte_range is eventually called, and that in turn calls add_mm_rss_vec to update the various mm counters.
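
The updater itself is small; as I read mm/memory.c in 3.x it is essentially this (again paraphrased, and the exact body may differ between 3.x releases):

    /* mm/memory.c (3.x): flush a locally batched rss[] vector into
     * the mm-wide counters once a page-table walk is finished. */
    static void add_mm_rss_vec(struct mm_struct *mm, int *rss)
    {
        int i;

        if (current->mm == mm)
            sync_mm_rss(mm);   /* fold per-task cached deltas first */
        for (i = 0; i < NR_MM_COUNTERS; i++)
            if (rss[i])
                add_mm_counter(mm, i, rss[i]);
    }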

When mapping, I can see a call to sys_mmap_pgoff from sys_x86_64.c, but I can't find any definition of sys_mmap_pgoff in the kernel files. I do see a __SYSCALL(192, sys_mmap_pgoff) and a __SYSCALL(80, sys_mmap_pgoff, 6).
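
My best guess (please correct me if I'm wrong) is that the definition hides behind the SYSCALL_DEFINE macros, which construct the sys_ name at preprocessing time, so grepping for the literal 'sys_mmap_pgoff' finds only callers and syscall tables. In mm/mmap.c I believe it looks like:

    /* mm/mmap.c (3.x): this expands to the definition of
     * sys_mmap_pgoff(); grep for "SYSCALL_DEFINE6(mmap_pgoff"
     * rather than the sys_ name. */
    SYSCALL_DEFINE6(mmap_pgoff, unsigned long, addr, unsigned long, len,
                    unsigned long, prot, unsigned long, flags,
                    unsigned long, fd, unsigned long, pgoff)
    {
        /* ... validates flags, resolves the fd if any, then calls
         * vm_mmap_pgoff() to create the mapping ... */
    }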

What else could modify the RSS of a running process? I'm not creating any new threads and I'm not forking the program. I'm just loading data (read from file, convert to internal format, mmap some space, write to memory) for later use, and that is causing me grief, as the kernel's idea of my RSS far exceeds mine.

I'm not on the Linux Developers mailing list, so please CC me in any reply.

Thanks for your time and consideration,

David Barto
barto@xxxxxxxxxxxxxxxxxxxxx