On 05.08.24 14:55, Qi Zheng wrote:
Now in order to pursue high performance, applications mostly use some
high-performance user-mode memory allocators, such as jemalloc or
tcmalloc. These memory allocators use madvise(MADV_DONTNEED or MADV_FREE)
to release physical memory, but neither MADV_DONTNEED nor MADV_FREE will
release page table memory, which may cause huge page table memory usage.
The following is a memory usage snapshot of one process, taken from one
of our servers:
VIRT: 55t
RES: 590g
VmPTE: 110g
In this case, most of the page table entries are empty. For such a PTE
page where all entries are empty, we can actually free it back to the
system for others to use.
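As a rough sanity check of these numbers, here is a minimal sketch of the arithmetic. It assumes x86-64 with 4 KiB pages (512 entries per PTE page, so one PTE page maps 2 MiB) and that the "g"/"t" suffixes are binary units; the helper names are made up for illustration and are not kernel APIs.

```c
#include <stdint.h>

/* Assumptions for this sketch: 4 KiB base pages, 512 PTEs per PTE page. */
#define SKETCH_PAGE_SIZE      UINT64_C(4096)
#define SKETCH_PTES_PER_TABLE UINT64_C(512)

/* Hypothetical helper: virtual address space covered by this many bytes
 * of PTE pages, if every PTE page exists for its full 2 MiB span. */
static uint64_t bytes_mapped_by_pte_tables(uint64_t pte_table_bytes)
{
    uint64_t nr_tables = pte_table_bytes / SKETCH_PAGE_SIZE;

    return nr_tables * SKETCH_PTES_PER_TABLE * SKETCH_PAGE_SIZE;
}

/* Hypothetical helper: minimum bytes of PTE pages needed to map
 * mapped_bytes of memory if the mappings were densely packed. */
static uint64_t min_pte_table_bytes_for(uint64_t mapped_bytes)
{
    uint64_t span = SKETCH_PTES_PER_TABLE * SKETCH_PAGE_SIZE; /* 2 MiB */
    uint64_t nr_tables = (mapped_bytes + span - 1) / span;    /* round up */

    return nr_tables * SKETCH_PAGE_SIZE;
}
```

Under those assumptions, 110 GiB of PTE pages is just enough to cover 55 TiB of virtual address space, which matches VIRT, while 590 GiB of densely packed resident memory would need only a bit more than 1 GiB of PTE pages. That is consistent with the claim that most entries are empty.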
As a first step, this commit attempts to synchronously free the empty PTE
pages in zap_page_range_single() (MADV_DONTNEED etc will invoke this). In
order to reduce overhead, we only handle the cases with a high probability
of generating empty PTE pages, and other cases will be filtered out, such
as:
It doesn't make much sense to do this during munmap(), where we will
remove the page tables directly afterwards anyway. We should limit it to
the !munmap case, in particular MADV_DONTNEED.
To minimize the added overhead, I further suggest only trying to reclaim
asynchronously if we know that all PTEs will likely be none, that is,
when we just zapped *all* PTEs of a PTE page table because our range
spans the complete PTE page table.
Just imagine someone zaps a single PTE: we really don't want to start
scanning page tables and involve a (rather expensive) walk_page_range()
just to find out that there is still something mapped.
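To illustrate the "range spans the complete PTE page table" condition, here is a rough userspace sketch, not kernel code. It assumes x86-64 with 4 KiB pages, where one PTE page table maps a 2 MiB aligned region; the macro and function names are hypothetical and only mirror the idea behind PMD_SIZE/PMD_MASK.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumption: one PTE page table maps a 2 MiB aligned region. */
#define SKETCH_PMD_SHIFT 21
#define SKETCH_PMD_SIZE  (UINT64_C(1) << SKETCH_PMD_SHIFT)
#define SKETCH_PMD_MASK  (~(SKETCH_PMD_SIZE - 1))

/*
 * Hypothetical helper: returns true if the zapped range [start, end)
 * covers the whole region mapped by the PTE page table containing addr.
 * Only in that case do we know, without scanning, that all PTEs in the
 * table were just zapped.
 */
static bool zap_covers_whole_pte_table(uint64_t start, uint64_t end,
                                       uint64_t addr)
{
    uint64_t table_start = addr & SKETCH_PMD_MASK;

    assert(start <= end);

    /* Written as a subtraction so that table_start + size cannot overflow. */
    return start <= table_start &&
           end > table_start &&
           end - table_start >= SKETCH_PMD_SIZE;
}
```

With a check like this, zapping a single 4 KiB page would never trigger a reclaim attempt, while a MADV_DONTNEED over a large, aligned range would trigger one per fully covered PTE page table.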
Last but not least, would there be a way to avoid the walk_page_range() and simply trigger it from zap_pte_range(), possibly still while holding the PTE table lock?
We might have to trylock the PMD, but that should be doable.