Re: [PATCH v4 09/11] mm: pgtable: reclaim empty PTE page in madvise(MADV_DONTNEED)

From: Andrew Morton
Date: Wed Dec 04 2024 - 17:36:51 EST


On Wed, 4 Dec 2024 19:09:49 +0800 Qi Zheng <zhengqi.arch@xxxxxxxxxxxxx> wrote:

> Now in order to pursue high performance, applications mostly use some
> high-performance user-mode memory allocators, such as jemalloc or
> tcmalloc. These memory allocators use madvise(MADV_DONTNEED or MADV_FREE)
> to release physical memory, but neither MADV_DONTNEED nor MADV_FREE will
> release page table memory, which may cause huge page table memory usage.
>
> The following is a memory usage snapshot of one process, taken on one
> of our servers:
>
> VIRT: 55t
> RES: 590g
> VmPTE: 110g
>
> In this case, most of the page table entries are empty. For such a PTE
> page where all entries are empty, we can actually free it back to the
> system for others to use.
>
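
As a sanity check on the snapshot above, the VmPTE figure is about what a
fully populated set of PTE pages would cost for that much virtual address
space. The sketch below assumes x86-64 with 4 KiB pages and 512 entries per
PTE page (so one PTE page maps 2 MiB); neither is stated in the changelog,
and pte_table_bytes() is a hypothetical helper, not a kernel function.

```c
#include <assert.h>
#include <stdint.h>

/* Assumptions (not from the changelog): x86-64, 4 KiB pages. */
#define PAGE_BYTES        4096ULL
#define ENTRIES_PER_TABLE 512ULL

/*
 * Upper bound on PTE page memory for a virtual range, i.e. the cost if
 * every PTE page covering the range has been allocated.
 */
static uint64_t pte_table_bytes(uint64_t virt_bytes)
{
	uint64_t span = PAGE_BYTES * ENTRIES_PER_TABLE;	/* bytes mapped per PTE page */
	uint64_t tables = (virt_bytes + span - 1) / span;	/* round up */

	assert(span == 2ULL << 20);	/* 2 MiB under the assumptions above */
	return tables * PAGE_BYTES;	/* each PTE page occupies one page */
}
```

Under these assumptions, pte_table_bytes(55ULL << 40) is 110 GiB, which
lines up with VIRT: 55t and VmPTE: 110g.
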
> As a first step, this commit aims to synchronously free the empty PTE
> pages in madvise(MADV_DONTNEED) case. We will detect and free empty PTE
> pages in zap_pte_range(), and will add zap_details.reclaim_pt to exclude
> cases other than madvise(MADV_DONTNEED).
>
> Once an empty PTE is detected, we first try to hold the pmd lock within
> the pte lock. If successful, we clear the pmd entry directly (fast path).
> Otherwise, we wait until the pte lock is released, then re-hold the pmd
> and pte locks and loop PTRS_PER_PTE times to check pte_none() to re-detect
> whether the PTE page is empty and free it (slow path).

"wait until the pte lock is released" sounds nasty. I'm not
immediately seeing the code which does this. Please provide more
description?
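
One way to read the two paths in the changelog, sketched as a userspace
analogy rather than the patch's code. It assumes "try to hold" means a
trylock on the outer lock while the inner one is held, and that the slow
path simply drops the inner lock and retakes both in outer-then-inner order.
struct lock, struct table and reclaim_if_empty() are hypothetical names used
only for illustration.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define SLOTS 512	/* stands in for PTRS_PER_PTE */

/* Minimal test-and-set spinlock, just enough to show the ordering. */
struct lock { atomic_flag held; };

static bool lock_try(struct lock *l)
{
	return !atomic_flag_test_and_set_explicit(&l->held, memory_order_acquire);
}

static void lock_take(struct lock *l)
{
	while (!lock_try(l))
		;	/* spin */
}

static void lock_drop(struct lock *l)
{
	atomic_flag_clear_explicit(&l->held, memory_order_release);
}

struct table {
	struct lock outer;		/* plays the role of the pmd lock */
	struct lock inner;		/* plays the role of the pte lock */
	unsigned long slot[SLOTS];	/* 0 means "none" */
	bool linked;			/* true while the upper entry points here */
};

static bool table_empty(const struct table *t)
{
	for (size_t i = 0; i < SLOTS; i++)
		if (t->slot[i] != 0)
			return false;
	return true;
}

/*
 * Called with t->inner held and the table observed empty under that lock.
 * Returns with no locks held; true if the table was unlinked.
 */
static bool reclaim_if_empty(struct table *t)
{
	bool freed = false;

	assert(t != NULL);
	if (lock_try(&t->outer)) {
		/* Fast path: outer lock taken inside the inner lock. */
		t->linked = false;
		lock_drop(&t->outer);
		lock_drop(&t->inner);
		return true;
	}

	/* Slow path: drop inner, retake outer then inner, and re-check. */
	lock_drop(&t->inner);
	lock_take(&t->outer);
	lock_take(&t->inner);
	if (t->linked && table_empty(t)) {
		t->linked = false;
		freed = true;
	}
	lock_drop(&t->inner);
	lock_drop(&t->outer);
	return freed;
}
```

If that reading is right, nothing blocks while waiting on another holder of
the inner lock; the re-scan is needed only because the table may have been
repopulated between dropping and retaking it.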

> For other cases such as madvise(MADV_FREE), consider scanning and freeing
> empty PTE pages asynchronously in the future.
>
> The following code snippet can show the effect of optimization:
>
>         mmap 50G
>         while (1) {
>                 for (; i < 1024 * 25; i++) {
>                         touch 2M memory
>                         madvise MADV_DONTNEED 2M
>                 }
>         }
>
> As we can see, the memory usage of VmPTE is reduced:
>
>                 before          after
> VIRT           50.0 GB        50.0 GB
> RES             3.1 MB         3.1 MB
> VmPTE        102640 KB         240 KB
>
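
For anyone wanting to reproduce the numbers, the pseudocode above can be
approximated by the sketch below. It is a minimal userspace version, not the
author's test program: touch_and_discard() and read_vmpte_kb() are
hypothetical helpers, MAP_NORESERVE is an added assumption so the large
mapping does not depend on overcommit settings, and a single pass is made
because the quoted loop never resets i.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

/* Return VmPTE (in kB) from /proc/self/status, or -1 if it cannot be read. */
static long read_vmpte_kb(void)
{
	FILE *f = fopen("/proc/self/status", "r");
	char line[256];
	long kb = -1;

	if (!f)
		return -1;
	while (fgets(line, sizeof(line), f)) {
		if (sscanf(line, "VmPTE: %ld kB", &kb) == 1)
			break;
	}
	fclose(f);
	return kb;
}

/*
 * Map `total` bytes, then touch and MADV_DONTNEED the mapping one `chunk`
 * at a time. Returns VmPTE in kB measured before unmapping, or -1 on error.
 */
static long touch_and_discard(size_t total, size_t chunk)
{
	char *base;
	long kb;

	assert(chunk > 0 && total % chunk == 0);
	base = mmap(NULL, total, PROT_READ | PROT_WRITE,
		    MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
	if (base == MAP_FAILED)
		return -1;

	for (size_t off = 0; off < total; off += chunk) {
		memset(base + off, 1, chunk);	/* touch 2M memory */
		if (madvise(base + off, chunk, MADV_DONTNEED) != 0) {
			munmap(base, total);
			return -1;
		}
	}

	kb = read_vmpte_kb();
	munmap(base, total);
	return kb;
}
```

Calling touch_and_discard(50UL << 30, 2UL << 20) corresponds to the quoted
test. Assuming x86-64 with 4 KiB pages, 50 GiB in 2 MiB spans is 25600 PTE
pages, or 102400 KB, which is the "before" figure minus the 240 KB that
remains "after".
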
> ...
>
> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -1301,6 +1301,21 @@ config ARCH_HAS_USER_SHADOW_STACK
> 	  The architecture has hardware support for userspace shadow call
> 	  stacks (eg, x86 CET, arm64 GCS or RISC-V Zicfiss).
>
> +config ARCH_SUPPORTS_PT_RECLAIM
> +	def_bool n
> +
> +config PT_RECLAIM
> +	bool "reclaim empty user page table pages"
> +	default y
> +	depends on ARCH_SUPPORTS_PT_RECLAIM && MMU && SMP
> +	select MMU_GATHER_RCU_TABLE_FREE
> +	help
> +	  Try to reclaim empty user page table pages in paths other than munmap
> +	  and exit_mmap path.
> +
> +	  Note: now only empty user PTE page table pages will be reclaimed.
> +

Why is this optional? What is the case for permitting PT_RECLAIM to be
disabled?

> source "mm/damon/Kconfig"
>
> endmenu
>
> ...
>
> +void try_to_free_pte(struct mm_struct *mm, pmd_t *pmd, unsigned long addr,
> +		     struct mmu_gather *tlb)
> +{
> +	pmd_t pmdval;
> +	spinlock_t *pml, *ptl;
> +	pte_t *start_pte, *pte;
> +	int i;
> +
> +	pml = pmd_lock(mm, pmd);
> +	start_pte = pte_offset_map_rw_nolock(mm, pmd, addr, &pmdval, &ptl);
> +	if (!start_pte)
> +		goto out_ptl;
> +	if (ptl != pml)
> +		spin_lock_nested(ptl, SINGLE_DEPTH_NESTING);
> +
> +	/* Check if it is empty PTE page */
> +	for (i = 0, pte = start_pte; i < PTRS_PER_PTE; i++, pte++) {
> +		if (!pte_none(ptep_get(pte)))
> +			goto out_ptl;
> +	}

Are there any worst-case situations in which we'll spend unacceptable
amounts of time running this loop?

> +	pte_unmap(start_pte);
> +
> +	pmd_clear(pmd);
> +
> +	if (ptl != pml)
> +		spin_unlock(ptl);
> +	spin_unlock(pml);
> +
> +	free_pte(mm, addr, tlb, pmdval);
> +
> +	return;
> +out_ptl:
> +	if (start_pte)
> +		pte_unmap_unlock(start_pte, ptl);
> +	if (ptl != pml)
> +		spin_unlock(pml);
> +}
> --
> 2.20.1