Re: [PATCH v2 1/1] mm/madvise: add MADV_F_COLLAPSE_LIGHT to process_madvise()

From: Michal Hocko
Date: Thu Jan 18 2024 - 08:29:07 EST


[CC linux-api]

On Thu 18-01-24 20:03:46, Lance Yang wrote:
> This idea was inspired by MADV_COLLAPSE introduced by Zach O'Keefe[1].
>
> Allow MADV_F_COLLAPSE_LIGHT behavior for process_madvise(2) if the caller
> has CAP_SYS_ADMIN or is requesting the collapse of its own memory.
>
> The semantics of MADV_F_COLLAPSE_LIGHT are similar to MADV_COLLAPSE, but
> it avoids direct reclaim and/or compaction, quickly failing on allocation
> errors.
>
> This change enables more flexible and efficient use of memory collapse
> operations, giving userspace applications additional control over
> system-wide THP optimization.
>
> Semantics
>
> This call is independent of the system-wide THP sysfs settings, but will
> fail for memory marked VM_NOHUGEPAGE. If the ranges provided span
> multiple VMAs, the semantics of the collapse over each VMA are independent
> of the others. This implies a hugepage cannot cross a VMA boundary. If
> collapse of a given hugepage-aligned/sized region fails, the operation may
> continue to attempt collapsing the remainder of memory specified.
>
> The memory ranges provided must be page-aligned, but are not required to
> be hugepage-aligned. If the memory ranges are not hugepage-aligned, the
> start/end of the range will be clamped to the first/last hugepage-aligned
> address covered by said range. The memory ranges must span at least one
> hugepage-sized region.
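
To make the clamping rule above concrete for linux-api readers, here is a minimal userspace sketch of the arithmetic as described: round the start up and the end down to a hugepage boundary. The 2 MiB size and the sketch_* names are assumptions for illustration only; this is not the kernel implementation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed hugepage size: 2 MiB, the usual PMD size on x86-64 with 4 KiB pages. */
#define SKETCH_HPAGE_SIZE ((uintptr_t)2 << 20)

/*
 * Hypothetical helper: clamp [start, end) inward to hugepage boundaries by
 * rounding start up and end down. Returns false when no full
 * hugepage-sized region remains inside the range.
 */
static bool sketch_clamp_range(uintptr_t start, uintptr_t end,
			       uintptr_t *hstart, uintptr_t *hend)
{
	const uintptr_t mask = ~(SKETCH_HPAGE_SIZE - 1);

	/* The mask arithmetic only holds for a power-of-two size. */
	assert((SKETCH_HPAGE_SIZE & (SKETCH_HPAGE_SIZE - 1)) == 0);

	*hstart = (start + SKETCH_HPAGE_SIZE - 1) & mask;
	*hend = end & mask;
	return *hstart < *hend;
}
```

Under these assumptions, a page-aligned range 0x201000..0x600000 clamps to 0x400000..0x600000, while 0x201000..0x5ff000 contains no full hugepage-sized region.
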
>
> All non-resident pages covered by the range will first be
> swapped/faulted-in, before being internally copied onto a freshly
> allocated hugepage. Unmapped pages will have their data directly
> initialized to 0 in the new hugepage. However, for every eligible
> hugepage-aligned/sized region to be collapsed, at least one page must
> currently be backed by memory (a PMD covering the address range must
> already exist).
>
> Allocation for the new hugepage will not enter direct reclaim and/or
> compaction, quickly failing if allocation fails. When the system has
> multiple NUMA nodes, the hugepage will be allocated from the node providing
> the most native pages. This call operates on the current state of the
> specified process and makes no persistent changes or guarantees on how pages
> will be mapped, constructed, or faulted in the future.
>
> Use Cases
>
> An immediate user of this new functionality is the Go runtime heap allocator
> that manages memory in hugepage-sized chunks. In the past, whether it was a
> newly allocated chunk through mmap() or a reused chunk released by
> madvise(MADV_DONTNEED), the allocator attempted to eagerly back memory with
> huge pages using madvise(MADV_HUGEPAGE)[2] and madvise(MADV_COLLAPSE)[3]
> respectively. However, both approaches resulted in performance issues; for
> both scenarios, there could be entries into direct reclaim and/or compaction,
> leading to unpredictable stalls[4]. Now, the allocator can confidently use
> process_madvise(MADV_F_COLLAPSE_LIGHT) to attempt the allocation of huge pages.
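
For reference, a caller following the use case above might look roughly like the sketch below. It assumes the value 26 proposed in this patch (not present in any released uapi header), uses raw syscalls rather than a libc wrapper, and the helper name is hypothetical. On kernels without the patch the call is expected to fail with EINVAL.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <stddef.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Assumption: value proposed by this patch; not in released headers. */
#ifndef MADV_F_COLLAPSE_LIGHT
#define MADV_F_COLLAPSE_LIGHT 26
#endif

/*
 * Hypothetical helper: request a best-effort collapse of
 * [addr, addr + len) in the calling process. Returns 0 on success,
 * or -1 with errno set.
 */
static int sketch_collapse_light(void *addr, size_t len)
{
	struct iovec iov = { .iov_base = addr, .iov_len = len };
	int saved_errno;
	int pidfd;
	long ret;

	/* process_madvise(2) takes a pidfd, so open one for ourselves. */
	pidfd = (int)syscall(SYS_pidfd_open, getpid(), 0);
	if (pidfd < 0)
		return -1;

	ret = syscall(SYS_process_madvise, pidfd, &iov, 1,
		      MADV_F_COLLAPSE_LIGHT, 0);

	/* Preserve errno from the advice call across close(). */
	saved_errno = errno;
	close(pidfd);
	errno = saved_errno;

	return ret < 0 ? -1 : 0;
}
```

A runtime allocator would treat any failure here as "no huge page right now" and carry on with base pages, since the point of the light variant is to fail quickly rather than stall.
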
>
> [1] https://github.com/torvalds/linux/commit/7d8faaf155454f8798ec56404faca29a82689c77
> [2] https://github.com/golang/go/commit/8fa9e3beee8b0e6baa7333740996181268b60a3a
> [3] https://github.com/golang/go/commit/9f9bb26880388c5bead158e9eca3be4b3a9bd2af
> [4] https://github.com/golang/go/issues/63334
>
> [v1] https://lore.kernel.org/lkml/20240117050217.43610-1-ioworker0@xxxxxxxxx/
>
> Signed-off-by: Lance Yang <ioworker0@xxxxxxxxx>
> Suggested-by: Zach O'Keefe <zokeefe@xxxxxxxxxx>
> Suggested-by: David Hildenbrand <david@xxxxxxxxxx>
> ---
> V1 -> V2: Treat process_madvise(MADV_F_COLLAPSE_LIGHT) as the lighter-weight alternative
> to madvise(MADV_COLLAPSE)
>
> arch/alpha/include/uapi/asm/mman.h | 1 +
> arch/mips/include/uapi/asm/mman.h | 1 +
> arch/parisc/include/uapi/asm/mman.h | 1 +
> arch/xtensa/include/uapi/asm/mman.h | 1 +
> include/linux/huge_mm.h | 5 +--
> include/uapi/asm-generic/mman-common.h | 1 +
> mm/khugepaged.c | 15 ++++++--
> mm/madvise.c | 36 +++++++++++++++++---
> tools/include/uapi/asm-generic/mman-common.h | 1 +
> 9 files changed, 52 insertions(+), 10 deletions(-)
>
> diff --git a/arch/alpha/include/uapi/asm/mman.h b/arch/alpha/include/uapi/asm/mman.h
> index 763929e814e9..22f23ca04f1a 100644
> --- a/arch/alpha/include/uapi/asm/mman.h
> +++ b/arch/alpha/include/uapi/asm/mman.h
> @@ -77,6 +77,7 @@
> #define MADV_DONTNEED_LOCKED 24 /* like DONTNEED, but drop locked pages too */
>
> #define MADV_COLLAPSE 25 /* Synchronous hugepage collapse */
> +#define MADV_F_COLLAPSE_LIGHT 26 /* Similar to COLLAPSE, but avoids direct reclaim and/or compaction */
>
> /* compatibility flags */
> #define MAP_FILE 0
> diff --git a/arch/mips/include/uapi/asm/mman.h b/arch/mips/include/uapi/asm/mman.h
> index c6e1fc77c996..acec0b643e9c 100644
> --- a/arch/mips/include/uapi/asm/mman.h
> +++ b/arch/mips/include/uapi/asm/mman.h
> @@ -104,6 +104,7 @@
> #define MADV_DONTNEED_LOCKED 24 /* like DONTNEED, but drop locked pages too */
>
> #define MADV_COLLAPSE 25 /* Synchronous hugepage collapse */
> +#define MADV_F_COLLAPSE_LIGHT 26 /* Similar to COLLAPSE, but avoids direct reclaim and/or compaction */
>
> /* compatibility flags */
> #define MAP_FILE 0
> diff --git a/arch/parisc/include/uapi/asm/mman.h b/arch/parisc/include/uapi/asm/mman.h
> index 68c44f99bc93..812029c98cd7 100644
> --- a/arch/parisc/include/uapi/asm/mman.h
> +++ b/arch/parisc/include/uapi/asm/mman.h
> @@ -71,6 +71,7 @@
> #define MADV_DONTNEED_LOCKED 24 /* like DONTNEED, but drop locked pages too */
>
> #define MADV_COLLAPSE 25 /* Synchronous hugepage collapse */
> +#define MADV_F_COLLAPSE_LIGHT 26 /* Similar to COLLAPSE, but avoids direct reclaim and/or compaction */
>
> #define MADV_HWPOISON 100 /* poison a page for testing */
> #define MADV_SOFT_OFFLINE 101 /* soft offline page for testing */
> diff --git a/arch/xtensa/include/uapi/asm/mman.h b/arch/xtensa/include/uapi/asm/mman.h
> index 1ff0c858544f..52ef463dd5b6 100644
> --- a/arch/xtensa/include/uapi/asm/mman.h
> +++ b/arch/xtensa/include/uapi/asm/mman.h
> @@ -112,6 +112,7 @@
> #define MADV_DONTNEED_LOCKED 24 /* like DONTNEED, but drop locked pages too */
>
> #define MADV_COLLAPSE 25 /* Synchronous hugepage collapse */
> +#define MADV_F_COLLAPSE_LIGHT 26 /* Similar to COLLAPSE, but avoids direct reclaim and/or compaction */
>
> /* compatibility flags */
> #define MAP_FILE 0
> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> index 5adb86af35fc..075fdb5d481a 100644
> --- a/include/linux/huge_mm.h
> +++ b/include/linux/huge_mm.h
> @@ -303,7 +303,7 @@ int hugepage_madvise(struct vm_area_struct *vma, unsigned long *vm_flags,
> int advice);
> int madvise_collapse(struct vm_area_struct *vma,
> struct vm_area_struct **prev,
> - unsigned long start, unsigned long end);
> + unsigned long start, unsigned long end, int behavior);
> void vma_adjust_trans_huge(struct vm_area_struct *vma, unsigned long start,
> unsigned long end, long adjust_next);
> spinlock_t *__pmd_trans_huge_lock(pmd_t *pmd, struct vm_area_struct *vma);
> @@ -450,7 +450,8 @@ static inline int hugepage_madvise(struct vm_area_struct *vma,
>
> static inline int madvise_collapse(struct vm_area_struct *vma,
> struct vm_area_struct **prev,
> - unsigned long start, unsigned long end)
> + unsigned long start, unsigned long end,
> + int behavior)
> {
> return -EINVAL;
> }
> diff --git a/include/uapi/asm-generic/mman-common.h b/include/uapi/asm-generic/mman-common.h
> index 6ce1f1ceb432..92c67bc755da 100644
> --- a/include/uapi/asm-generic/mman-common.h
> +++ b/include/uapi/asm-generic/mman-common.h
> @@ -78,6 +78,7 @@
> #define MADV_DONTNEED_LOCKED 24 /* like DONTNEED, but drop locked pages too */
>
> #define MADV_COLLAPSE 25 /* Synchronous hugepage collapse */
> +#define MADV_F_COLLAPSE_LIGHT 26 /* Similar to COLLAPSE, but avoids direct reclaim and/or compaction */
>
> /* compatibility flags */
> #define MAP_FILE 0
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index 2b219acb528e..2840051c0ae2 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -97,6 +97,8 @@ static struct kmem_cache *mm_slot_cache __ro_after_init;
> struct collapse_control {
> bool is_khugepaged;
>
> + int behavior;
> +
> /* Num pages scanned per node */
> u32 node_load[MAX_NUMNODES];
>
> @@ -1058,10 +1060,16 @@ static int __collapse_huge_page_swapin(struct mm_struct *mm,
>  static int alloc_charge_hpage(struct page **hpage, struct mm_struct *mm,
>  			      struct collapse_control *cc)
>  {
> -	gfp_t gfp = (cc->is_khugepaged ? alloc_hugepage_khugepaged_gfpmask() :
> -		     GFP_TRANSHUGE);
>  	int node = hpage_collapse_find_target_node(cc);
>  	struct folio *folio;
> +	gfp_t gfp;
> +
> +	if (cc->is_khugepaged)
> +		gfp = alloc_hugepage_khugepaged_gfpmask();
> +	else
> +		gfp = (cc->behavior == MADV_F_COLLAPSE_LIGHT ?
> +		       GFP_TRANSHUGE_LIGHT :
> +		       GFP_TRANSHUGE);
>  
>  	if (!hpage_collapse_alloc_folio(&folio, gfp, node, &cc->alloc_nmask)) {
>  		*hpage = NULL;
> @@ -2697,7 +2705,7 @@ static int madvise_collapse_errno(enum scan_result r)
> }
>
> int madvise_collapse(struct vm_area_struct *vma, struct vm_area_struct **prev,
> - unsigned long start, unsigned long end)
> + unsigned long start, unsigned long end, int behavior)
> {
> struct collapse_control *cc;
> struct mm_struct *mm = vma->vm_mm;
> @@ -2718,6 +2726,7 @@ int madvise_collapse(struct vm_area_struct *vma, struct vm_area_struct **prev,
> if (!cc)
> return -ENOMEM;
> cc->is_khugepaged = false;
> + cc->behavior = behavior;
>
> mmgrab(mm);
> lru_add_drain_all();
> diff --git a/mm/madvise.c b/mm/madvise.c
> index 912155a94ed5..9c40226505aa 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -60,6 +60,7 @@ static int madvise_need_mmap_write(int behavior)
> case MADV_POPULATE_READ:
> case MADV_POPULATE_WRITE:
> case MADV_COLLAPSE:
> + case MADV_F_COLLAPSE_LIGHT:
> return 0;
> default:
> /* be safe, default to 1. list exceptions explicitly */
> @@ -1082,8 +1083,9 @@ static int madvise_vma_behavior(struct vm_area_struct *vma,
> if (error)
> goto out;
> break;
> + case MADV_F_COLLAPSE_LIGHT:
> case MADV_COLLAPSE:
> - return madvise_collapse(vma, prev, start, end);
> + return madvise_collapse(vma, prev, start, end, behavior);
> }
>
> anon_name = anon_vma_name(vma);
> @@ -1178,6 +1180,7 @@ madvise_behavior_valid(int behavior)
> case MADV_HUGEPAGE:
> case MADV_NOHUGEPAGE:
> case MADV_COLLAPSE:
> + case MADV_F_COLLAPSE_LIGHT:
> #endif
> case MADV_DONTDUMP:
> case MADV_DODUMP:
> @@ -1194,6 +1197,17 @@ madvise_behavior_valid(int behavior)
> }
> }
>
> +
> +static bool process_madvise_behavior_only(int behavior)
> +{
> +	switch (behavior) {
> +	case MADV_F_COLLAPSE_LIGHT:
> +		return true;
> +	default:
> +		return false;
> +	}
> +}
> +
> static bool process_madvise_behavior_valid(int behavior)
> {
> switch (behavior) {
> @@ -1201,6 +1215,7 @@ static bool process_madvise_behavior_valid(int behavior)
> case MADV_PAGEOUT:
> case MADV_WILLNEED:
> case MADV_COLLAPSE:
> + case MADV_F_COLLAPSE_LIGHT:
> return true;
> default:
> return false;
> @@ -1368,6 +1383,8 @@ int madvise_set_anon_name(struct mm_struct *mm, unsigned long start,
> * transparent huge pages so the existing pages will not be
> * coalesced into THP and new pages will not be allocated as THP.
> * MADV_COLLAPSE - synchronously coalesce pages into new THP.
> + * MADV_F_COLLAPSE_LIGHT - only for process_madvise, avoids direct reclaim and/or
> + * compaction.
> * MADV_DONTDUMP - the application wants to prevent pages in the given range
> * from being included in its core dump.
> * MADV_DODUMP - cancel MADV_DONTDUMP: no longer exclude from core dump.
> @@ -1394,7 +1411,8 @@ int madvise_set_anon_name(struct mm_struct *mm, unsigned long start,
> * -EBADF - map exists, but area maps something that isn't a file.
> * -EAGAIN - a kernel resource was temporarily unavailable.
> */
> -int do_madvise(struct mm_struct *mm, unsigned long start, size_t len_in, int behavior)
> +int _do_madvise(struct mm_struct *mm, unsigned long start, size_t len_in,
> + int behavior, bool is_process_madvise)
> {
> unsigned long end;
> int error;
> @@ -1405,6 +1423,9 @@ int do_madvise(struct mm_struct *mm, unsigned long start, size_t len_in, int beh
> if (!madvise_behavior_valid(behavior))
> return -EINVAL;
>
> + if (!is_process_madvise && process_madvise_behavior_only(behavior))
> + return -EINVAL;
> +
> if (!PAGE_ALIGNED(start))
> return -EINVAL;
> len = PAGE_ALIGN(len_in);
> @@ -1448,9 +1469,14 @@ int do_madvise(struct mm_struct *mm, unsigned long start, size_t len_in, int beh
> return error;
> }
>
> +int do_madvise(struct mm_struct *mm, unsigned long start, size_t len_in, int behavior)
> +{
> + return _do_madvise(mm, start, len_in, behavior, false);
> +}
> +
> SYSCALL_DEFINE3(madvise, unsigned long, start, size_t, len_in, int, behavior)
> {
> - return do_madvise(current->mm, start, len_in, behavior);
> + return _do_madvise(current->mm, start, len_in, behavior, false);
> }
>
> SYSCALL_DEFINE5(process_madvise, int, pidfd, const struct iovec __user *, vec,
> @@ -1504,8 +1530,8 @@ SYSCALL_DEFINE5(process_madvise, int, pidfd, const struct iovec __user *, vec,
> total_len = iov_iter_count(&iter);
>
> while (iov_iter_count(&iter)) {
> - ret = do_madvise(mm, (unsigned long)iter_iov_addr(&iter),
> - iter_iov_len(&iter), behavior);
> + ret = _do_madvise(mm, (unsigned long)iter_iov_addr(&iter),
> + iter_iov_len(&iter), behavior, true);
> if (ret < 0)
> break;
> iov_iter_advance(&iter, iter_iov_len(&iter));
> diff --git a/tools/include/uapi/asm-generic/mman-common.h b/tools/include/uapi/asm-generic/mman-common.h
> index 6ce1f1ceb432..92c67bc755da 100644
> --- a/tools/include/uapi/asm-generic/mman-common.h
> +++ b/tools/include/uapi/asm-generic/mman-common.h
> @@ -78,6 +78,7 @@
> #define MADV_DONTNEED_LOCKED 24 /* like DONTNEED, but drop locked pages too */
>
> #define MADV_COLLAPSE 25 /* Synchronous hugepage collapse */
> +#define MADV_F_COLLAPSE_LIGHT 26 /* Similar to COLLAPSE, but avoids direct reclaim and/or compaction */
>
> /* compatibility flags */
> #define MAP_FILE 0
> --
> 2.33.1

--
Michal Hocko
SUSE Labs