Re: [PATCH] mm: disable `vm.max_map_count' sysctl limit

From: Michal Hocko
Date: Mon Nov 27 2017 - 05:12:40 EST


On Sun 26-11-17 17:09:32, Mikael Pettersson wrote:
> The `vm.max_map_count' sysctl limit is IMO useless and confusing, so
> this patch disables it.
>
> - Old ELF had a limit of 64K segments, making core dumps from processes
> with more mappings than that problematic, but that was fixed in March
> 2010 ("elf coredump: add extended numbering support").
>
> - There are no internal data structures sized by this limit, making it
> entirely artificial.

each mapping has its vma structure and that in turn can be tracked by
other data structures so this is not entirely true.

> - When viewed as a limit on memory consumption, it is ineffective since
> the number of mappings does not correspond directly to the amount of
> memory consumed, since each mapping is variable-length.
>
> - Reaching the limit causes various memory management system calls to
> fail with ENOMEM, which is a lie. Combined with the unpredictability
> of the number of mappings in a process, especially when non-trivial
> memory management or heavy file mapping is used, it can be difficult
> to reproduce these events and debug them. It's also confusing to get
> ENOMEM when you know you have lots of free RAM.
>
> This limit was apparently introduced in the 2.1.80 kernel (first as a
> compile-time constant, later changed to a sysctl), but I haven't been
> able to find any description for it in Git or the LKML archives, so I
> don't know what the original motivation was.
>
> I've kept the kernel tunable to not break the API towards user-space,
> but it's a no-op now. Also the distinction between split_vma() and
> __split_vma() disappears, so they are merged.

Could you be more explicit about _why_ we need to remove this tunable?
I am not saying I disagree, the removal simplifies the code but I do not
really see any justification here.

> Tested on x86_64 with Fedora 26 user-space. Also built an ARM NOMMU
> kernel to make sure NOMMU compiles and links cleanly.
>
> Signed-off-by: Mikael Pettersson <mikpelinux@xxxxxxxxx>
> ---
> Documentation/sysctl/vm.txt | 17 +-------------
> Documentation/vm/ksm.txt | 4 ----
> Documentation/vm/remap_file_pages.txt | 4 ----
> fs/binfmt_elf.c | 4 ----
> include/linux/mm.h | 23 -------------------
> kernel/sysctl.c | 3 +++
> mm/madvise.c | 12 ++--------
> mm/mmap.c | 42 ++++++-----------------------------
> mm/mremap.c | 7 ------
> mm/nommu.c | 3 ---
> mm/util.c | 1 -
> 11 files changed, 13 insertions(+), 107 deletions(-)
>
> diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt
> index b920423f88cb..0fcb511d07e6 100644
> --- a/Documentation/sysctl/vm.txt
> +++ b/Documentation/sysctl/vm.txt
> @@ -35,7 +35,7 @@ Currently, these files are in /proc/sys/vm:
> - laptop_mode
> - legacy_va_layout
> - lowmem_reserve_ratio
> -- max_map_count
> +- max_map_count (unused, kept for backwards compatibility)
> - memory_failure_early_kill
> - memory_failure_recovery
> - min_free_kbytes
> @@ -400,21 +400,6 @@ The minimum value is 1 (1/1 -> 100%).
>
> ==============================================================
>
> -max_map_count:
> -
> -This file contains the maximum number of memory map areas a process
> -may have. Memory map areas are used as a side-effect of calling
> -malloc, directly by mmap, mprotect, and madvise, and also when loading
> -shared libraries.
> -
> -While most applications need less than a thousand maps, certain
> -programs, particularly malloc debuggers, may consume lots of them,
> -e.g., up to one or two maps per allocation.
> -
> -The default value is 65536.
> -
> -=============================================================
> -
> memory_failure_early_kill:
>
> Control how to kill processes when uncorrected memory error (typically
> diff --git a/Documentation/vm/ksm.txt b/Documentation/vm/ksm.txt
> index 6686bd267dc9..4a917f88cb11 100644
> --- a/Documentation/vm/ksm.txt
> +++ b/Documentation/vm/ksm.txt
> @@ -38,10 +38,6 @@ the range for whenever the KSM daemon is started; even if the range
> cannot contain any pages which KSM could actually merge; even if
> MADV_UNMERGEABLE is applied to a range which was never MADV_MERGEABLE.
>
> -If a region of memory must be split into at least one new MADV_MERGEABLE
> -or MADV_UNMERGEABLE region, the madvise may return ENOMEM if the process
> -will exceed vm.max_map_count (see Documentation/sysctl/vm.txt).
> -
> Like other madvise calls, they are intended for use on mapped areas of
> the user address space: they will report ENOMEM if the specified range
> includes unmapped gaps (though working on the intervening mapped areas),
> diff --git a/Documentation/vm/remap_file_pages.txt b/Documentation/vm/remap_file_pages.txt
> index f609142f406a..85985a89f05d 100644
> --- a/Documentation/vm/remap_file_pages.txt
> +++ b/Documentation/vm/remap_file_pages.txt
> @@ -21,7 +21,3 @@ systems are widely available.
> The syscall is deprecated and replaced it with an emulation now. The
> emulation creates new VMAs instead of nonlinear mappings. It's going to
> work slower for rare users of remap_file_pages() but ABI is preserved.
> -
> -One side effect of emulation (apart from performance) is that user can hit
> -vm.max_map_count limit more easily due to additional VMAs. See comment for
> -DEFAULT_MAX_MAP_COUNT for more details on the limit.
> diff --git a/fs/binfmt_elf.c b/fs/binfmt_elf.c
> index 83732fef510d..8e870b6e4ad9 100644
> --- a/fs/binfmt_elf.c
> +++ b/fs/binfmt_elf.c
> @@ -2227,10 +2227,6 @@ static int elf_core_dump(struct coredump_params *cprm)
> elf = kmalloc(sizeof(*elf), GFP_KERNEL);
> if (!elf)
> goto out;
> - /*
> - * The number of segs are recored into ELF header as 16bit value.
> - * Please check DEFAULT_MAX_MAP_COUNT definition when you modify here.
> - */
> segs = current->mm->map_count;
> segs += elf_core_extra_phdrs();
>
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index ee073146aaa7..cf545264eb8b 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -104,27 +104,6 @@ extern int mmap_rnd_compat_bits __read_mostly;
> #define mm_zero_struct_page(pp) ((void)memset((pp), 0, sizeof(struct page)))
> #endif
>
> -/*
> - * Default maximum number of active map areas, this limits the number of vmas
> - * per mm struct. Users can overwrite this number by sysctl but there is a
> - * problem.
> - *
> - * When a program's coredump is generated as ELF format, a section is created
> - * per a vma. In ELF, the number of sections is represented in unsigned short.
> - * This means the number of sections should be smaller than 65535 at coredump.
> - * Because the kernel adds some informative sections to a image of program at
> - * generating coredump, we need some margin. The number of extra sections is
> - * 1-3 now and depends on arch. We use "5" as safe margin, here.
> - *
> - * ELF extended numbering allows more than 65535 sections, so 16-bit bound is
> - * not a hard limit any more. Although some userspace tools can be surprised by
> - * that.
> - */
> -#define MAPCOUNT_ELF_CORE_MARGIN (5)
> -#define DEFAULT_MAX_MAP_COUNT (USHRT_MAX - MAPCOUNT_ELF_CORE_MARGIN)
> -
> -extern int sysctl_max_map_count;
> -
> extern unsigned long sysctl_user_reserve_kbytes;
> extern unsigned long sysctl_admin_reserve_kbytes;
>
> @@ -2134,8 +2113,6 @@ extern struct vm_area_struct *vma_merge(struct mm_struct *,
> unsigned long vm_flags, struct anon_vma *, struct file *, pgoff_t,
> struct mempolicy *, struct vm_userfaultfd_ctx);
> extern struct anon_vma *find_mergeable_anon_vma(struct vm_area_struct *);
> -extern int __split_vma(struct mm_struct *, struct vm_area_struct *,
> - unsigned long addr, int new_below);
> extern int split_vma(struct mm_struct *, struct vm_area_struct *,
> unsigned long addr, int new_below);
> extern int insert_vm_struct(struct mm_struct *, struct vm_area_struct *);
> diff --git a/kernel/sysctl.c b/kernel/sysctl.c
> index 557d46728577..caced68ff0d0 100644
> --- a/kernel/sysctl.c
> +++ b/kernel/sysctl.c
> @@ -110,6 +110,9 @@ extern int pid_max_min, pid_max_max;
> extern int percpu_pagelist_fraction;
> extern int latencytop_enabled;
> extern unsigned int sysctl_nr_open_min, sysctl_nr_open_max;
> +#ifdef CONFIG_MMU
> +static int sysctl_max_map_count = 65530; /* obsolete, kept for backwards compatibility */
> +#endif
> #ifndef CONFIG_MMU
> extern int sysctl_nr_trim_pages;
> #endif
> diff --git a/mm/madvise.c b/mm/madvise.c
> index 375cf32087e4..f63834f59ca7 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -147,11 +147,7 @@ static long madvise_behavior(struct vm_area_struct *vma,
> *prev = vma;
>
> if (start != vma->vm_start) {
> - if (unlikely(mm->map_count >= sysctl_max_map_count)) {
> - error = -ENOMEM;
> - goto out;
> - }
> - error = __split_vma(mm, vma, start, 1);
> + error = split_vma(mm, vma, start, 1);
> if (error) {
> /*
> * madvise() returns EAGAIN if kernel resources, such as
> @@ -164,11 +160,7 @@ static long madvise_behavior(struct vm_area_struct *vma,
> }
>
> if (end != vma->vm_end) {
> - if (unlikely(mm->map_count >= sysctl_max_map_count)) {
> - error = -ENOMEM;
> - goto out;
> - }
> - error = __split_vma(mm, vma, end, 0);
> + error = split_vma(mm, vma, end, 0);
> if (error) {
> /*
> * madvise() returns EAGAIN if kernel resources, such as
> diff --git a/mm/mmap.c b/mm/mmap.c
> index 924839fac0e6..e821d9c4395d 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -1354,10 +1354,6 @@ unsigned long do_mmap(struct file *file, unsigned long addr,
> if ((pgoff + (len >> PAGE_SHIFT)) < pgoff)
> return -EOVERFLOW;
>
> - /* Too many mappings? */
> - if (mm->map_count > sysctl_max_map_count)
> - return -ENOMEM;
> -
> /* Obtain the address to map to. we verify (or select) it and ensure
> * that it represents a valid section of the address space.
> */
> @@ -2546,11 +2542,11 @@ detach_vmas_to_be_unmapped(struct mm_struct *mm, struct vm_area_struct *vma,
> }
>
> /*
> - * __split_vma() bypasses sysctl_max_map_count checking. We use this where it
> - * has already been checked or doesn't make sense to fail.
> + * Split a vma into two pieces at address 'addr', a new vma is allocated
> + * either for the first part or the tail.
> */
> -int __split_vma(struct mm_struct *mm, struct vm_area_struct *vma,
> - unsigned long addr, int new_below)
> +int split_vma(struct mm_struct *mm, struct vm_area_struct *vma,
> + unsigned long addr, int new_below)
> {
> struct vm_area_struct *new;
> int err;
> @@ -2612,19 +2608,6 @@ int __split_vma(struct mm_struct *mm, struct vm_area_struct *vma,
> return err;
> }
>
> -/*
> - * Split a vma into two pieces at address 'addr', a new vma is allocated
> - * either for the first part or the tail.
> - */
> -int split_vma(struct mm_struct *mm, struct vm_area_struct *vma,
> - unsigned long addr, int new_below)
> -{
> - if (mm->map_count >= sysctl_max_map_count)
> - return -ENOMEM;
> -
> - return __split_vma(mm, vma, addr, new_below);
> -}
> -
> /* Munmap is split into 2 main parts -- this part which finds
> * what needs doing, and the areas themselves, which do the
> * work. This now handles partial unmappings.
> @@ -2665,15 +2648,7 @@ int do_munmap(struct mm_struct *mm, unsigned long start, size_t len,
> if (start > vma->vm_start) {
> int error;
>
> - /*
> - * Make sure that map_count on return from munmap() will
> - * not exceed its limit; but let map_count go just above
> - * its limit temporarily, to help free resources as expected.
> - */
> - if (end < vma->vm_end && mm->map_count >= sysctl_max_map_count)
> - return -ENOMEM;
> -
> - error = __split_vma(mm, vma, start, 0);
> + error = split_vma(mm, vma, start, 0);
> if (error)
> return error;
> prev = vma;
> @@ -2682,7 +2657,7 @@ int do_munmap(struct mm_struct *mm, unsigned long start, size_t len,
> /* Does it split the last one? */
> last = find_vma(mm, end);
> if (last && end > last->vm_start) {
> - int error = __split_vma(mm, last, end, 1);
> + int error = split_vma(mm, last, end, 1);
> if (error)
> return error;
> }
> @@ -2694,7 +2669,7 @@ int do_munmap(struct mm_struct *mm, unsigned long start, size_t len,
> * will remain splitted, but userland will get a
> * highly unexpected error anyway. This is no
> * different than the case where the first of the two
> - * __split_vma fails, but we don't undo the first
> + * split_vma fails, but we don't undo the first
> * split, despite we could. This is unlikely enough
> * failure that it's not worth optimizing it for.
> */
> @@ -2915,9 +2890,6 @@ static int do_brk_flags(unsigned long addr, unsigned long request, unsigned long
> if (!may_expand_vm(mm, flags, len >> PAGE_SHIFT))
> return -ENOMEM;
>
> - if (mm->map_count > sysctl_max_map_count)
> - return -ENOMEM;
> -
> if (security_vm_enough_memory_mm(mm, len >> PAGE_SHIFT))
> return -ENOMEM;
>
> diff --git a/mm/mremap.c b/mm/mremap.c
> index 049470aa1e3e..5544dd3e6e10 100644
> --- a/mm/mremap.c
> +++ b/mm/mremap.c
> @@ -278,13 +278,6 @@ static unsigned long move_vma(struct vm_area_struct *vma,
> bool need_rmap_locks;
>
> /*
> - * We'd prefer to avoid failure later on in do_munmap:
> - * which may split one vma into three before unmapping.
> - */
> - if (mm->map_count >= sysctl_max_map_count - 3)
> - return -ENOMEM;
> -
> - /*
> * Advise KSM to break any KSM pages in the area to be moved:
> * it would be confusing if they were to turn up at the new
> * location, where they happen to coincide with different KSM
> diff --git a/mm/nommu.c b/mm/nommu.c
> index 17c00d93de2e..0f6d37be4797 100644
> --- a/mm/nommu.c
> +++ b/mm/nommu.c
> @@ -1487,9 +1487,6 @@ int split_vma(struct mm_struct *mm, struct vm_area_struct *vma,
> if (vma->vm_file)
> return -ENOMEM;
>
> - if (mm->map_count >= sysctl_max_map_count)
> - return -ENOMEM;
> -
> region = kmem_cache_alloc(vm_region_jar, GFP_KERNEL);
> if (!region)
> return -ENOMEM;
> diff --git a/mm/util.c b/mm/util.c
> index 34e57fae959d..7e757686f186 100644
> --- a/mm/util.c
> +++ b/mm/util.c
> @@ -516,7 +516,6 @@ EXPORT_SYMBOL_GPL(__page_mapcount);
> int sysctl_overcommit_memory __read_mostly = OVERCOMMIT_GUESS;
> int sysctl_overcommit_ratio __read_mostly = 50;
> unsigned long sysctl_overcommit_kbytes __read_mostly;
> -int sysctl_max_map_count __read_mostly = DEFAULT_MAX_MAP_COUNT;
> unsigned long sysctl_user_reserve_kbytes __read_mostly = 1UL << 17; /* 128MB */
> unsigned long sysctl_admin_reserve_kbytes __read_mostly = 1UL << 13; /* 8MB */
>
> --
> 2.13.6
>
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@xxxxxxxxxx For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@xxxxxxxxx";> email@xxxxxxxxx </a>

--
Michal Hocko
SUSE Labs