[PATCH] mm: disable `vm.max_map_count' sysctl limit

From: Mikael Pettersson
Date: Sun Nov 26 2017 - 11:09:43 EST


The `vm.max_map_count' sysctl limit is IMO useless and confusing, so
this patch disables it.

- Old ELF stored the segment count in a 16-bit field, making core
dumps from processes with more than 64K mappings problematic, but
that was fixed in March 2010 ("elf coredump: add extended numbering
support").

- There are no internal data structures sized by this limit, making it
entirely artificial.

- When viewed as a limit on memory consumption, it is ineffective:
because each mapping is variable-length, the number of mappings does
not correspond to the amount of memory consumed.

- Reaching the limit causes various memory management system calls to
fail with ENOMEM, which is a lie. Given how unpredictable a process's
mapping count is, especially when non-trivial memory management or
heavy file mapping is involved, these failures can be difficult to
reproduce and debug. It's also confusing to get ENOMEM when you know
you have lots of free RAM.

This limit was apparently introduced in the 2.1.80 kernel (first as a
compile-time constant, later changed to a sysctl), but I haven't been
able to find any description of it in the Git or LKML archives, so I
don't know what the original motivation was.

I've kept the sysctl tunable so as not to break the user-space API,
but it is now a no-op. The distinction between split_vma() and
__split_vma() also disappears, so they are merged.

Tested on x86_64 with Fedora 26 user-space. Also built an ARM NOMMU
kernel to make sure NOMMU compiles and links cleanly.

Signed-off-by: Mikael Pettersson <mikpelinux@xxxxxxxxx>
---
Documentation/sysctl/vm.txt | 17 +-------------
Documentation/vm/ksm.txt | 4 ----
Documentation/vm/remap_file_pages.txt | 4 ----
fs/binfmt_elf.c | 4 ----
include/linux/mm.h | 23 -------------------
kernel/sysctl.c | 3 +++
mm/madvise.c | 12 ++--------
mm/mmap.c | 42 ++++++-----------------------------
mm/mremap.c | 7 ------
mm/nommu.c | 3 ---
mm/util.c | 1 -
11 files changed, 13 insertions(+), 107 deletions(-)

diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt
index b920423f88cb..0fcb511d07e6 100644
--- a/Documentation/sysctl/vm.txt
+++ b/Documentation/sysctl/vm.txt
@@ -35,7 +35,7 @@ Currently, these files are in /proc/sys/vm:
- laptop_mode
- legacy_va_layout
- lowmem_reserve_ratio
-- max_map_count
+- max_map_count (unused, kept for backwards compatibility)
- memory_failure_early_kill
- memory_failure_recovery
- min_free_kbytes
@@ -400,21 +400,6 @@ The minimum value is 1 (1/1 -> 100%).

==============================================================

-max_map_count:
-
-This file contains the maximum number of memory map areas a process
-may have. Memory map areas are used as a side-effect of calling
-malloc, directly by mmap, mprotect, and madvise, and also when loading
-shared libraries.
-
-While most applications need less than a thousand maps, certain
-programs, particularly malloc debuggers, may consume lots of them,
-e.g., up to one or two maps per allocation.
-
-The default value is 65536.
-
-=============================================================
-
memory_failure_early_kill:

Control how to kill processes when uncorrected memory error (typically
diff --git a/Documentation/vm/ksm.txt b/Documentation/vm/ksm.txt
index 6686bd267dc9..4a917f88cb11 100644
--- a/Documentation/vm/ksm.txt
+++ b/Documentation/vm/ksm.txt
@@ -38,10 +38,6 @@ the range for whenever the KSM daemon is started; even if the range
cannot contain any pages which KSM could actually merge; even if
MADV_UNMERGEABLE is applied to a range which was never MADV_MERGEABLE.

-If a region of memory must be split into at least one new MADV_MERGEABLE
-or MADV_UNMERGEABLE region, the madvise may return ENOMEM if the process
-will exceed vm.max_map_count (see Documentation/sysctl/vm.txt).
-
Like other madvise calls, they are intended for use on mapped areas of
the user address space: they will report ENOMEM if the specified range
includes unmapped gaps (though working on the intervening mapped areas),
diff --git a/Documentation/vm/remap_file_pages.txt b/Documentation/vm/remap_file_pages.txt
index f609142f406a..85985a89f05d 100644
--- a/Documentation/vm/remap_file_pages.txt
+++ b/Documentation/vm/remap_file_pages.txt
@@ -21,7 +21,3 @@ systems are widely available.
The syscall is deprecated and replaced it with an emulation now. The
emulation creates new VMAs instead of nonlinear mappings. It's going to
work slower for rare users of remap_file_pages() but ABI is preserved.
-
-One side effect of emulation (apart from performance) is that user can hit
-vm.max_map_count limit more easily due to additional VMAs. See comment for
-DEFAULT_MAX_MAP_COUNT for more details on the limit.
diff --git a/fs/binfmt_elf.c b/fs/binfmt_elf.c
index 83732fef510d..8e870b6e4ad9 100644
--- a/fs/binfmt_elf.c
+++ b/fs/binfmt_elf.c
@@ -2227,10 +2227,6 @@ static int elf_core_dump(struct coredump_params *cprm)
elf = kmalloc(sizeof(*elf), GFP_KERNEL);
if (!elf)
goto out;
- /*
- * The number of segs are recored into ELF header as 16bit value.
- * Please check DEFAULT_MAX_MAP_COUNT definition when you modify here.
- */
segs = current->mm->map_count;
segs += elf_core_extra_phdrs();

diff --git a/include/linux/mm.h b/include/linux/mm.h
index ee073146aaa7..cf545264eb8b 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -104,27 +104,6 @@ extern int mmap_rnd_compat_bits __read_mostly;
#define mm_zero_struct_page(pp) ((void)memset((pp), 0, sizeof(struct page)))
#endif

-/*
- * Default maximum number of active map areas, this limits the number of vmas
- * per mm struct. Users can overwrite this number by sysctl but there is a
- * problem.
- *
- * When a program's coredump is generated as ELF format, a section is created
- * per a vma. In ELF, the number of sections is represented in unsigned short.
- * This means the number of sections should be smaller than 65535 at coredump.
- * Because the kernel adds some informative sections to a image of program at
- * generating coredump, we need some margin. The number of extra sections is
- * 1-3 now and depends on arch. We use "5" as safe margin, here.
- *
- * ELF extended numbering allows more than 65535 sections, so 16-bit bound is
- * not a hard limit any more. Although some userspace tools can be surprised by
- * that.
- */
-#define MAPCOUNT_ELF_CORE_MARGIN (5)
-#define DEFAULT_MAX_MAP_COUNT (USHRT_MAX - MAPCOUNT_ELF_CORE_MARGIN)
-
-extern int sysctl_max_map_count;
-
extern unsigned long sysctl_user_reserve_kbytes;
extern unsigned long sysctl_admin_reserve_kbytes;

@@ -2134,8 +2113,6 @@ extern struct vm_area_struct *vma_merge(struct mm_struct *,
unsigned long vm_flags, struct anon_vma *, struct file *, pgoff_t,
struct mempolicy *, struct vm_userfaultfd_ctx);
extern struct anon_vma *find_mergeable_anon_vma(struct vm_area_struct *);
-extern int __split_vma(struct mm_struct *, struct vm_area_struct *,
- unsigned long addr, int new_below);
extern int split_vma(struct mm_struct *, struct vm_area_struct *,
unsigned long addr, int new_below);
extern int insert_vm_struct(struct mm_struct *, struct vm_area_struct *);
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 557d46728577..caced68ff0d0 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -110,6 +110,9 @@ extern int pid_max_min, pid_max_max;
extern int percpu_pagelist_fraction;
extern int latencytop_enabled;
extern unsigned int sysctl_nr_open_min, sysctl_nr_open_max;
+#ifdef CONFIG_MMU
+static int sysctl_max_map_count = 65530; /* obsolete, kept for backwards compatibility */
+#endif
#ifndef CONFIG_MMU
extern int sysctl_nr_trim_pages;
#endif
diff --git a/mm/madvise.c b/mm/madvise.c
index 375cf32087e4..f63834f59ca7 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -147,11 +147,7 @@ static long madvise_behavior(struct vm_area_struct *vma,
*prev = vma;

if (start != vma->vm_start) {
- if (unlikely(mm->map_count >= sysctl_max_map_count)) {
- error = -ENOMEM;
- goto out;
- }
- error = __split_vma(mm, vma, start, 1);
+ error = split_vma(mm, vma, start, 1);
if (error) {
/*
* madvise() returns EAGAIN if kernel resources, such as
@@ -164,11 +160,7 @@ static long madvise_behavior(struct vm_area_struct *vma,
}

if (end != vma->vm_end) {
- if (unlikely(mm->map_count >= sysctl_max_map_count)) {
- error = -ENOMEM;
- goto out;
- }
- error = __split_vma(mm, vma, end, 0);
+ error = split_vma(mm, vma, end, 0);
if (error) {
/*
* madvise() returns EAGAIN if kernel resources, such as
diff --git a/mm/mmap.c b/mm/mmap.c
index 924839fac0e6..e821d9c4395d 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1354,10 +1354,6 @@ unsigned long do_mmap(struct file *file, unsigned long addr,
if ((pgoff + (len >> PAGE_SHIFT)) < pgoff)
return -EOVERFLOW;

- /* Too many mappings? */
- if (mm->map_count > sysctl_max_map_count)
- return -ENOMEM;
-
/* Obtain the address to map to. we verify (or select) it and ensure
* that it represents a valid section of the address space.
*/
@@ -2546,11 +2542,11 @@ detach_vmas_to_be_unmapped(struct mm_struct *mm, struct vm_area_struct *vma,
}

/*
- * __split_vma() bypasses sysctl_max_map_count checking. We use this where it
- * has already been checked or doesn't make sense to fail.
+ * Split a vma into two pieces at address 'addr', a new vma is allocated
+ * either for the first part or the tail.
*/
-int __split_vma(struct mm_struct *mm, struct vm_area_struct *vma,
- unsigned long addr, int new_below)
+int split_vma(struct mm_struct *mm, struct vm_area_struct *vma,
+ unsigned long addr, int new_below)
{
struct vm_area_struct *new;
int err;
@@ -2612,19 +2608,6 @@ int __split_vma(struct mm_struct *mm, struct vm_area_struct *vma,
return err;
}

-/*
- * Split a vma into two pieces at address 'addr', a new vma is allocated
- * either for the first part or the tail.
- */
-int split_vma(struct mm_struct *mm, struct vm_area_struct *vma,
- unsigned long addr, int new_below)
-{
- if (mm->map_count >= sysctl_max_map_count)
- return -ENOMEM;
-
- return __split_vma(mm, vma, addr, new_below);
-}
-
/* Munmap is split into 2 main parts -- this part which finds
* what needs doing, and the areas themselves, which do the
* work. This now handles partial unmappings.
@@ -2665,15 +2648,7 @@ int do_munmap(struct mm_struct *mm, unsigned long start, size_t len,
if (start > vma->vm_start) {
int error;

- /*
- * Make sure that map_count on return from munmap() will
- * not exceed its limit; but let map_count go just above
- * its limit temporarily, to help free resources as expected.
- */
- if (end < vma->vm_end && mm->map_count >= sysctl_max_map_count)
- return -ENOMEM;
-
- error = __split_vma(mm, vma, start, 0);
+ error = split_vma(mm, vma, start, 0);
if (error)
return error;
prev = vma;
@@ -2682,7 +2657,7 @@ int do_munmap(struct mm_struct *mm, unsigned long start, size_t len,
/* Does it split the last one? */
last = find_vma(mm, end);
if (last && end > last->vm_start) {
- int error = __split_vma(mm, last, end, 1);
+ int error = split_vma(mm, last, end, 1);
if (error)
return error;
}
@@ -2694,7 +2669,7 @@ int do_munmap(struct mm_struct *mm, unsigned long start, size_t len,
* will remain splitted, but userland will get a
* highly unexpected error anyway. This is no
* different than the case where the first of the two
- * __split_vma fails, but we don't undo the first
+ * split_vma fails, but we don't undo the first
* split, despite we could. This is unlikely enough
* failure that it's not worth optimizing it for.
*/
@@ -2915,9 +2890,6 @@ static int do_brk_flags(unsigned long addr, unsigned long request, unsigned long
if (!may_expand_vm(mm, flags, len >> PAGE_SHIFT))
return -ENOMEM;

- if (mm->map_count > sysctl_max_map_count)
- return -ENOMEM;
-
if (security_vm_enough_memory_mm(mm, len >> PAGE_SHIFT))
return -ENOMEM;

diff --git a/mm/mremap.c b/mm/mremap.c
index 049470aa1e3e..5544dd3e6e10 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -278,13 +278,6 @@ static unsigned long move_vma(struct vm_area_struct *vma,
bool need_rmap_locks;

/*
- * We'd prefer to avoid failure later on in do_munmap:
- * which may split one vma into three before unmapping.
- */
- if (mm->map_count >= sysctl_max_map_count - 3)
- return -ENOMEM;
-
- /*
* Advise KSM to break any KSM pages in the area to be moved:
* it would be confusing if they were to turn up at the new
* location, where they happen to coincide with different KSM
diff --git a/mm/nommu.c b/mm/nommu.c
index 17c00d93de2e..0f6d37be4797 100644
--- a/mm/nommu.c
+++ b/mm/nommu.c
@@ -1487,9 +1487,6 @@ int split_vma(struct mm_struct *mm, struct vm_area_struct *vma,
if (vma->vm_file)
return -ENOMEM;

- if (mm->map_count >= sysctl_max_map_count)
- return -ENOMEM;
-
region = kmem_cache_alloc(vm_region_jar, GFP_KERNEL);
if (!region)
return -ENOMEM;
diff --git a/mm/util.c b/mm/util.c
index 34e57fae959d..7e757686f186 100644
--- a/mm/util.c
+++ b/mm/util.c
@@ -516,7 +516,6 @@ EXPORT_SYMBOL_GPL(__page_mapcount);
int sysctl_overcommit_memory __read_mostly = OVERCOMMIT_GUESS;
int sysctl_overcommit_ratio __read_mostly = 50;
unsigned long sysctl_overcommit_kbytes __read_mostly;
-int sysctl_max_map_count __read_mostly = DEFAULT_MAX_MAP_COUNT;
unsigned long sysctl_user_reserve_kbytes __read_mostly = 1UL << 17; /* 128MB */
unsigned long sysctl_admin_reserve_kbytes __read_mostly = 1UL << 13; /* 8MB */

--
2.13.6