Re: [PATCH] fix vmaccnt at fork (Was Re: "heuristic overcommit"and fork())

From: Mel Gorman
Date: Mon Feb 16 2009 - 09:32:43 EST


On Fri, Feb 13, 2009 at 10:36:55AM +0900, KAMEZAWA Hiroyuki wrote:
> On Wed, 11 Feb 2009 20:26:32 +0100
> "David CHAMPELOVIER" <david@xxxxxxxxxxxxxxxx> wrote:
>
> > Hi,
> >
> > Recently, I was unable to fork() a 38 GB process on a system with 64 GB RAM
> > and no swap.
> > Having a look at the kernel source, I surprisingly found that in "heuristic
> > overcommit" mode, fork() always checks that there is enough memory to
> > duplicate process memory.
> >
> > As far as I know, overcommit was introduced in the kernel for several
> > reasons, and fork() was one of them, since applications often exec() just
> > after fork(). I know fork() is not the most judicious choice in this case,
> > but well, this is the way many applications are written.
> >
> > Moreover, I can read in the proc man page that in "heuristic overcommit
> > mode", "obvious overcommits of address space are refused". I do not think
> > fork() is an obvious overcommit, that's why I would expect fork() to be
> > always accepted in this mode.
> >
> > So, is there a reason why fork() checks for available memory in "heuristic
> > mode" ?
> >
>
> fork() is used for duplicate process and it means to duplicate memory space.
> Because of Copy-On-Write, the page will not be used acutally. But, it's not
> different from mmap() case.

Pretty much. At fork() time, you cannot know if the process is going to
exec or not.

> In that case, overcommit_guess compares
> requested size and size of free memory for all that we use demand paging.
> So, the behavior is not surprizing. For notifing the kernel can assume
> exec-is-called-after-fork, we may need some flags or paramater.
>

There already is one of sorts. Use madvise(MADV_DONTFORK) on the large
memory regions so they don't get copied. If that doesn't work, it means
the accounting for the VMAs is being done in the wrong order.

> But, hmm.., there is something strange, following. Mel, how do you think ?
> ==
>
> From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@xxxxxxxxxxxxxx>
>
> Vm accounting at fork() should use the same logic as mmap().
>

This alters semantics in a fairly subtle manner and I think would break
counters as well.

accountable_mapping() is used to determine if VM_ACCOUNT is set or
not. Once set, it gets accounted after that. even if the overcommit settings
change. Somewhat weirdly, the overcommit decisions at the time of mmap()
are reused at fork() even if the overcommit settings change. This is odd
behaviour and arguably your patch could fix this anomoly. However, it
would make more sense to me to recalculate if VM_ACCOUNT should be set
or not rather than what you do here.

I think this patch also has subtle breakage. We could do something like;

1. mmap(), VM_ACCOUNT not set due to overcommit settings
2. overcommit set to strict
3. fork()
4. check flags, note that VM_ACCOUNT would have been used, account but
VM_ACCOUNT is still not set
5. child exits, VM_ACCOUNT not set so reserves are not given back

So reserves can constantly go up and never down. It would require root but
a bad program could eventually push the reserves up to the size of physical
memory without any of the memory actually being used.


> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@xxxxxxxxxxxxxx>
> ---
> include/linux/mm.h | 2 ++
> kernel/fork.c | 3 ++-
> 2 files changed, 4 insertions(+), 1 deletion(-)
>
> Index: mmotm-2.6.29-Feb11/kernel/fork.c
> ===================================================================
> --- mmotm-2.6.29-Feb11.orig/kernel/fork.c
> +++ mmotm-2.6.29-Feb11/kernel/fork.c
> @@ -301,7 +301,8 @@ static int dup_mmap(struct mm_struct *mm
> continue;
> }
> charge = 0;
> - if (mpnt->vm_flags & VM_ACCOUNT) {
> + if (accountable_mapping(mpnt->vm_file, mpnt->vm_flags) &&
> + mpnt->vm_flags & VM_ACCOUNT) {
> unsigned int len = (mpnt->vm_end - mpnt->vm_start) >> PAGE_SHIFT;
> if (security_vm_enough_memory(len))
> goto fail_nomem;
> Index: mmotm-2.6.29-Feb11/include/linux/mm.h
> ===================================================================
> --- mmotm-2.6.29-Feb11.orig/include/linux/mm.h
> +++ mmotm-2.6.29-Feb11/include/linux/mm.h
> @@ -1047,6 +1047,8 @@ extern void free_bootmem_with_active_reg
> typedef int (*work_fn_t)(unsigned long, unsigned long, void *);
> extern void work_with_active_regions(int nid, work_fn_t work_fn, void *data);
> extern void sparse_memory_present_with_active_regions(int nid);
> +extern int accountable_mapping(struct file *file, unsigned int vmflags);
> +

accountable_mapping() is a static inline in mm/mmap.c so I'd be
surprised if this compiles.

> #endif /* CONFIG_ARCH_POPULATES_NODE_MAP */
>
> #if !defined(CONFIG_ARCH_POPULATES_NODE_MAP) && \
>
>

NAK.

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/