Re: [RFC][PATCHSET] mremap/mmap mess

From: Al Viro
Date: Tue Dec 08 2009 - 01:07:12 EST


On Mon, Dec 07, 2009 at 08:05:05PM +0000, Hugh Dickins wrote:

> mm/nommu.c is all about duplicating stuff with variations:
> unsatisfactory, but no reason to go do it differently here.
> Yes, I'll want to revert the util.c mods, but you don't need
> to do so now.

OK... BTW, I think I see how to get rid of the worst of the expand_stack()
mess. Note that conceptually the nastiest part is execve() - there
we have no task_struct matching the mm we are accessing. But let's
take a look at what execve() is doing:
* we create a new mm
* we create a kinda-sorta vma at STACK_TOP_MAX
* we push argv/envp into it via get_user_pages(), populating
  page tables for new mm as we go
* we set personality
* we possibly relocate it down
And all of that - to avoid the limit on the number of pages imposed by the
fixed-size array in bprm.

First of all, that implicitly assumes that this relocation downwards is
rare. And so it is on amd64 and alpha. However, sparc64 and ppc64
have nearly 100% 32bit userland. That's got to hurt, and if the situation
with s390 is anywhere near that, they *really* hurt - we have variable
depth of page table tree there and forcing it up is Not Nice(tm).

Why do we want get_user_pages(), anyway? It's not that we lacked an
easy way to do large arrays, especially since the use is purely sequential.
Even a linked list of vmalloc'ed pages would do just fine (i.e. start with
static array in bprm, keep the pointer to last filled entry + number of
entries left before the next allocation; use the last pointer in array
for finding the next page-sized chunk).
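
A minimal userspace sketch of the chained-array idea described above, purely
for illustration. All names here (struct arg_pages, arg_pages_add(), etc.)
are hypothetical and do not exist in fs/exec.c; calloc() stands in for
whatever allocator the kernel side would use, the first array is heap
allocated rather than static in bprm, and a 4096-byte chunk is assumed.
The last slot of each array links to the next page-sized array, so appending
and walking are both purely sequential.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define CHUNK_BYTES 4096                       /* assumed page size */
#define SLOTS (CHUNK_BYTES / sizeof(void *))   /* pointers per chunk */

/* Hypothetical bookkeeping: chained pointer arrays, last slot = link. */
struct arg_pages {
	void **first;   /* first pointer array */
	void **cur;     /* array currently being filled */
	size_t used;    /* data entries filled in cur */
	size_t total;   /* data pages recorded overall */
};

static int arg_pages_init(struct arg_pages *ap)
{
	ap->first = calloc(SLOTS, sizeof(void *));
	if (!ap->first)
		return -1;
	ap->cur = ap->first;
	ap->used = 0;
	ap->total = 0;
	return 0;
}

/* Append one page pointer; chain a new array when only the link slot is left. */
static int arg_pages_add(struct arg_pages *ap, void *page)
{
	assert(ap && ap->cur);
	if (ap->used == SLOTS - 1) {
		void **next = calloc(SLOTS, sizeof(void *));
		if (!next)
			return -1;
		ap->cur[SLOTS - 1] = next;   /* last slot points at next array */
		ap->cur = next;
		ap->used = 0;
	}
	ap->cur[ap->used++] = page;
	ap->total++;
	return 0;
}

/* Walk the chain from the start and return the n-th recorded page. */
static void *arg_pages_get(const struct arg_pages *ap, size_t n)
{
	void **chunk = ap->first;

	if (n >= ap->total)
		return NULL;
	while (n >= SLOTS - 1) {
		chunk = chunk[SLOTS - 1];
		n -= SLOTS - 1;
	}
	return chunk[n];
}

/* Free the pointer arrays only; the recorded pages belong to the caller. */
static void arg_pages_free(struct arg_pages *ap)
{
	void **chunk = ap->first;

	while (chunk) {
		void **next = chunk[SLOTS - 1];
		free(chunk);
		chunk = next;
	}
	ap->first = ap->cur = NULL;
	ap->used = ap->total = 0;
}
```

In this sketch the only state needed while filling is the current array and
the count of used entries, which matches the "pointer to last filled entry +
number of entries left" bookkeeping described above.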

What do we lose if we go that way? Inserting all these pages into mm
at once shouldn't be slower. Memory overhead is not really an issue
(one page per 511 or 1023 pages of argv). Am I missing something?
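
The 511 and 1023 figures follow from the pointer-array layout, assuming
4096-byte pages: a page holds 512 8-byte pointers or 1024 4-byte pointers,
and one slot in each is spent on the link to the next array. A small sketch
of that arithmetic (function names are hypothetical, for illustration only):

```c
#include <stddef.h>

/* Data pages tracked by one index page: every slot except the link slot. */
static size_t data_pages_per_index_page(size_t page_size, size_t ptr_size)
{
	return page_size / ptr_size - 1;
}

/* Index pages needed to track n data pages, rounded up. */
static size_t index_pages_needed(size_t n, size_t page_size, size_t ptr_size)
{
	size_t per = data_pages_per_index_page(page_size, ptr_size);

	return (n + per - 1) / per;
}
```

With those assumptions the bookkeeping overhead stays below 0.2% of the argv
pages on 64bit and below 0.1% on 32bit.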

The benefit, AFAICS, is that we get rid of the mess with forced high
address use, get *sane* get_user_pages() (we always have matching
task_struct with the right personality, so we can avoid massive PITA
for doing checks right) and we get unified mmu/nommu code in fs/exec.c
out of that.

If you see serious problems I've missed, please tell. Otherwise I'm
going to hack up a prototype and post it here...