Re: [PATCHv3 33/33] mm, x86: introduce PR_SET_MAX_VADDR and PR_GET_MAX_VADDR

From: Linus Torvalds
Date: Fri Feb 17 2017 - 15:02:32 EST


On Fri, Feb 17, 2017 at 6:13 AM, Kirill A. Shutemov
<kirill.shutemov@xxxxxxxxxxxxxxx> wrote:
> This patch introduces two new prctl(2) handles to manage maximum virtual
> address available to userspace to map.

So this is my least favorite patch of the whole series, for a couple of reasons:

(a) adding new code, and mixing it with the mindless TASK_SIZE ->
get_max_addr() conversion.

(b) what's the point of that whole TASK_SIZE vs get_max_addr() thing?
When do you use one, and when the other?

So I think this patch needs a lot more thought and/or explanation.

Honestly, (a) is a no-brainer, and can be fixed by just splitting the
patch up. But I think (b) is more fundamental.

In particular, I think that get_max_addr() thing is badly defined.
When should you use TASK_SIZE, when should you use TASK_SIZE_MAX, and
when should you use get_max_addr()? I don't find that clear at all,
and I think that needs to be a whole lot more explicit and documented.

I also get the feeling that the whole thing is unnecessary. I'm
wondering if we should just instead say that the whole 47 vs 56-bit
virtual address is _purely_ about "get_unmapped_area()", and nothing
else.

IOW, I'm wondering if we can't just say that

- if the processor and kernel support 56-bit user address space, then
you can *always* use the whole space

- but by default, get_unmapped_area() will only return mappings that
fit in the 47 bit address space.

So if you use MAP_FIXED and give an address in the high range, it will
just always work, and the MM will always consider the task size to be
the full address space.

But for the common case where a process does not use MAP_FIXED, the
kernel will never give a high address by default, and you have to do
the process control thing to say "I want those high addresses".

Hmm?
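
To make the suggested semantics concrete, here is a minimal standalone
model in plain C. It is a sketch, not kernel code: the names
LEGACY_USER_TOP, FULL_USER_TOP, fixed_range_ok() and search_ceiling()
are made up for illustration, and the limits are simplified to exact
powers of two (real limits would likely sit slightly below them).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical limits: simplified tops of the two user address-space models. */
#define LEGACY_USER_TOP (UINT64_C(1) << 47)	/* 47-bit window */
#define FULL_USER_TOP   (UINT64_C(1) << 56)	/* 56-bit window */

static_assert(LEGACY_USER_TOP < FULL_USER_TOP,
	      "the legacy window must be the smaller one");

/*
 * MAP_FIXED case: the range only has to fit in the full address space.
 * Whether the process asked for high addresses plays no role here.
 */
static bool fixed_range_ok(uint64_t addr, uint64_t len)
{
	return len != 0 && len <= FULL_USER_TOP && addr <= FULL_USER_TOP - len;
}

/*
 * Non-MAP_FIXED case: the only thing the opt-in changes is the ceiling
 * below which the allocator looks for a free range.
 */
static uint64_t search_ceiling(bool wants_high)
{
	return wants_high ? FULL_USER_TOP : LEGACY_USER_TOP;
}
```

In this model the per-process opt-in is a single boolean that only the
allocation path looks at; everything else sees one task size.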

In other words, I'd like to at least start out trying to keep the
differences between the 47-bit and 56-bit models as simple and minimal
as possible. Not make such a big deal out of it.

We already have "arch_get_unmapped_area()" that controls the whole
"what will non-MAP_FIXED mmap allocations return", so I'd hope that
the above kind of semantics could be done without *any* actual
TASK_SIZE changes _anywhere_ in the VM code.
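
As an illustration of why only the allocation path would need to know
about the two models, here is a toy top-down gap search in userspace C.
It is a sketch under stated assumptions, not the kernel's
arch_get_unmapped_area(): struct toy_vma, TOY_FLOOR and
toy_get_unmapped_area() are hypothetical, and the mappings are assumed
to be sorted, non-overlapping, and at or above TOY_FLOOR.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_FLOOR UINT64_C(0x10000)	/* hypothetical lowest mappable address */

/* One existing mapping covering [start, end). */
struct toy_vma {
	uint64_t start;
	uint64_t end;
};

/*
 * Top-down search for a free range of len bytes that ends at or below
 * ceiling. vmas must be sorted by address and must not overlap.
 * Returns the start of the range, or 0 if nothing fits.
 */
static uint64_t toy_get_unmapped_area(const struct toy_vma *vmas, size_t n,
				      uint64_t len, uint64_t ceiling)
{
	uint64_t gap_end = ceiling;

	assert(ceiling > TOY_FLOOR);
	if (len == 0)
		return 0;

	for (size_t i = n; i-- > 0; ) {
		if (vmas[i].start >= gap_end)
			continue;	/* mapping lies above the window: ignore it */
		if (vmas[i].end <= gap_end && gap_end - vmas[i].end >= len)
			return gap_end - len;	/* gap above this mapping fits */
		gap_end = vmas[i].start;	/* keep looking below this mapping */
	}

	/* Last candidate: the gap between TOY_FLOOR and the lowest mapping. */
	if (gap_end >= TOY_FLOOR && gap_end - TOY_FLOOR >= len)
		return gap_end - len;
	return 0;
}
```

A mapping placed high up with MAP_FIXED is simply skipped when the
ceiling is the 47-bit limit, so the default search still returns a low
address. Passing the larger ceiling is the only change needed to hand
out high addresses.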

Comments?

Linus