Re: [PATCH] procfs: Add mem_end to /proc/<pid>/stat

From: Andy Lutomirski
Date: Fri Nov 04 2016 - 10:22:14 EST


On Fri, Nov 4, 2016 at 6:14 AM, Christopher Covington
<cov@xxxxxxxxxxxxxx> wrote:
> Applications such as Just-In-Time (JIT) compilers, Checkpoint/Restore In
> Userspace (CRIU), and User Mode Linux (UML) need to know the highest
> virtual address, TASK_SIZE, to implement pointer tagging or make a first
> educated guess at where to find a large, unused region of memory.
> Unfortunately the currently available mechanisms for determining TASK_SIZE
> are either convoluted and potentially error-prone, such as making repeated
> munmap() calls and checking the return code,

Oh boy -- if you do this you are just asking to segfault.

> or make use of hard-coded
> assumptions that limit an application's portability across kernels with
> different Kconfig options and multiple architectures.
>
> Therefore, expose TASK_SIZE to userspace. While PAGE_SIZE is exposed to
> userspace via an auxiliary vector, that approach is not used for TASK_SIZE
> in case run-time alterations to the usable virtual address range are one
> day implemented, such as through an extension to prctl(PR_SET_MM) or a flag
> to clone. There is no prctl(PR_GET_MM). Instead such information is
> expected to come from /proc/<pid>/stat[m]. For the same extendability
> reason, use a per-pid proc entry rather than a system-wide entry like
> /proc/sys/vm/mmap_min_addr.

First, this should be in status, not stat, but that's moot because
TASK_SIZE is nonsensical as a task property on x86. And, as was
nicely covered yesterday at LPC, we already have too much of a mess in
/proc where per-mm properties are mixed up with per-task properties.
Can we make a point of not adding any new mm-related things to files
that are about the task?

But also, NAK for x86 if you look at TASK_SIZE:

TASK_SIZE is a mess and needs to go away completely -- only
TASK_SIZE_MAX makes any sense. If you want to ask "what is the largest
address that could possibly be mapped in this mm", the answer is
2^47-1-PAGE_SIZE [1] on present CPUs. If you want a prctl to return
that, then adding one *might* make sense. OTOH it's a bit unclear
what happens if your task is migrated to a hypothetical future CPU
with more address bits.

If you're a 32-bit process on x86, you have zero high bits free
because the address limit is above 2^31-1.
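
To make the arithmetic in the two paragraphs above concrete, here is a
rough userspace sketch. The helper names are made up for illustration,
and the constants fed to them (47 address bits, 4 KiB pages, and a
32-bit limit of 0xbfffffff as on a 3G/1G split) are assumptions, not
anything the kernel reports to you.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: 2^va_bits - 1 - page_size, the expression quoted
 * above. Both arguments are assumptions supplied by the caller.
 */
static uint64_t highest_mappable(unsigned va_bits, uint64_t page_size)
{
    assert(va_bits > 0 && va_bits < 64);
    return (UINT64_C(1) << va_bits) - 1 - page_size;
}

/*
 * Hypothetical helper: number of high pointer bits that are always zero
 * for addresses up to and including 'highest', i.e. the bits a tagging
 * scheme could try to borrow.
 */
static unsigned free_high_bits(uint64_t highest, unsigned pointer_bits)
{
    unsigned used = 0;

    /* Count the bits needed to represent the highest address. */
    while (highest != 0) {
        used++;
        highest >>= 1;
    }
    assert(used <= pointer_bits);
    return pointer_bits - used;
}
```

Under those assumptions, highest_mappable(47, 4096) is 0x7fffffffefff
and free_high_bits() of that value with 64-bit pointers is 17, while
free_high_bits(0xbfffffff, 32) is 0.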

If you're an x32 process, then (a) I'm surprised and (b) there might
be room for "what is the highest address that an mmap call done
without trickery would return". That could be added as well with a
suitably scary name in prctl. But this is still rather odd: x32
pointers are exactly 32 bits unless you write weird asm code to use
64-bit pointers, and you wouldn't do that because it defeats the whole
point of x32, which is to treat all pointers as exactly 32 bits. So an
x32 application should just hard-code 32 as the number of bits.

[1] That PAGE_SIZE offset has an interesting backstory involving some
overly clever Intel hardware designers and a root hole that, as far as
I know, affected every single x86_64 operating system.

--Andy