Re: [RFC 2/2] prctl: PR_SET_MM -- Introduce PR_SET_MM_MAP operation

From: Cyrill Gorcunov
Date: Fri Jul 11 2014 - 13:36:37 EST


On Wed, Jul 09, 2014 at 07:06:04PM +0400, Cyrill Gorcunov wrote:
>
> Thanks a lot for the comments, Kees! I tend to agree: leaving the @prctl_map
> variable out of the macros should also make the code shorter. I'll update it,
> that's not a problem. Could you please re-check whether I'm missing something
> in the security aspects when time permits?

I suppose this one should look better.
---
From: Cyrill Gorcunov <gorcunov@xxxxxxxxxx>
Subject: prctl: PR_SET_MM -- Introduce PR_SET_MM_MAP operation, v2

During development of c/r we've noticed that when we need to
support user namespaces we face a problem with capabilities in the
prctl(PR_SET_MM, ...) call: once a new user namespace
is created, capable(CAP_SYS_RESOURCE) no longer passes.

One approach is to eliminate the CAP_SYS_RESOURCE check but pass all
new values in one bundle, which allows the kernel to run more
thorough sanity tests on the values and at the same time allows us to
support checkpoint/restore of user namespaces.

Thus a new command, PR_SET_MM_MAP, is introduced. It takes a pointer to a
prctl_mm_map structure which carries all the members to be updated.

prctl(PR_SET_MM, PR_SET_MM_MAP, struct prctl_mm_map *, size)

struct prctl_mm_map {
__u64 start_code;
__u64 end_code;
__u64 start_data;
__u64 end_data;
__u64 start_brk;
__u64 brk;
__u64 start_stack;
__u64 arg_start;
__u64 arg_end;
__u64 env_start;
__u64 env_end;
__u64 *auxv;
__u32 auxv_size;
__u32 exe_fd;
};

All members except @exe_fd correspond to members of struct mm_struct.
To figure out which values these members may take, here
are the meanings of the members.

- start_code, end_code: represent bounds of executable code area
- start_data, end_data: represent bounds of data area
- start_brk, brk: used to calculate bounds for brk() syscall
- start_stack: used when accounting space needed for command
line arguments, environment and shmat() syscall
- arg_start, arg_end, env_start, env_end: represent memory area
supplied for command line arguments and environment variables
- auxv, auxv_size: carry the auxiliary vector, ELF format specifics
- exe_fd: file descriptor number for executable link (/proc/self/exe)

Thus we apply the following requirements to the values:

1) Any member except @auxv, @auxv_size, @exe_fd is an address
in user space, thus it must lie inside the [mmap_min_addr, mmap_max_addr)
interval.

2) While @[start|end]_code and @[start|end]_data may point to nonexistent
VMAs (say, a program maps its own new .text and .data segments during
execution), the rest of the members must belong to an existing VMA.

3) Addresses must be ordered, i.e. a @start_ member must be less than
the corresponding @end_ member.

4) As in the regular ELF loading procedure, we require that @start_brk and
@brk be greater than @end_data.

5) If the RLIMIT_DATA rlimit is set to a non-infinite value, the new values
must not exceed the existing limit. The same applies to RLIMIT_STACK.

6) Auxiliary vector size must not exceed existing one (which is
predefined as AT_VECTOR_SIZE and depends on architecture).

7) The file descriptor passed in @exe_fd should point
to an executable file (because we use the existing prctl_set_mm_exe_file_locked
helper, it ensures that the file we are going to use as the exe link has all
required permissions granted).
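
As a rough illustration of requirements 1), 3), 4) and the RLIMIT_DATA part
of 5), the checks can be sketched in plain userspace C. The names
mm_map_sketch, check_map_sketch and sample_map_sketch are hypothetical and
are not part of the patch; the VMA lookups, the RLIMIT_STACK check and the
@auxv/@exe_fd checks are left out, and the limit is passed in as a plain
argument instead of being read from the task.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical userspace mirror of the address members of prctl_mm_map. */
struct mm_map_sketch {
	uint64_t start_code, end_code;
	uint64_t start_data, end_data;
	uint64_t start_brk, brk;
	uint64_t start_stack;
	uint64_t arg_start, arg_end;
	uint64_t env_start, env_end;
};

/* Requirement 1: the address must lie inside [min_addr, max_addr). */
static int in_range_sketch(uint64_t addr, uint64_t min_addr, uint64_t max_addr)
{
	return addr >= min_addr && addr < max_addr;
}

/*
 * Returns 0 if the map passes the checks, -EINVAL otherwise.
 * data_limit stands in for RLIMIT_DATA; UINT64_MAX means "no limit".
 */
static int check_map_sketch(const struct mm_map_sketch *m,
			    uint64_t min_addr, uint64_t max_addr,
			    uint64_t data_limit)
{
	assert(m != NULL);

	const uint64_t addrs[] = {
		m->start_code, m->end_code, m->start_data, m->end_data,
		m->start_brk, m->brk, m->start_stack,
		m->arg_start, m->arg_end, m->env_start, m->env_end,
	};
	size_t i;

	for (i = 0; i < sizeof(addrs) / sizeof(addrs[0]); i++)
		if (!in_range_sketch(addrs[i], min_addr, max_addr))
			return -EINVAL;

	/* Requirement 3: each @start_ member is below its @end_ member. */
	if (m->start_code >= m->end_code || m->start_data >= m->end_data ||
	    m->arg_start >= m->arg_end || m->env_start >= m->env_end)
		return -EINVAL;

	/* Requirement 4: the heap lies above the data area. */
	if (m->start_brk <= m->end_data || m->brk <= m->end_data)
		return -EINVAL;

	/* Requirement 5, RLIMIT_DATA part: heap plus data size within limit. */
	if (data_limit != UINT64_MAX &&
	    (m->brk - m->start_brk) + (m->end_data - m->start_data) > data_limit)
		return -EINVAL;

	return 0;
}

/* Made-up layout that satisfies all of the checks above. */
static struct mm_map_sketch sample_map_sketch(void)
{
	struct mm_map_sketch m = {
		.start_code = 0x400000, .end_code = 0x401000,
		.start_data = 0x600000, .end_data = 0x601000,
		.start_brk = 0x700000, .brk = 0x721000,
		.start_stack = 0x7ffd0000,
		.arg_start = 0x7ffd0100, .arg_end = 0x7ffd0200,
		.env_start = 0x7ffd0200, .env_end = 0x7ffd0400,
	};
	return m;
}
```

The point of bundling the values is visible here: ordering and
heap-above-data can only be checked when all members arrive together.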

Now about where these members are involved inside kernel code:

- @start_code and @end_code are used in /proc/$pid/[stat|statm] output;

- @start_data and @end_data are used in /proc/$pid/[stat|statm] output;
they are also considered when checking whether there is enough space
for the brk() syscall result if RLIMIT_DATA is set;

- @start_brk is shown in /proc/$pid/stat output and accounted in the brk()
syscall if RLIMIT_DATA is set; this member is also tested to
find a symbolic name of an mmap event for the perf system (we check
whether the event is generated for the "heap" area); one more application is
selinux -- we test whether a process has the PROCESS__EXECHEAP permission
when it tries to make the heap area executable with the mprotect() syscall;

- @brk is the current value for the brk() syscall, which lies inside the heap
area; it's shown in /proc/$pid/stat. When the brk() syscall successfully
provides a new memory area to user space, upon brk() completion
mm::brk is updated to carry the new value;

Both @start_brk and @brk are actively used in /proc/$pid/maps
and /proc/$pid/smaps output to find a symbolic name "heap" for
VMA being scanned;

- @start_stack is printed out in /proc/$pid/stat and used to
find a symbolic name "stack" for the task and threads in
/proc/$pid/maps and /proc/$pid/smaps output, and, the same
as with @start_brk, the perf system uses it for event naming.
The kernel also treats this member as a start address of where
to map vDSO pages and uses it to check if there is enough space
for the shmat() syscall;

- @arg_start, @arg_end, @env_start and @env_end are printed out
in /proc/$pid/stat. Another access to the data these members
represent is to read /proc/$pid/environ or /proc/$pid/cmdline.
The kernel checks any attempt to read these areas with the access_process_vm
helper, so a user must have sufficient rights for this action;

- @auxv and @auxv_size may be read from /proc/$pid/auxv. Strictly
speaking, the kernel doesn't care much about exactly which data is
sitting there because it is solely for userspace;

- @exe_fd is referred to from /proc/$pid/exe and when generating
a coredump. We use the prctl_set_mm_exe_file_locked helper to update
this member, so exe-file link modification remains a one-shot
action.

Still, note that updating the exe-file link no longer requires the sys-resource
capability; after all, there is not much profit in preventing a process from
setting up its own file link (there are a number of ways to execute one's own
code -- ptrace, ld-preload -- so the only reliable way to find out exactly
which code is executed is to inspect the running program's memory).
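
Going back to the @start_brk/@brk item above: the "heap" naming for
/proc/$pid/maps boils down to an overlap test between a VMA and the
[start_brk, brk] range. The sketch below is only an illustration of that
test as I understand it; vma_sketch, heap_sketch and vma_is_heap_sketch
are hypothetical names, not kernel symbols.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the bounds of a VMA: [vm_start, vm_end). */
struct vma_sketch {
	uint64_t vm_start, vm_end;
};

/* Hypothetical stand-in for the two heap members of mm_struct. */
struct heap_sketch {
	uint64_t start_brk, brk;
};

/* Non-zero if the VMA overlaps the heap range and would be named "heap". */
static int vma_is_heap_sketch(const struct vma_sketch *vma,
			      const struct heap_sketch *heap)
{
	assert(vma != NULL && heap != NULL);
	return vma->vm_start <= heap->brk && vma->vm_end >= heap->start_brk;
}
```

This is why wrong @start_brk/@brk values after restore would show up as a
mislabelled or missing "heap" entry rather than as a functional failure.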

I believe the old interface should be deprecated and removed
in a couple of kernel releases if no one objects.

To test whether the new interface is implemented in the kernel, one
can pass the PR_SET_MM_MAP_SIZE opcode and the kernel returns
the size of the currently supported struct prctl_mm_map.
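
From userspace the probe could look roughly like the sketch below. The
helper name probe_mm_map_size_sketch is hypothetical; the fallback
definition of PR_SET_MM_MAP_SIZE uses the value proposed in this patch,
and the size is assumed to be written through the third prctl() argument
as an unsigned int, as the patch does.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/prctl.h>

#ifndef PR_SET_MM
# define PR_SET_MM		35
#endif
#ifndef PR_SET_MM_MAP_SIZE
# define PR_SET_MM_MAP_SIZE	15	/* value proposed by this patch */
#endif

/*
 * Asks the kernel for the size of struct prctl_mm_map it supports.
 * Returns 0 and fills *size_out on success, or -errno if the kernel
 * does not implement the opcode (or rejects the call); *size_out is
 * left at 0 in that case.
 */
static int probe_mm_map_size_sketch(unsigned int *size_out)
{
	assert(size_out != NULL);

	*size_out = 0;
	if (prctl(PR_SET_MM, PR_SET_MM_MAP_SIZE,
		  (unsigned long)size_out, 0UL, 0UL) < 0)
		return -errno;
	return 0;
}
```

A restorer would compare the returned size with its own
sizeof(struct prctl_mm_map) and fall back to the old per-member
interface when the call fails or the sizes differ.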

v2:
- compact macros (by keescook@)
- wrap new code with CONFIG_ (by akpm@)

Signed-off-by: Cyrill Gorcunov <gorcunov@xxxxxxxxxx>
Cc: Kees Cook <keescook@xxxxxxxxxxxx>
Cc: Tejun Heo <tj@xxxxxxxxxx>
Cc: Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx>
Cc: Andrew Vagin <avagin@xxxxxxxxxx>
Cc: Eric W. Biederman <ebiederm@xxxxxxxxxxxx>
Cc: H. Peter Anvin <hpa@xxxxxxxxx>
Cc: Serge Hallyn <serge.hallyn@xxxxxxxxxxxxx>
Cc: Pavel Emelyanov <xemul@xxxxxxxxxxxxx>
Cc: Vasiliy Kulikov <segoon@xxxxxxxxxxxx>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@xxxxxxxxxxxxxx>
Cc: Michael Kerrisk <mtk.manpages@xxxxxxxxx>
---
include/uapi/linux/prctl.h | 25 +++++
kernel/sys.c | 194 ++++++++++++++++++++++++++++++++++++++++++++-
2 files changed, 218 insertions(+), 1 deletion(-)

Index: linux-2.6.git/include/uapi/linux/prctl.h
===================================================================
--- linux-2.6.git.orig/include/uapi/linux/prctl.h
+++ linux-2.6.git/include/uapi/linux/prctl.h
@@ -119,6 +119,31 @@
# define PR_SET_MM_ENV_END 11
# define PR_SET_MM_AUXV 12
# define PR_SET_MM_EXE_FILE 13
+# define PR_SET_MM_MAP 14
+# define PR_SET_MM_MAP_SIZE 15
+
+/*
+ * This structure provides a new memory descriptor
+ * map which mostly modifies /proc/pid/stat[m]
+ * output for a task. This is mostly done for the
+ * sake of checkpoint/restore functionality.
+ */
+struct prctl_mm_map {
+ __u64 start_code; /* code section bounds */
+ __u64 end_code;
+ __u64 start_data; /* data section bounds */
+ __u64 end_data;
+ __u64 start_brk; /* heap for brk() syscall */
+ __u64 brk;
+ __u64 start_stack; /* stack starts at */
+ __u64 arg_start; /* command line arguments bounds */
+ __u64 arg_end;
+ __u64 env_start; /* environment variables bounds */
+ __u64 env_end;
+ __u64 *auxv; /* auxiliary vector */
+ __u32 auxv_size; /* vector size */
+ __u32 exe_fd; /* /proc/$pid/exe link file */
+};

/*
* Set specific pid that is allowed to ptrace the current task.
Index: linux-2.6.git/kernel/sys.c
===================================================================
--- linux-2.6.git.orig/kernel/sys.c
+++ linux-2.6.git/kernel/sys.c
@@ -1687,6 +1687,191 @@ exit:
return err;
}

+#ifdef CONFIG_CHECKPOINT_RESTORE
+/*
+ * WARNING: we don't require any capability here so be very careful
+ * in what is allowed for modification from userspace.
+ */
+static int validate_prctl_map_locked(struct prctl_mm_map *prctl_map)
+{
+ unsigned long mmap_max_addr = TASK_SIZE;
+ struct mm_struct *mm = current->mm;
+ struct vm_area_struct *stack_vma;
+ unsigned long rlim;
+ int error = 0;
+
+ /*
+ * Make sure the members are not somewhere outside
+ * of allowed address space.
+ */
+#define __prctl_check_addr_space(__member) \
+ ({ \
+ int __rc; \
+ if ((unsigned long)prctl_map->__member < mmap_max_addr && \
+ (unsigned long)prctl_map->__member >= mmap_min_addr) \
+ __rc = 0; \
+ else \
+ __rc = -EINVAL; \
+ __rc; \
+ })
+ error |= __prctl_check_addr_space(start_code);
+ error |= __prctl_check_addr_space(end_code);
+ error |= __prctl_check_addr_space(start_data);
+ error |= __prctl_check_addr_space(end_data);
+ error |= __prctl_check_addr_space(start_stack);
+ error |= __prctl_check_addr_space(start_brk);
+ error |= __prctl_check_addr_space(brk);
+ error |= __prctl_check_addr_space(arg_start);
+ error |= __prctl_check_addr_space(arg_end);
+ error |= __prctl_check_addr_space(env_start);
+ error |= __prctl_check_addr_space(env_end);
+ if (error)
+ goto out;
+#undef __prctl_check_addr_space
+
+ /*
+ * Stack, brk, command line arguments and environment must exist.
+ */
+ stack_vma = find_vma(mm, (unsigned long)prctl_map->start_stack);
+ if (!stack_vma) {
+ error = -EINVAL;
+ goto out;
+ }
+#define __prctl_check_vma(__member) \
+ find_vma(mm, (unsigned long)prctl_map->__member) ? 0 : -EINVAL
+ error |= __prctl_check_vma(start_brk);
+ error |= __prctl_check_vma(brk);
+ error |= __prctl_check_vma(arg_start);
+ error |= __prctl_check_vma(arg_end);
+ error |= __prctl_check_vma(env_start);
+ error |= __prctl_check_vma(env_end);
+ if (error)
+ goto out;
+#undef __prctl_check_vma
+
+ /*
+ * Make sure the pairs are ordered.
+ */
+#define __prctl_check_order(__m1, __m2) \
+ ((unsigned long)prctl_map->__m2 > \
+ (unsigned long)prctl_map->__m1) ? 0 : -EINVAL
+ error |= __prctl_check_order(start_code, end_code);
+ error |= __prctl_check_order(start_data, end_data);
+ error |= __prctl_check_order(arg_start, arg_end);
+ error |= __prctl_check_order(env_start, env_end);
+ if (error)
+ goto out;
+#undef __prctl_check_order
+
+ error = -EINVAL;
+
+ /*
+ * @brk should be after @end_data in traditional maps.
+ */
+ if (prctl_map->start_brk <= prctl_map->end_data ||
+ prctl_map->brk <= prctl_map->end_data)
+ goto out;
+
+ /*
+ * Nor should we allow overriding the limits if they are set.
+ */
+ rlim = rlimit(RLIMIT_DATA);
+ if (rlim < RLIM_INFINITY) {
+ if ((prctl_map->brk - prctl_map->start_brk) +
+ (prctl_map->end_data - prctl_map->start_data) > rlim)
+ goto out;
+ }
+
+ rlim = rlimit(RLIMIT_STACK);
+ if (rlim < RLIM_INFINITY) {
+#ifdef CONFIG_STACK_GROWSUP
+ unsigned long left = stack_vma->vm_end - prctl_map->start_stack;
+#else
+ unsigned long left = prctl_map->start_stack - stack_vma->vm_start;
+#endif
+ if (left > rlim)
+ goto out;
+ }
+
+ /*
+ * Someone is trying to cheat the auxv vector.
+ */
+ if (prctl_map->auxv && prctl_map->auxv_size > sizeof(mm->saved_auxv))
+ goto out;
+ error = 0;
+out:
+ return error;
+}
+
+static int prctl_set_mm_map(int opt, const void __user *addr, unsigned long data_size)
+{
+ struct prctl_mm_map prctl_map = { .exe_fd = (u32)-1, };
+ unsigned long user_auxv[AT_VECTOR_SIZE];
+ struct mm_struct *mm = current->mm;
+ int error = -EINVAL;
+
+ BUILD_BUG_ON(sizeof(user_auxv) != sizeof(mm->saved_auxv));
+
+ if (opt == PR_SET_MM_MAP_SIZE)
+ return put_user((unsigned int)sizeof(prctl_map),
+ (unsigned int __user *)addr);
+
+ if (data_size != sizeof(prctl_map))
+ return -EINVAL;
+
+ if (copy_from_user(&prctl_map, addr, sizeof(prctl_map)))
+ return -EFAULT;
+
+ down_read(&mm->mmap_sem);
+
+ if (validate_prctl_map_locked(&prctl_map))
+ goto out;
+
+ if (prctl_map.auxv && prctl_map.auxv_size) {
+ up_read(&mm->mmap_sem);
+ memset(user_auxv, 0, sizeof(user_auxv));
+ error = copy_from_user(user_auxv,
+ (const void __user *)prctl_map.auxv,
+ prctl_map.auxv_size) ? -EFAULT : 0;
+ down_read(&mm->mmap_sem);
+ if (error)
+ goto out;
+ }
+
+ if (prctl_map.exe_fd != (u32)-1) {
+ error = prctl_set_mm_exe_file_locked(mm, prctl_map.exe_fd);
+ if (error)
+ goto out;
+ }
+
+ if (prctl_map.auxv && prctl_map.auxv_size) {
+ user_auxv[AT_VECTOR_SIZE - 2] = 0;
+ user_auxv[AT_VECTOR_SIZE - 1] = 0;
+
+ task_lock(current);
+ memcpy(mm->saved_auxv, user_auxv, sizeof(user_auxv));
+ task_unlock(current);
+ }
+
+ mm->start_code = prctl_map.start_code;
+ mm->end_code = prctl_map.end_code;
+ mm->start_data = prctl_map.start_data;
+ mm->end_data = prctl_map.end_data;
+ mm->start_brk = prctl_map.start_brk;
+ mm->brk = prctl_map.brk;
+ mm->start_stack = prctl_map.start_stack;
+ mm->arg_start = prctl_map.arg_start;
+ mm->arg_end = prctl_map.arg_end;
+ mm->env_start = prctl_map.env_start;
+ mm->env_end = prctl_map.env_end;
+
+ error = 0;
+out:
+ up_read(&mm->mmap_sem);
+ return error;
+}
+#endif /* CONFIG_CHECKPOINT_RESTORE */
+
static int prctl_set_mm(int opt, unsigned long addr,
unsigned long arg4, unsigned long arg5)
{
@@ -1695,9 +1880,16 @@ static int prctl_set_mm(int opt, unsigne
struct vm_area_struct *vma;
int error;

- if (arg5 || (arg4 && opt != PR_SET_MM_AUXV))
+ if (arg5 || (arg4 && (opt != PR_SET_MM_AUXV &&
+ opt != PR_SET_MM_MAP &&
+ opt != PR_SET_MM_MAP_SIZE)))
return -EINVAL;

+#ifdef CONFIG_CHECKPOINT_RESTORE
+ if (opt == PR_SET_MM_MAP || opt == PR_SET_MM_MAP_SIZE)
+ return prctl_set_mm_map(opt, (const void __user *)addr, arg4);
+#endif
+
if (!capable(CAP_SYS_RESOURCE))
return -EPERM;
