Re: get_arg_page() && ptr_size accounting

From: Kees Cook
Date: Tue Sep 11 2018 - 15:06:11 EST


Oh, I like this patch! This is much cleaner. Notes below...

On Tue, Sep 11, 2018 at 7:13 AM, Oleg Nesterov <oleg@xxxxxxxxxx> wrote:
> diff --git a/fs/exec.c b/fs/exec.c
> index 1ebf6e5..7804a5c 100644
> --- a/fs/exec.c
> +++ b/fs/exec.c
> @@ -218,55 +218,10 @@ static struct page *get_arg_page(struct linux_binprm *bprm, unsigned long pos,
> if (ret <= 0)
> return NULL;
>
> - if (write) {
> - unsigned long size = bprm->vma->vm_end - bprm->vma->vm_start;
> - unsigned long ptr_size, limit;
> -
> - /*
> - * Since the stack will hold pointers to the strings, we
> - * must account for them as well.
> - *
> - * The size calculation is the entire vma while each arg page is
> - * built, so each time we get here it's calculating how far it
> - * is currently (rather than each call being just the newly
> - * added size from the arg page). As a result, we need to
> - * always add the entire size of the pointers, so that on the
> - * last call to get_arg_page() we'll actually have the entire
> - * correct size.
> - */
> - ptr_size = (bprm->argc + bprm->envc) * sizeof(void *);
> - if (ptr_size > ULONG_MAX - size)
> - goto fail;
> - size += ptr_size;
> -
> - acct_arg_size(bprm, size / PAGE_SIZE);
> -
> - /*
> - * We've historically supported up to 32 pages (ARG_MAX)
> - * of argument strings even with small stacks
> - */
> - if (size <= ARG_MAX)
> - return page;
> -
> - /*
> - * Limit to 1/4 of the max stack size or 3/4 of _STK_LIM
> - * (whichever is smaller) for the argv+env strings.
> - * This ensures that:
> - * - the remaining binfmt code will not run out of stack space,
> - * - the program will have a reasonable amount of stack left
> - * to work from.
> - */
> - limit = _STK_LIM / 4 * 3;
> - limit = min(limit, bprm->rlim_stack.rlim_cur / 4);
> - if (size > limit)
> - goto fail;
> - }
> + if (write)
> + acct_arg_size(bprm, vma_pages(bprm->vma));
>
> return page;
> -
> -fail:
> - put_page(page);
> - return NULL;
> }
>
> static void put_arg_page(struct page *page)
> @@ -410,11 +365,6 @@ static int bprm_mm_init(struct linux_binprm *bprm)
> if (!mm)
> goto err;
>
> - /* Save current stack limit for all calculations made during exec. */
> - task_lock(current->group_leader);
> - bprm->rlim_stack = current->signal->rlim[RLIMIT_STACK];
> - task_unlock(current->group_leader);
> -

I would prefer this hunk stay here since it will be more robust
against weird arch-specific things happening against rlim_stack. I had
to clean up some of these tests in arch code, so I'm nervous about
moving this further away. Here is before we call arch_bprm_mm_init(),
and I think it's better to do this as early as possible.

> err = __bprm_mm_init(bprm);
> if (err)
> goto err;
> @@ -492,6 +442,27 @@ static int count(struct user_arg_ptr argv, int max)
> return i;
> }
>
> +static int prepare_rlim_stack(struct linux_binprm *bprm)

How about collapsing this with:

bprm->argc = count(argv, MAX_ARG_STRINGS);
if ((retval = bprm->argc) < 0)
goto out;

bprm->envc = count(envp, MAX_ARG_STRINGS);
if ((retval = bprm->envc) < 0)
goto out;

and call it prepare_arg_count(struct linux_binprm *bprm,
struct user_arg_ptr argv,
struct user_arg_ptr envp)

> +{
> + unsigned long limit, ptr_size;
> +
> + task_lock(current->group_leader);
> + bprm->rlim_stack = current->signal->rlim[RLIMIT_STACK];
> + task_unlock(current->group_leader);
> +

Let's try to retain the comments here:

/*
* Limit to 1/4 of the max stack size or 3/4 of _STK_LIM
* (whichever is smaller) for the argv+env strings.
* This ensures that:
* - the remaining binfmt code will not run out of stack space,
* - the program will have a reasonable amount of stack left
* to work from.
*/
> + limit = _STK_LIM / 4 * 3;
> + limit = min(limit, bprm->rlim_stack.rlim_cur / 4);

and here:

/*
* We've historically supported up to 32 pages (ARG_MAX)
* of argument strings even with small stacks
*/
> + limit = max(limit, (unsigned long)ARG_MAX);
> + /* COMMENT */

This comment should likely be something like:

/*
* We must account for the size of all the argv and envp pointers to
* the argv and envp strings, since they will also take up space in
* the stack. They aren't stored until much later when we can't
* signal to the parent that the child has run out of stack space.
* Instead, calculate it here so it's possible to fail gracefully.
*/

> + ptr_size = (bprm->argc + bprm->envc) * sizeof(void *);

BTW, in re-reading create_elf_tables() and its calculation of "items",
I realize the above should actually include the trailing NULL pointers
and argc, so it should be:

ptr_size = (1 + bprm->argc + 1 + bprm->envc + 1) * sizeof(void *);

> + if (limit <= ptr_size)
> + return -E2BIG;
> + limit -= ptr_size;
> +
> + bprm->p_min = bprm->p - limit;
> + return 0;
> +}
> +
> /*
> * 'copy_strings()' copies argument/environment strings from the old
> * processes's memory to the new process's stack. The call to get_user_pages()
> @@ -527,6 +498,8 @@ static int copy_strings(int argc, struct user_arg_ptr argv,
> pos = bprm->p;
> str += len;
> bprm->p -= len;
> + if (bprm->p <= bprm->p_min)
> + goto out;
>
> while (len > 0) {
> int offset, bytes_to_copy;
> @@ -1801,6 +1774,10 @@ static int __do_execve_file(int fd, struct filename *filename,
> if (retval < 0)
> goto out;
>
> + retval = prepare_rlim_stack(bprm);
> + if (retval < 0)
> + goto out;
> +
> retval = copy_strings_kernel(1, &bprm->filename, bprm);
> if (retval < 0)
> goto out;
> diff --git a/include/linux/binfmts.h b/include/linux/binfmts.h
> index c05f24f..423e8c1 100644
> --- a/include/linux/binfmts.h
> +++ b/include/linux/binfmts.h
> @@ -24,7 +24,7 @@ struct linux_binprm {
> struct page *page[MAX_ARG_PAGES];
> #endif
> struct mm_struct *mm;
> - unsigned long p; /* current top of mem */
> + unsigned long p, p_min; /* current top of mem */

Can you split this out to a separate line (with a new comment) to
avoid comment-confusion? Something like:

unsigned long p; /* current top of mem */
unsigned long p_min; /* the minimum allowed mem position */

> unsigned int
> /*
> * True after the bprm_set_creds hook has been called once
>

I've also spent some more time convincing myself again that there
aren't races between count(), copy_strings(), and create_elf_tables().
A malicious parent would only be able to zero-fill the stack of the
child, but not escape the counts that I can see.

-Kees

--
Kees Cook
Pixel Security