Re: [PATCH net-next] modules: allow modprobe load regular elf binaries

From: Kees Cook
Date: Thu Mar 08 2018 - 19:24:42 EST


How is this not marked [RFC]? :)

On Mon, Mar 5, 2018 at 5:34 PM, Alexei Starovoitov <ast@xxxxxxxxxx> wrote:
> As the first step in development of bpfilter project [1] the request_module()
> code is extended to allow user mode helpers to be invoked. Idea is that
> user mode helpers are built as part of the kernel build and installed as
> traditional kernel modules with .ko file extension into distro specified
> location, such that from a distribution point of view, they are no different
> than regular kernel modules. Thus, allow request_module() logic to load such
> user mode helper (umh) modules via:
>
> request_module("foo") ->
> call_umh("modprobe foo") ->
> sys_finit_module(FD of /lib/modules/.../foo.ko) ->
> call_umh(struct file)

Yikes. This means autoloaded artbitrary exec() now, instead of
autoloading just loading arbitrary modules? That seems extremely bad.
This just extends all the problems we've had with defining security
boundaries with modules out to umh too. I would need some major
convincing that this can be made safe.

It's easy for container to trigger arbitrary module loading, so if it
can find/use a similar one of the bugs we've seen in the past to
redirect the module loading we don't just get new module-created
attack surface added to the kernel, but we again get arbitrary ELF
running in the root ns (not in the container). And in the insane case
of a container with CAP_SYS_MODULE, this is a trivial escape without
even needing to build a special kernel module: the root ns, running
with all privileges runs an arbitrary ELF.

As it stands, I have to NAK this. At the very least, you need to solve
the execution environment problems here: the ELF should run with no
greater privileges than what loaded the module, and very importantly,
must not be allowed to bypass these checks through autoloading. What
_triggered_ the autoload must be the environment, not the "modprobe",
since that's running with full privileges. Basically, we need to fix
module autoloading first, or at least a significant subset of the
problems: https://lkml.org/lkml/2017/11/27/754

> Such approach enables kernel to delegate functionality traditionally done
> by kernel modules into user space processes (either root or !root) and
> reduces security attack surface of such new code, meaning in case of
> potential bugs only the umh would crash but not the kernel. Another
> advantage coming with that would be that bpfilter.ko can be debugged and
> tested out of user space as well (e.g. opening the possibility to run
> all clang sanitizers, fuzzers or test suites for checking translation).
> Also, such architecture makes the kernel/user boundary very precise:
> control plane is done by the user space while data plane stays in the kernel.
>
> It's easy to distinguish "umh module" from traditional kernel module:
>
> $ readelf -h bld/net/bpfilter/bpfilter.ko|grep Type
> Type: EXEC (Executable file)

As Andy asked earlier, why not DYN too to catch PIE executables? Seems
like forcing the userspace helper to be non-PIE would defeat some of
the userspace defenses in use in most distros.

> $ readelf -h bld/net/ipv4/netfilter/iptable_filter.ko |grep Type
> Type: REL (Relocatable file)
>
> Since umh can crash, can be oom-ed by the kernel, killed by admin,
> the subsystem that uses them (like bpfilter) need to manage life
> time of umh on its own, so module infra doesn't do any accounting
> of them. They don't appear in "lsmod" and cannot be "rmmod".
> Multiple request_module("umh") will load multiple umh.ko processes.
>
> Similar to kernel modules the kernel will be tainted if "umh module"
> has invalid signature.
>
> [1] https://lwn.net/Articles/747551/
>
> Signed-off-by: Alexei Starovoitov <ast@xxxxxxxxxx>
> ---
> fs/exec.c | 40 +++++++++++++++++++++++++++++++---------
> include/linux/binfmts.h | 1 +
> include/linux/umh.h | 3 +++
> kernel/module.c | 43 ++++++++++++++++++++++++++++++++++++++-----
> kernel/umh.c | 24 +++++++++++++++++++++---
> 5 files changed, 94 insertions(+), 17 deletions(-)
>
> diff --git a/fs/exec.c b/fs/exec.c
> index 7eb8d21bcab9..0483c438de7d 100644
> --- a/fs/exec.c
> +++ b/fs/exec.c
> @@ -1691,14 +1691,13 @@ static int exec_binprm(struct linux_binprm *bprm)
> /*
> * sys_execve() executes a new program.
> */
> -static int do_execveat_common(int fd, struct filename *filename,
> - struct user_arg_ptr argv,
> - struct user_arg_ptr envp,
> - int flags)
> +static int __do_execve_file(int fd, struct filename *filename,
> + struct user_arg_ptr argv,
> + struct user_arg_ptr envp,
> + int flags, struct file *file)
> {
> char *pathbuf = NULL;
> struct linux_binprm *bprm;
> - struct file *file;
> struct files_struct *displaced;
> int retval;
>
> @@ -1737,7 +1736,8 @@ static int do_execveat_common(int fd, struct filename *filename,
> check_unsafe_exec(bprm);
> current->in_execve = 1;
>
> - file = do_open_execat(fd, filename, flags);
> + if (!file)
> + file = do_open_execat(fd, filename, flags);
> retval = PTR_ERR(file);
> if (IS_ERR(file))
> goto out_unmark;
> @@ -1745,7 +1745,9 @@ static int do_execveat_common(int fd, struct filename *filename,
> sched_exec();
>
> bprm->file = file;
> - if (fd == AT_FDCWD || filename->name[0] == '/') {
> + if (!filename) {
> + bprm->filename = "/dev/null";
> + } else if (fd == AT_FDCWD || filename->name[0] == '/') {
> bprm->filename = filename->name;
> } else {
> if (filename->name[0] == '\0')
> @@ -1811,7 +1813,8 @@ static int do_execveat_common(int fd, struct filename *filename,
> task_numa_free(current);
> free_bprm(bprm);
> kfree(pathbuf);
> - putname(filename);
> + if (filename)
> + putname(filename);
> if (displaced)
> put_files_struct(displaced);
> return retval;
> @@ -1834,10 +1837,29 @@ static int do_execveat_common(int fd, struct filename *filename,
> if (displaced)
> reset_files_struct(displaced);
> out_ret:
> - putname(filename);
> + if (filename)
> + putname(filename);
> return retval;
> }
>
> +static int do_execveat_common(int fd, struct filename *filename,
> + struct user_arg_ptr argv,
> + struct user_arg_ptr envp,
> + int flags)
> +{
> + struct file *file = NULL;
> +
> + return __do_execve_file(fd, filename, argv, envp, flags, file);
> +}
> +
> +int do_execve_file(struct file *file, void *__argv, void *__envp)
> +{
> + struct user_arg_ptr argv = { .ptr.native = __argv };
> + struct user_arg_ptr envp = { .ptr.native = __envp };
> +
> + return __do_execve_file(AT_FDCWD, NULL, argv, envp, 0, file);
> +}

The exec.c changes should be split into a separate patch. Something
feels incorrectly refactored here, though. Passing all three of fd,
filename, and file to __do_execve_file() seems wrong; perhaps the fd
to file handling needs to happen externally in what you have here as
do_execveat_common()? The resulting __do_execve_file() has repeated
conditionals on filename... maybe what I object to is being able to
pass a NULL filename at all. The filename can be (painfully)
reconstructed from the struct file.

> [...]
> diff --git a/kernel/module.c b/kernel/module.c
> index ad2d420024f6..6cfa35795741 100644
> --- a/kernel/module.c
> +++ b/kernel/module.c
> [...]
> @@ -3870,14 +3896,21 @@ SYSCALL_DEFINE3(finit_module, int, fd, const char __user *, uargs, int, flags)
> |MODULE_INIT_IGNORE_VERMAGIC))
> return -EINVAL;
>
> - err = kernel_read_file_from_fd(fd, &hdr, &size, INT_MAX,
> - READING_MODULE);
> + f = fdget(fd);
> + if (!f.file)
> + return -EBADF;
> +
> + err = kernel_read_file(f.file, &hdr, &size, INT_MAX, READING_MODULE);

For the LSM subsystem, I think this should also get it's own enum
kernel_read_file_id. This is really no longer a kernel module...

> if (err)
> - return err;
> + goto out;
> info.hdr = hdr;
> info.len = size;
> + info.file = f.file;
>
> - return load_module(&info, uargs, flags);
> + err = load_module(&info, uargs, flags);
> +out:
> + fdput(f);
> + return err;
> }
>
> static inline int within(unsigned long addr, void *start, unsigned long size)
> diff --git a/kernel/umh.c b/kernel/umh.c
> index 18e5fa4b0e71..4361c694bdb1 100644
> --- a/kernel/umh.c
> +++ b/kernel/umh.c
> @@ -97,9 +97,13 @@ static int call_usermodehelper_exec_async(void *data)
>
> commit_creds(new);
>
> - retval = do_execve(getname_kernel(sub_info->path),
> - (const char __user *const __user *)sub_info->argv,
> - (const char __user *const __user *)sub_info->envp);
> + if (sub_info->file)
> + retval = do_execve_file(sub_info->file,
> + sub_info->argv, sub_info->envp);
> + else
> + retval = do_execve(getname_kernel(sub_info->path),
> + (const char __user *const __user *)sub_info->argv,
> + (const char __user *const __user *)sub_info->envp);
> out:
> sub_info->retval = retval;
> /*
> @@ -393,6 +397,20 @@ struct subprocess_info *call_usermodehelper_setup(const char *path, char **argv,
> }
> EXPORT_SYMBOL(call_usermodehelper_setup);
>
> +struct subprocess_info *call_usermodehelper_setup_file(struct file *file)
> +{
> + struct subprocess_info *sub_info;
> +
> + sub_info = kzalloc(sizeof(struct subprocess_info), GFP_KERNEL);
> + if (!sub_info)
> + return NULL;
> +
> + INIT_WORK(&sub_info->work, call_usermodehelper_exec_work);
> + sub_info->path = "/dev/null";
> + sub_info->file = file;

This use of "/dev/null" here and in execve is just wrong. It _does_
have a path and filename...

Anyway, interesting idea. I think it _can_ work, it just needs much
much more careful security boundaries and to solve our autoloading
exposures too.

-Kees

--
Kees Cook
Pixel Security