Re: [PATCH v2 1/2] Limit dump_pipe program's permission to init for container

From: Andrei Vagin
Date: Fri Aug 05 2016 - 02:32:42 EST


On Tue, Aug 2, 2016 at 2:08 AM, Zhao Lei <zhaolei@xxxxxxxxxxxxxx> wrote:
> Currently when we set core_pattern to a pipe, the pipe program is
> forked by kthread running with root's permission, and write dumpfile
> into host's filesystem.
> Same thing happened for container, the dumper and dumpfile are also
> in host(not in container).
>
> It have following program:
> 1: Not consistent with file_type core_pattern
> When we set core_pattern to a file, the container will write dump
> into container's filesystem instead of host.
> 2: Not safe for privileged container
> In a privileged container, user can destroy host system by following
> command:
> # # In a container
> # echo "|/bin/dd of=/boot/vmlinuz" >/proc/sys/kernel/core_pattern
> # make_dump
>
> This patch switch dumper program's environment to init task, so, for
> container, dumper program have same environment with init task in
> container, which make dumper program put in container's filesystem, and
> write coredump into container's filesystem.
> The dumper's permission is also limited into subset of container's init
> process.

Does it change the current behavior? A pid namespace may be used for
sandboxes. For example, chrome uses it. In this case, we probably want
to collect core dumps from a root pid namespace.

If we are going to virtualize core_pattern relative to the pid
namespace, maybe it's better to make it optional for a pid namespace.
I mean it core_pattern is not set for the current pid namespace, we
can check a parent pid namespace and so on. A helper will be executed
in a pid namespace, where core_pattern is set.

After reading these patches, I think it may be a good idea to add one
more mode to handle core files, when a core file is sent to an
existing process instead of executing a usermode helpers. This mode
will cover of use-cases where the pipe mode is used now. And it looks
more secure, because in this case we control namespaces and credential
for the abort daemon (a process which handles core files).u

How it can be done. An abort daemon creates an abstract listening unix
socket and writes its name into /proc/sys/kernel/core_pattern. The
kernel saves this name and the network namespace. Then when any
process is crashed, the kernel creates a new unix socket and connect
it to the socket of the abort daemon and streams a core file into this
socket.

>
> Suggested-by: Eric W. Biederman <ebieyderm@xxxxxxxxxxxx>
> Suggested-by: KOSAKI Motohiro <kosaki.motohiro@xxxxxxxxxxxxxx>
>
> Signed-off-by: Zhao Lei <zhaolei@xxxxxxxxxxxxxx>
> ---
> fs/coredump.c | 87 ++++++++++++++++++++++++++++++++++++++++++++++++-
> include/linux/binfmts.h | 1 +
> 2 files changed, 87 insertions(+), 1 deletion(-)
>
> diff --git a/fs/coredump.c b/fs/coredump.c
> index 281b768..8511267 100644
> --- a/fs/coredump.c
> +++ b/fs/coredump.c
> @@ -516,6 +516,8 @@ static int umh_pipe_setup(struct subprocess_info *info, struct cred *new)
> {
> struct file *files[2];
> struct coredump_params *cp = (struct coredump_params *)info->data;
> + struct task_struct *base_task;
> +
> int err = create_pipe_files(files, 0);
> if (err)
> return err;
> @@ -524,10 +526,79 @@ static int umh_pipe_setup(struct subprocess_info *info, struct cred *new)
>
> err = replace_fd(0, files[0], 0);
> fput(files[0]);
> + if (err)
> + return err;
> +
> /* and disallow core files too */
> current->signal->rlim[RLIMIT_CORE] = (struct rlimit){1, 1};
>
> - return err;
> + base_task = cp->base_task;
> + if (base_task) {
> + const struct cred *base_cred;
> +
> + /* Set fs_root to base_task */
> + spin_lock(&base_task->fs->lock);
> + set_fs_root(current->fs, &base_task->fs->root);
> + spin_unlock(&base_task->fs->lock);
> +
> + /* Set namespaces to base_task */
> + get_nsproxy(base_task->nsproxy);
> + switch_task_namespaces(current, base_task->nsproxy);

This task will continue running in the current pid namespace, because
switch_task_namespaces doesn't change a pid for the task. Ussualy, we
need to make setns+fork to switch a pid namespace.

> +
> + /* Set cgroup to base_task */
> + current->flags &= ~PF_NO_SETAFFINITY;
> + err = cgroup_attach_task_all(base_task, current);
> + if (err < 0)
> + return err;
> +
> + /* Set cred to base_task */
> + base_cred = get_task_cred(base_task);
> +

I think you can use prepare_kernel_cred here

> + new->uid = base_cred->uid;
> + new->gid = base_cred->gid;
> + new->suid = base_cred->suid;
> + new->sgid = base_cred->sgid;
> + new->euid = base_cred->euid;
> + new->egid = base_cred->egid;
> + new->fsuid = base_cred->fsuid;
> + new->fsgid = base_cred->fsgid;
> +
> + new->securebits = base_cred->securebits;
> +
> + new->cap_inheritable = base_cred->cap_inheritable;
> + new->cap_permitted = base_cred->cap_permitted;
> + new->cap_effective = base_cred->cap_effective;
> + new->cap_bset = base_cred->cap_bset;
> + new->cap_ambient = base_cred->cap_ambient;
> +
> + security_cred_free(new);
> +#ifdef CONFIG_SECURITY
> + new->security = NULL;
> +#endif
> + err = security_prepare_creds(new, base_cred, GFP_KERNEL);
> + if (err < 0) {
> + put_cred(base_cred);
> + return err;
> + }
> +
> + free_uid(new->user);
> + new->user = base_cred->user;
> + get_uid(new->user);
> +
> + put_user_ns(new->user_ns);
> + new->user_ns = base_cred->user_ns;
> + get_user_ns(new->user_ns);
> +
> + put_group_info(new->group_info);
> + new->group_info = base_cred->group_info;
> + get_group_info(new->group_info);
> +
> + put_cred(base_cred);
> +
> + validate_creds(new);
> + }
> +
> + return 0;
> }
>
> void do_coredump(const siginfo_t *siginfo)
> @@ -590,6 +661,7 @@ void do_coredump(const siginfo_t *siginfo)
>
> if (ispipe) {
> int dump_count;
> + struct task_struct *vinit_task;
> char **helper_argv;
> struct subprocess_info *sub_info;
>
> @@ -631,6 +703,14 @@ void do_coredump(const siginfo_t *siginfo)
> goto fail_dropcount;
> }
>
> + rcu_read_lock();
> + vinit_task = find_task_by_vpid(1);
> + rcu_read_unlock();
> + if (!vinit_task) {
> + printk(KERN_WARNING "failed getting init task info, skipping core dump\n");
> + goto fail_dropcount;
> + }
> +
> helper_argv = argv_split(GFP_KERNEL, cn.corename, NULL);
> if (!helper_argv) {
> printk(KERN_WARNING "%s failed to allocate memory\n",
> @@ -638,6 +718,10 @@ void do_coredump(const siginfo_t *siginfo)
> goto fail_dropcount;
> }
>
> + get_task_struct(vinit_task);
> +
> + cprm.base_task = vinit_task;
> +
> retval = -ENOMEM;
> sub_info = call_usermodehelper_setup(helper_argv[0],
> helper_argv, NULL, GFP_KERNEL,
> @@ -646,6 +730,7 @@ void do_coredump(const siginfo_t *siginfo)
> retval = call_usermodehelper_exec(sub_info,
> UMH_WAIT_EXEC);
>
> + put_task_struct(vinit_task);
> argv_free(helper_argv);
> if (retval) {
> printk(KERN_INFO "Core dump to |%s pipe failed\n",
> diff --git a/include/linux/binfmts.h b/include/linux/binfmts.h
> index 314b3ca..0c9a72c 100644
> --- a/include/linux/binfmts.h
> +++ b/include/linux/binfmts.h
> @@ -59,6 +59,7 @@ struct linux_binprm {
>
> /* Function parameter for binfmt->coredump */
> struct coredump_params {
> + struct task_struct *base_task;
> const siginfo_t *siginfo;
> struct pt_regs *regs;
> struct file *file;
> --
> 1.8.5.1
>
>
>
> _______________________________________________
> Containers mailing list
> Containers@xxxxxxxxxxxxxxxxxxxxxxxxxx
> https://lists.linuxfoundation.org/mailman/listinfo/containers