RE: [PATCH] Make core_pattern support namespace

From: Zhao Lei
Date: Thu Feb 18 2016 - 06:14:00 EST


Hi, Mateusz Guzik

Thanks for your detailed comment and suggestion on this patch.

> -----Original Message-----
> From: Mateusz Guzik [mailto:mguzik@xxxxxxxxxx]
> Sent: Tuesday, February 16, 2016 10:26 PM
> To: Zhao Lei <zhaolei@xxxxxxxxxxxxxx>
> Cc: linux-kernel@xxxxxxxxxxxxxxx; containers@xxxxxxxxxxxxxxxxxxxxxxxxxx
> Subject: Re: [PATCH] Make core_pattern support namespace
>
> On Tue, Feb 16, 2016 at 07:33:39PM +0800, Zhao Lei wrote:
> > Currently, each container shared one copy of coredump setting
> > with the host system, if host system changed the setting, each
> > running containers will be affected.
> >
> > Moreover, it is not easy to let each container keeping their own
> > coredump setting.
> >
> > We can use some workaround as pipe program to make the second
> > requirement possible, but it is not simple, and both host and
> > container are limited to set to fixed pipe program.
> > In one word, for host running contailer, we can't change core_pattern
> > anymore.
> > To make the problem more hard, if a host running more than one
> > container product, each product will try to snatch the global
> > coredump setting to fit their own requirement.
> >
> > For container based on namespace design, it is good to allow
> > each container keeping their own coredump setting.
> >
> > It will bring us following benefit:
> > 1: Each container can change their own coredump setting
> > based on operation on /proc/sys/kernel/core_pattern
> > 2: Coredump setting changed in host will not affect
> > running containers.
> > 3: Support both case of "putting coredump in guest" and
> > "putting curedump in host".
> >
> > Each namespace-based software(lxc, docker, ..) can use this function
> > to custom their dump setting.
> >
> > And this function makes each continer working as separate system,
> > it fit for design goal of namespace.
> >
>
> Sorry if this is a false alarm, I don't have easy means to test it, but
> is not this an immediate privilege escalation?
>
> In particular, if the new pattern was to include a pipe, would not it
> cause the kernel to run whatever the namespace put in there, but on the
> "host"?
>
In my mind, if user don't run vm in with privilege request,
the container manager can do following processing:
1: User set custom core_pattern in starting container, as:
# lxc-start --core-pattern='xxx' -n vm01
or
# docker run --core-pattern='xxx' my_image
or
set CORE_PATTERN in vm's config file
2: Container manager(lxc or docker) mount /proc as rw in init period,
and set core_pattern.
3: Container manager remount /proc as ro, and give vm to user.
User/program in vm can not change core_pattern setting, just as lxc
and docker's current default behavor.

If user run vm with privilege request, it will be a problem.
User/process in vm can change core-pattern setting, if change to a file,
it is not a problem, because it is a file in vm's filesystem.
But if changed to a pipe, it is a problem in current kernel design,
the pipe program is in host's filesystem, it means the vm can change host.

In short:
Run vm in non-privilege | do anything | no problem
Run vm in privilege | change core_pattern to file | no problem
Run vm in privilege | change core_pattern to pipe | IS problem

As a solution, maybe we can modify kernel to search the pipe program
in vm's filesystem, and the pipe problem runs in vm's fs context,
so a vm can only put dump file in its private filesystem, except vm
manager link the dir into host.

Another solution is to only allow vm change core_pattern to file(avoid pipe),
but is looks strange, the vm should have same core_dump setting function
with normal system.

> The feature definitely looks useful, but this is another privileged
> operation which is now made available to unprivileged users. This poses
> some questions:
> - should the namespace be allowed to set anything it wants, including
> pipes? I would argue the latter should be disallowed for simplicity
Yes, current patch does allow, it can be fixed in above way.

> - now that the pattern can change at any time by namespace root, it
> becomes fishy to parse it for filename generation without any
> protection against concurrent modification
Before this patch, kernel can works corrent, so I think kernel already
have lock to avoid changing and reading the cure_pattern buffer in
same time.
This patch only modify the buffer place, so the existence lock will
still can work, I'll confirm it in source.

> - how do you ensure that namespaces which did not explicitely set their
> pattern still get updates from the host?
>
> That said, I suggest adding an allocated buffer to hold it or be NULL
> otherwise. If the pointer is NULL, the "host" pattern is used.
>
> For dumping purposes the code could just kmalloc, rcu_read_lock, memcpy
> and be done with it. Or, if it is acceptable given the size, just have a
> buffer on the stack.
>
> Finally, the code setting it could simply xchg the pointer to avoid
> locks if there is no good lock to be taken here, and then kfree_rcu the
> old buffer.
>
Actually I considered about this way in doing patch, as:
HOST core_pattern=foo
+- VM1 core_pattern=NULL
+- VM1_1 core_pattern=NULL
VM1 and VM1_1 use nearext core_pattern setting in tree(foo).

And if VM1_1 changed core pattern to bar:
HOST core_pattern=foo
+- VM1 core_pattern=NULL
+- VM1_1 core_pattern=bar
VM1 use foo and VM1_1 use bar.

It means vm can always set core_pattern buffer pointer to NULL in creating,
until someone changes it.
But I finally selected to clone core_pattern in creating vm, because if we
see a vm as a independent system, its core_pattern should only changed by
process running in it, and after a vm got created, the host will not able
to change vm's internal setting, except login/run_command in vm's context.

Thanks
Zhaolei

> Just my $0,03.
>
> --
> Mateusz Guzik