Re: [PATCH 0/2] namespaces: log namespaces per task

From: Serge Hallyn
Date: Fri May 02 2014 - 17:00:53 EST


Quoting Richard Guy Briggs (rgb@xxxxxxxxxx):
> On 14/05/02, Serge E. Hallyn wrote:
> > Quoting Richard Guy Briggs (rgb@xxxxxxxxxx):
> > > I saw no replies to my questions when I replied a year after Aris' posting, so
> > > I don't know if it was ignored or got lost in stale threads:
> > > https://www.redhat.com/archives/linux-audit/2013-March/msg00020.html
> > > https://www.redhat.com/archives/linux-audit/2013-March/msg00033.html
> > > (https://lists.linux-foundation.org/pipermail/containers/2013-March/032063.html)
> > > https://www.redhat.com/archives/linux-audit/2014-January/msg00180.html
> > >
> > > I've tried to answer a number of questions that were raised in that thread.
> > >
> > > The goal is not quite identical to Aris' patchset.
> > >
> > > The purpose is to track namespaces in use by logged processes from the
> > > perspective of init_*_ns. The first patch defines a function to list them.
> > > The second patch provides an example of usage for audit_log_task_info() which
> > > is used by syscall audits, among others. audit_log_task() and
> > > audit_common_recv_message() would be other potential use cases.
> > >
> > > Use a serial number per namespace (unique across one boot of one kernel)
> > > instead of the inode number (for which the right to change is claimed to
> > > have been reserved, and which is not necessarily unique if there is more
> > > than one proc fs). It could be argued that the inode numbers have now become
> > > a de facto interface and can't change now, but I'm proposing this approach to
> > > see if it helps address some of the objections to the earlier patchset.
> > >
> > > Messages could also be added to track the creation and the destruction
> > > of namespaces, listing the parent for hierarchical namespaces such as pidns,
> > > userns, and listing other ids for non-hierarchical namespaces, as well as other
> > > information to help identify a namespace.
> > >
> > > There has been some progress made for audit in net namespaces and pid
> > > namespaces since this previous thread. net namespaces are now served as peers
> > > by one auditd in the init_net namespace with processes in a non-init_net
> > > namespace being able to write records if they are in the init_user_ns and have
> > > CAP_AUDIT_WRITE. Processes in a non-init_pid_ns can now similarly write
> > > records. As for CAP_AUDIT_READ, I just posted a patchset to check capabilities
> > > of userspace processes that try to join netlink broadcast groups.
> > >
> > >
> > > Questions:
> > > Is there a way to link serial numbers of namespaces involved in migration of a
> > > container to another kernel? (I had a brief look at CRIU.) Is there a unique
> > > identifier for each running instance of a kernel? Or at least some identifier
> > > within the container migration realm?
> >
> > Eric Biederman has always been adamantly opposed to adding new namespaces
> > of namespaces, so the fact that you're asking this question concerns me.
>
> I have seen that position and I don't fully understand the justification
> for it other than added complexity.
>
> One way that occurred to me to be able to identify a kernel instance was
> to look at CPU serial numbers or other CPU entity intended to be
> globally unique, but that isn't universally available.

That's one issue, which is uniqueness of namespaces across machines.

But it gets worse if we consider that after allowing in-container audit,
we'll have a nested container running, then have the parent container
migrated to another host (or just checkpointed and restarted). Now the
nested container's indexes will all be changed. Is there any way audit
can track who's who after the migration?

That's not an indictment of the serial # approach, since (a) we don't
have in-container audit yet and (b) we don't have c/r/migration of nested
containers. But it's worth considering whether we can solve the issue
with serial #s, and, if not, whether we can solve it with any other
approach.

I guess one approach to solve it would be to allow userspace to request
a next serial #. Which will immediately lead us to a namespace of serial
#s (since the requested # might be lower than the last used one on the
new host).
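
To make that collision concrete, here is a minimal userspace sketch, not
code from the patchset. The names ns_serial_alloc, ns_serial_next and
ns_serial_request are hypothetical, and it assumes a single per-boot
counter that only ever moves forward:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-boot serial allocator; not the patchset's code. */
struct ns_serial_alloc {
	atomic_uint_fast64_t last;	/* last serial handed out, 0 = none yet */
};

/* Normal path: every new namespace gets the next unused serial. */
static uint64_t ns_serial_next(struct ns_serial_alloc *a)
{
	return atomic_fetch_add(&a->last, 1) + 1;
}

/*
 * Restore path: userspace asks for a specific serial.  With one flat
 * counter this can only succeed when the wanted value is above every
 * serial already handed out; anything lower may already be in use.
 */
static bool ns_serial_request(struct ns_serial_alloc *a, uint64_t want)
{
	uint_fast64_t cur = atomic_load(&a->last);

	while (want > cur) {
		/* On failure, cur is reloaded and the bound is rechecked. */
		if (atomic_compare_exchange_weak(&a->last, &cur, want))
			return true;
	}
	return false;
}
```

On a new host that has already handed out serials 1..3, a request for
serial 2 fails while a request for 10 succeeds and moves the counter to
10. The failing case is the one that would need a per-container mapping,
i.e. a namespace of serial #s.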

As you've said, inode #s for /proc/self/ns/* probably aren't sufficiently
unique, though perhaps we could attach a generation # for the sake of
audit. Then after a c/r/migration the generation # may be different,
but we may have a better shot at using at least the same ino#.
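
A rough sketch of what such an identifier might look like, purely as an
illustration. struct ns_audit_id and both helpers are hypothetical names,
and nothing like this exists in the posted patches:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical audit-side identifier: the ns inode # plus a generation #. */
struct ns_audit_id {
	uint64_t ino;	/* inode # as seen under /proc/<pid>/ns/ */
	uint32_t gen;	/* generation #, may change across c/r/migration */
};

/* Same namespace instance on the same kernel: both fields must match. */
static bool ns_id_same_instance(struct ns_audit_id a, struct ns_audit_id b)
{
	return a.ino == b.ino && a.gen == b.gen;
}

/*
 * Looser match for correlating records across a c/r/migration: only the
 * ino# is compared, on the assumption that the restore managed to keep it.
 */
static bool ns_id_same_ino(struct ns_audit_id a, struct ns_audit_id b)
{
	return a.ino == b.ino;
}
```

With this, records from before and after a restore would differ under the
strict comparison but could still be tied together by ino#, provided the
restore side really can preserve it.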

> Another possibility was RTC reading at time of boot, but that isn't good
> enough either.
>
> Both are dubious in VMs anyways.
>
> > The way things are right now, since audit belongs to the init userns,
> > we can get away with saying if a container 'migrates', the new kernel
> > will see a different set of serials, and no one should care. However,
> > if we're going to be allowing containers to have their own audit
> > namespace/layer/whatever, then this becomes more of a concern.
>
> Having a container have its own audit daemon (partitioned appropriately
> in the kernel) would be a long-term goal.

Agreed, fwiw.

> > That said, I'll now look at the patches while pretending that problem
> > does not exist :) If I ack, it'll be on correctness of the code, but
> > we'll still have to deal with this issue.
>
> Getting some discussion about this migration challenge was a significant
> motivation for posting this patch, so I'm hoping others will weigh in.
>
> Thanks for your review, Serge.
>
> > > What additional events should list this information?
> > >
> > > Does this present any kind of information leak? Only CAP_AUDIT_CONTROL (and
> > > proposed CAP_AUDIT_READ) in init_user_ns can get to this information in the
> > > init namespace at the moment.
> > >
> > >
> > > Proposed output format:
> > > This differs slightly from Aristeu's patch because of the label conflict with
> > > "pid=" that arises from including it in existing records rather than in a
> > > separate record:
> > > type=SYSCALL msg=audit(1398112249.996:65): arch=c000003e syscall=272 success=yes exit=0 a0=40000000 a1=ffffffffffffffff a2=0 a3=22 items=0 ppid=1 pid=566 auid=4294967295 uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 sgid=0 fsgid=0 tty=(none) ses=4294967295 comm="(t-daemon)" exe="/usr/lib/systemd/systemd" mntns=5 netns=97 utsns=2 ipcns=1 pidns=4 userns=3 subj=system_u:system_r:init_t:s0 key=(null)
> > >
> > >
> > > Note: This set does not try to solve the non-init namespace audit messages and
> > > auditd problem yet. That will come later, likely with additional auditd
> > > instances running in another namespace with a limited ability to influence the
> > > master auditd. I echo Eric B's idea that messages destined for different
> > > namespaces would have to be tailored for that namespace with references that
> > > make sense (such as the right pid number reported to that pid namespace, and
> > > not leaking info about parents or peers).
> > >
> > >
> > > Richard Guy Briggs (2):
> > > namespaces: give each namespace a serial number
> > > audit: log namespace serial numbers
> > >
> > > fs/mount.h | 1 +
> > > fs/namespace.c | 1 +
> > > include/linux/audit.h | 7 +++++++
> > > include/linux/ipc_namespace.h | 1 +
> > > include/linux/nsproxy.h | 8 ++++++++
> > > include/linux/pid_namespace.h | 1 +
> > > include/linux/user_namespace.h | 1 +
> > > include/linux/utsname.h | 1 +
> > > include/net/net_namespace.h | 1 +
> > > init/version.c | 1 +
> > > ipc/msgutil.c | 1 +
> > > ipc/namespace.c | 2 ++
> > > kernel/audit.c | 38 ++++++++++++++++++++++++++++++++++++++
> > > kernel/nsproxy.c | 24 ++++++++++++++++++++++++
> > > kernel/pid.c | 1 +
> > > kernel/pid_namespace.c | 2 ++
> > > kernel/user.c | 1 +
> > > kernel/user_namespace.c | 2 ++
> > > kernel/utsname.c | 2 ++
> > > net/core/net_namespace.c | 4 +++-
> > > 20 files changed, 99 insertions(+), 1 deletions(-)
> > >
>
> - RGB
>
> --
> Richard Guy Briggs <rbriggs@xxxxxxxxxx>
> Senior Software Engineer, Kernel Security, AMER ENG Base Operating Systems, Red Hat
> Remote, Ottawa, Canada
> Voice: +1.647.777.2635, Internal: (81) 32635, Alt: +1.613.693.0684x3545