Re: [RFC 0/4] RFC: Add Checmate, BPF-driven minor LSM

From: Sargun Dhillon
Date: Thu Aug 04 2016 - 06:14:21 EST


On Thu, Aug 04, 2016 at 11:45:08AM +0200, Daniel Borkmann wrote:
> Hi Sargun,
>
> On 08/04/2016 09:11 AM, Sargun Dhillon wrote:
> [...]
> >[It's a] minor LSM. My particular use case is one in which containers are being
> >dynamically deployed to machines by internal developers in a different group.
> [...]
> >For many of these containers, the security policies can be fairly nuanced. One
> >particular one to take into account is network security. Oftentimes,
> >administrators want to prevent ingress and egress connectivity except from a
> >few select IPs. Egress filtering can be managed using net_cls, but without
> >modifying running software, it's non-trivial to attach a filter to all sockets
> >being created within a container. The inet_conn_request, socket_recvmsg, and
> >socket_sock_rcv_skb hooks make this trivial to implement.
>
> I'm not too familiar with LSMs, but afaik, when you install such policies they
> are effectively global, right? How would you install/manage such policies per
> container?
>
> On a quick glance, this would then be the job of the BPF proglet from the global
> hook, no? If yes, then the BPF contexts the BPF prog works with seem rather quite
> limited ...
You're right, they are global hooks. If you want the policy to be specific to
a given cgroup or namespace, you'd have to introduce a level of indirection
through a prog_array or some such; a rough sketch of that is below. There are
still cases (the CVE example, and the Docker bind case) where you want global
isolation. The other big aspect is being able to implement application-specific
LSMs without requiring kmods (a la hardchroot).
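
To make that concrete, the indirection I have in mind looks roughly like the
sketch below. It's only a sketch: the section name, the map layout, and
bpf_current_in_cgroup() (the helper I propose further down, which doesn't
exist yet) are assumptions, not part of this RFC.

#include <uapi/linux/bpf.h>
#include "bpf_helpers.h"

/* The dispatcher never dereferences the ctx, so a forward declaration of
 * the struct from the patch is enough here. */
struct checmate_socket_connect_ctx;

/* Proposed helper, a la bpf_skb_in_cgroup(); the helper ID is made up
 * because the helper doesn't exist yet. */
static int (*bpf_current_in_cgroup)(void *map, int index) = (void *) 64;

/* cgroup of each container, populated by the loader. */
struct bpf_map_def SEC("maps") container_cgroups = {
        .type        = BPF_MAP_TYPE_CGROUP_ARRAY,
        .key_size    = sizeof(__u32),
        .value_size  = sizeof(__u32),
        .max_entries = 32,
};

/* Per-container policy program at the matching index. */
struct bpf_map_def SEC("maps") container_progs = {
        .type        = BPF_MAP_TYPE_PROG_ARRAY,
        .key_size    = sizeof(__u32),
        .value_size  = sizeof(__u32),
        .max_entries = 32,
};

SEC("checmate/socket_connect")
int dispatch_connect(struct checmate_socket_connect_ctx *ctx)
{
        /* If 'current' is in the container whose cgroup sits at slot 0,
         * hand the decision off to that container's policy program. */
        if (bpf_current_in_cgroup(&container_cgroups, 0))
                bpf_tail_call(ctx, &container_progs, 0);

        /* Not in that container, or no program installed: allow. */
        return 0;
}

The programs in container_progs would then implement the actual per-container
policy.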

>
> +struct checmate_file_open_ctx {
> + struct file *file;
> + const struct cred *cred;
> +};
> +
> +struct checmate_task_create_ctx {
> + unsigned long clone_flags;
> +};
> +
> +struct checmate_task_free_ctx {
> + struct task_struct *task;
> +};
> +
> +struct checmate_socket_connect_ctx {
> + struct socket *sock;
> + struct sockaddr *address;
> + int addrlen;
> +};
>
> ... or are you using bpf_probe_read() in some way to walk 'current' to retrieve
> a namespace from there somehow? Like via nsproxy? But how do you make sense of
> this for defining a per-container policy?
In my prototype code, I'm using the uts namespace hostname, and I'm extracting
it via a bpf_probe_read walk (sketched after the list below). You're right,
that's less than awesome. In the longer term, I'd hope we'd be able to add a
helper like bpf_current_in_cgroup (a la bpf_skb_in_cgroup). The idea is that
we'd add enough helpers to avoid this kind of walking. I can submit some more
example BPF programs if that'd help. Off the top of my head:

* current_in_cgroup
* introduce a struct pid map
* introduce helpers to inspect the common datatypes passed to the hooks -- if
  you look at something like the net hooks, there aren't actually that many
  datatypes being passed around
* introduce an example top-level program that maps cgroup -> tail_call into
  other programs
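
For reference, the probe_read walk I'm doing today looks roughly like the
sketch below. It's a sketch rather than the exact prototype code: it assumes
the new bpf_get_current_task() helper, and the struct offsets are whatever the
running kernel happens to have, which is the stable-ABI concern you raise
below.

#include <uapi/linux/bpf.h>
#include <linux/sched.h>
#include <linux/nsproxy.h>
#include <linux/utsname.h>

static int (*bpf_probe_read)(void *dst, int size, void *src) =
        (void *) BPF_FUNC_probe_read;
static unsigned long long (*bpf_get_current_task)(void) =
        (void *) BPF_FUNC_get_current_task;

/* Chase current->nsproxy->uts_ns->name.nodename with probe reads. Every
 * step depends on unstable struct layouts. */
static __always_inline int current_uts_hostname(char *buf, int len)
{
        struct task_struct *task;
        struct nsproxy *nsproxy;
        struct uts_namespace *uts;

        task = (struct task_struct *) bpf_get_current_task();
        bpf_probe_read(&nsproxy, sizeof(nsproxy), &task->nsproxy);
        bpf_probe_read(&uts, sizeof(uts), &nsproxy->uts_ns);
        return bpf_probe_read(buf, len, uts->name.nodename);
}

A policy program then compares the hostname against whatever it cares about,
which is clunky -- hence the wishlist above.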

>
> Do you see a way where we don't need to define so many different ctx each time?
>
> My other concern from a security PoV is that when using things like bpf_probe_read()
> we're dependent on kernel structs, and there's a risk that when people migrate such
> policies, expectations break because the underlying structs have changed. I see
> you've addressed that in patch 4 by placing a small stone in the way; yeah, that
> kinda works. It's mostly a reminder that this is not a stable ABI.
>
> Thanks,
> Daniel