Re: [PATCH net-next 3/4] bpf: add support for persistent maps/progs

From: Thomas Graf
Date: Wed Oct 21 2015 - 14:35:02 EST


On 10/21/15 at 05:17pm, Daniel Borkmann wrote:
> On 10/20/2015 08:56 PM, Eric W. Biederman wrote:
> ...
> >Just FYI: Using a device for this kind of interface is pretty
> >much a non-starter as that quickly gets you into situations where
> >things do not work in containers. If someone gets a version of device
> >namespaces past GregKH it might be up for discussion to use character
> >devices.
>
> Okay, you are referring to this discussion here:
>
> http://thread.gmane.org/gmane.linux.kernel.containers/26760
>
> What had been mentioned earlier in this thread was to have a namespace
> pass-through facility enforced by device cgroups we have in the kernel,
> which is one out of various means used to enforce policy today by
> deployment systems such as docker, for example. But more below.
>
> I think this all depends on the kind of expectations we have, where all
> this is going. In the original proposal, it was agreed to have the
> operation that creates a node as 'capable(CAP_SYS_ADMIN)'-only (in the
> way like most of the rest of eBPF is restricted), and based on the use
> case we distribute such objects to unprivileged applications. But I
> understand that it seems the trend lately to lift eBPF restrictions at
> some point anyway, and thus the CAP_SYS_ADMIN is suddenly irrelevant
> again. Fair enough.
>
> Don't get me wrong, I really don't mind if it will be some version of
> this fs patch or whatever architecture else we find consensus on, I
> think this discussion is merely trying to evaluate/discuss on what seems
> to be a good fit, also in terms of future requirements and integration.
>
> So far, during this discussion, it was proposed to modify the file system
> to a single-mount one and to stick this under /sys/kernel/bpf/. This
> will not have "real" namespace support either, but it was proposed to
> have a following structure:
>
> /sys/kernel/bpf/username/<optional_dirs_mkdir_by_user>/progX

This would probably work, since you would typically bind-mount the eBPF
map into the container with -v to give it a stable path:

docker run -v /sys/kernel/bpf/foo/maps/progX:/map proX
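
As a rough sketch of how a deployment tool might build that bind mount
from the proposed per-user layout: the helper names below are made up for
illustration, and the /sys/kernel/bpf root is only the location proposed
in this thread, not an existing interface.

```python
import posixpath

# Assumption: the single-mount location proposed in this thread.
BPF_ROOT = "/sys/kernel/bpf"


def host_path(user, *parts):
    """Build /sys/kernel/bpf/<user>/<optional dirs>/<object> (hypothetical layout)."""
    for part in (user, *parts):
        # Reject components that would escape the per-user directory.
        if not part or part.startswith("/") or part == "..":
            raise ValueError(f"invalid path component: {part!r}")
    return posixpath.join(BPF_ROOT, user, *parts)


def docker_run_args(user, parts, container_path, image):
    """Return the argv for 'docker run -v <host>:<container> <image>'."""
    src = host_path(user, *parts)
    return ["docker", "run", "-v", f"{src}:{container_path}", image]
```

Calling docker_run_args("foo", ["maps", "progX"], "/map", "proX") gives
the same command line as above, split into an argv list.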

> So, the file system will have kind of a user home-directory for each user
> to isolate through permissions, if I understood correctly.
>
> If we really want to go this route, then I think there are no big stones
> in the way for the other model either. It should look roughly drafted like
> the below.
>
> Together with device cgroups for containers, it would allow scenarios where
> you can have:
>
> * eBPF (map/prog) device pass-through so a map/prog could even be shared out
> from the initial namespace into individual ones/all (one could possibly
> extend such maps as read-only for these consumers).
> * eBPF device creation for unprivileged users with permissions being set
> accordingly (as in fs case).
> * Since cgroup controller can also do wildcards on major/minors, we could
> make that further fine-grained.
> * eBPF device creation can also be enforced by the cgroup controller to be
> entirely disallowed for a specific container.
>
> (An admin can determine the dynamically created major f.e. under /proc/devices.)
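
To make the /proc/devices and wildcard remarks a bit more concrete, here
is a minimal sketch: look up a dynamically allocated char major by driver
name, then format a device cgroup (v1) devices.allow entry with a wildcard
minor. The driver name "bpf" is an assumption, since the name the patch
registers is not given in this mail, and the function names are
illustrative only.

```python
def find_char_major(devices_text, name):
    """Return the major of a character device listed in /proc/devices text."""
    section = None
    for line in devices_text.splitlines():
        stripped = line.strip()
        if stripped.endswith("devices:"):
            # "Character devices:" or "Block devices:" section header.
            section = stripped.split()[0].lower()
            continue
        if section != "character" or not stripped:
            continue
        fields = stripped.split(None, 1)
        if len(fields) == 2 and fields[0].isdigit() and fields[1] == name:
            return int(fields[0])
    return None


def devcg_allow_rule(major, minor="*", access="rw"):
    """Format a 'c <major>:<minor> <access>' entry for devices.allow."""
    if not access or set(access) - set("rwm"):
        raise ValueError("access must be a non-empty combination of r, w, m")
    return f"c {major}:{minor} {access}"


def read_proc_devices(path="/proc/devices"):
    """Read the live device table (Linux only)."""
    with open(path) as f:
        return f.read()
```

An admin would feed read_proc_devices() into find_char_major() and write
the resulting rule into the container's devices.allow file; writing it to
devices.deny instead would cover the "entirely disallowed" case.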

I've followed the discussion passively and my takeaway is that, frankly,
the differences are somewhat minor. Both architectures can scale to what
we need, and both will do the job. I'm slightly worried about exposing
uAPI as a FS; I think that didn't work too well for sysfs. It's pretty
much a define-the-format-once-and-never-touch-it-again kind of deal.