Re: [PATCH RFC net-next 03/14] bpf: introduce syscall(BPF, ...) and BPF maps

From: Alexei Starovoitov
Date: Sat Jun 28 2014 - 16:49:39 EST


On Sat, Jun 28, 2014 at 8:34 AM, Andy Lutomirski <luto@xxxxxxxxxxxxxx> wrote:
> On Fri, Jun 27, 2014 at 11:43 PM, Alexei Starovoitov <ast@xxxxxxxxxxxx> wrote:
>> On Fri, Jun 27, 2014 at 11:25 PM, Andy Lutomirski <luto@xxxxxxxxxxxxxx> wrote:
>>> On Fri, Jun 27, 2014 at 10:55 PM, Alexei Starovoitov <ast@xxxxxxxxxxxx> wrote:
>>>> On Fri, Jun 27, 2014 at 5:16 PM, Andy Lutomirski <luto@xxxxxxxxxxxxxx> wrote:
>>>>> On Fri, Jun 27, 2014 at 5:05 PM, Alexei Starovoitov <ast@xxxxxxxxxxxx> wrote:
>>>>>> BPF syscall is a demux for different BPF related commands.
>>>>>>
>>>>>> 'maps' is a generic storage of different types for sharing data between kernel
>>>>>> and userspace.
>>>>>>
>>>>>> The maps can be created/deleted from user space via BPF syscall:
>>>>>> - create a map with given id, type and attributes
>>>>>> map_id = bpf_map_create(int map_id, map_type, struct nlattr *attr, int len)
>>>>>> returns positive map id or negative error
>>>>>>
>>>>>> - delete map with given map id
>>>>>> err = bpf_map_delete(int map_id)
>>>>>> returns zero or negative error
>>>>>
>>>>> What's the scope of "id"? How is it secured?
>>>>
>>>> the map and program id space is global and it's cap_sys_admin only.
>>>> There is no pressing need to do it with per-user limits.
>>>> So the whole thing is root only for now.
>>>>
>>>
>>> Hmm. This may be unpleasant if you ever want to support non-root or
>>> namespaced operation.
>>
>> I think it will be easy to extend it per namespace when we lift
>> root-only restriction. It will be seamless without user api changes.
>>
>
> It might be seamless, but I'm not sure it'll be very useful. See below.
>
>>> How hard would it be to give these things fds?
>>
>> you mean programs/maps auto-terminate when creator process
>> exits? I thought about it and it's appealing at first glance, but
>> doesn't fit the model of existing tracepoint events which are global.
>> The programs attached to events need to live without 'daemon'
>> hanging around. Therefore I picked 'kernel module'- like method.
>
> Here are some things I'd like to be able to do:
>
> - Load an eBPF program and use it as a seccomp filter.
>
> - Create a read-only map and reference it from a seccomp filter.
>
> - Create a data structure that a seccomp filter can write but that
> the filtered process can only read.
>
> - Create a data structure that a seccomp filter can read but that
> some other trusted process can write.
>
> - Create a network filter of some sort and give permission to
> manipulate a list of ports to an otherwise untrusted process.
>
> The first four of these shouldn't require privilege.
>
> All of this fits nicely into a model where all of the eBPF objects
> (filters and data structures) are represented by fds. Read access to
> the fd lets you read (or execute eBPF programs). Write access to the
> fd lets you write. You can send them around naturally using
> SCM_RIGHTS, and you can create deprivileged versions by reopening the
> objects with less access.
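
For reference, the fd passing mentioned above (SCM_RIGHTS over an AF_UNIX
socket) can be sketched roughly as below. This is only a minimal userspace
illustration of the generic mechanism; the helper names send_fd() and
recv_fd() are made up for the example, and nothing here is eBPF specific or
part of the proposed patch set.

```c
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>

/* Hypothetical helper: send one fd over a connected AF_UNIX socket.
 * Returns 0 on success, -1 on error. */
int send_fd(int sock, int fd)
{
	char dummy = 'x';	/* one byte of ordinary data carries the control message */
	struct iovec iov = { .iov_base = &dummy, .iov_len = 1 };
	union {
		char buf[CMSG_SPACE(sizeof(int))];
		struct cmsghdr align;	/* forces correct alignment of buf */
	} u;
	struct msghdr msg;
	struct cmsghdr *cmsg;

	memset(&msg, 0, sizeof(msg));
	memset(&u, 0, sizeof(u));
	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;
	msg.msg_control = u.buf;
	msg.msg_controllen = sizeof(u.buf);

	cmsg = CMSG_FIRSTHDR(&msg);
	cmsg->cmsg_level = SOL_SOCKET;
	cmsg->cmsg_type = SCM_RIGHTS;
	cmsg->cmsg_len = CMSG_LEN(sizeof(int));
	memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

	return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/* Hypothetical helper: receive one fd sent by send_fd().
 * Returns the new fd in this process, or -1 on error. */
int recv_fd(int sock)
{
	char dummy;
	struct iovec iov = { .iov_base = &dummy, .iov_len = 1 };
	union {
		char buf[CMSG_SPACE(sizeof(int))];
		struct cmsghdr align;
	} u;
	struct msghdr msg;
	struct cmsghdr *cmsg;
	int fd;

	memset(&msg, 0, sizeof(msg));
	memset(&u, 0, sizeof(u));
	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;
	msg.msg_control = u.buf;
	msg.msg_controllen = sizeof(u.buf);

	if (recvmsg(sock, &msg, 0) != 1)
		return -1;

	cmsg = CMSG_FIRSTHDR(&msg);
	if (!cmsg || cmsg->cmsg_level != SOL_SOCKET ||
	    cmsg->cmsg_type != SCM_RIGHTS ||
	    cmsg->cmsg_len != CMSG_LEN(sizeof(int)))
		return -1;

	memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
	return fd;
}
```

The receiver gets its own descriptor referring to the same open file, with
the access mode the sender's descriptor had; that is the property the
proposal above relies on.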

Sorry, I don't like the 'fd' direction at all.
1. It will make the whole thing very socket-specific and 'net'-dependent,
but the goal here is to be able to use eBPF for tracing in embedded
setups. So it has to be net-independent.
2. Sockets are already overloaded with all sorts of stuff. Adding more
types of sockets will complicate it a lot.
3. Most important: read/write operations on sockets are not
done every nanosecond, whereas lookup operations on bpf maps
are done every dozen instructions, so we cannot have any overhead
when accessing maps.
In other words, the verifier is designed as a static analyzer. I moved all
the complexity to verify time, so at run-time the programs are as
fast as possible. I'm strongly against run-time checks in the critical path,
since they kill performance and make the whole approach a lot less usable.
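
A toy illustration of that split, with checks done once at load time and none
in the hot path, might look like the sketch below. It is deliberately not the
real verifier: struct toy_insn, NSLOTS, toy_verify() and toy_run() are
hypothetical names invented for this example, and a fixed-size table stands
in for a map.

```c
#include <stddef.h>
#include <stdint.h>

#define NSLOTS 8	/* hypothetical fixed-size table standing in for a map */

/* Hypothetical one-field "instruction": increment the given table slot. */
struct toy_insn {
	uint32_t slot;
};

/* Load-time pass, run once per program: reject the program if any
 * instruction could index outside the table. Returns 0 if accepted. */
int toy_verify(const struct toy_insn *prog, size_t len)
{
	for (size_t i = 0; i < len; i++)
		if (prog[i].slot >= NSLOTS)
			return -1;
	return 0;
}

/* Hot path: no bounds checks at all. This is only safe for programs
 * that toy_verify() has already accepted. */
void toy_run(const struct toy_insn *prog, size_t len, uint64_t *table)
{
	for (size_t i = 0; i < len; i++)
		table[prog[i].slot]++;
}
```

The cost of toy_verify() is paid once when the program is loaded, while
toy_run() may execute millions of times without paying for any check.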

What you want to achieve:
> - Load an eBPF program and use it as a seccomp filter.
> - Create a read-only map and reference it from a seccomp filter.
is very doable in the existing framework.
Note that I didn't do a seccomp+ebpf example only because you and Kees
are messing with this part of the code a lot and I didn't want to conflict.

> All of this *could* fit in using global ids, but we'd need to answer
> questions like "what namespace are they bound to" and "who has access
> to a given fd". I'd want to see that these questions *have* good
> answers before committing to this type of model. Keep in mind that,
> for seccomp in particular, granting access to a specific uid will be
> very limiting: part of the point of seccomp is to enable
> user-controlled finer-grained permissions than allowed by uids and
> gids.

Filters (bpf programs) are a low-level tool that shouldn't be aware
of gids/uids at all. Just like classic bpf doesn't care, eBPF programs
shouldn't care. Mixing the concept of uids/fds into the program is wrong.