Re: [PATCH RFC net-next 03/14] bpf: introduce syscall(BPF, ...) and BPF maps

From: Andy Lutomirski
Date: Wed Jul 02 2014 - 21:44:15 EST


On Tue, Jul 1, 2014 at 10:33 PM, Alexei Starovoitov <ast@xxxxxxxxxxxx> wrote:
> On Tue, Jul 1, 2014 at 8:11 AM, Andy Lutomirski <luto@xxxxxxxxxxxxxx> wrote:
>> On Mon, Jun 30, 2014 at 10:47 PM, Alexei Starovoitov <ast@xxxxxxxxxxxx> wrote:
>>> On Mon, Jun 30, 2014 at 3:09 PM, Andy Lutomirski <luto@xxxxxxxxxxxxxx> wrote:
>>>> On Sat, Jun 28, 2014 at 11:36 PM, Alexei Starovoitov <ast@xxxxxxxxxxxx> wrote:
>>>>> On Sat, Jun 28, 2014 at 6:52 PM, Andy Lutomirski <luto@xxxxxxxxxxxxxx> wrote:
>>>>>> On Sat, Jun 28, 2014 at 1:49 PM, Alexei Starovoitov <ast@xxxxxxxxxxxx> wrote:
>>>>>>>
>>>>>>> Sorry I don't like 'fd' direction at all.
>>>>>>> 1. it will make the whole thing very socket specific and 'net' dependent.
>>>>>>> but the goal here is to be able to use eBPF for tracing in embedded
>>>>>>> setups. So it's gotta be net independent.
>>>>>>> 2. sockets are already overloaded with all sorts of stuff. Adding more
>>>>>>> types of sockets will complicate it a lot.
>>>>>>> 3. and most important. read/write operations on sockets are not
>>>>>>> done every nanosecond, whereas lookup operations on bpf maps
>>>>>>> are done every dozen instructions, so we cannot have any overhead
>>>>>>> when accessing maps.
>>>>>>> In other words the verifier is done as static analyzer. I moved all
>>>>>>> the complexity to verify time, so at run-time the programs are as
>>>>>>> fast as possible. I'm strongly against run-time checks in critical path,
>>>>>>> since they kill performance and make the whole approach a lot less usable.
>>>>>>
>>>>>> I may have described my suggestion poorly. I'm suggesting that all of
>>>>>> these global ids be replaced *for userspace's benefit* with fds. That
>>>>>> is, a map would have an associated struct inode, and, when you load an
>>>>>> eBPF program, you'd pass fds into the kernel instead of global ids.
>>>>>> The kernel would still compile the eBPF program to use the global ids,
>>>>>> though.
>>>>>
>>>>> Hmm. If I understood you correctly, you're suggesting to do it similar
>>>>> to ipc/mqueue, shmem, sockets do. By registering and mounting
>>>>> a file system and providing all superblock and inode hooksâ and
>>>>> probably have its own namespace typeâ hmmâ may be. That's
>>>>> quite a bit of work to put lightly. As I said in the other email the first
>>>>> step is root only and all these complexity just not worth doing
>>>>> at this stage.
>>>>
>>>> The downside of not doing it right away is that it's harder to
>>>> retrofit in without breaking early users.
>>>>
>>>> You might be able to get away with using anon_inodes. That will
>>>
>>> Spent quite a bit of time playing with anon_inode_getfd(). The model
>>> works ok for seccomp, but doesn't seem to work for tracing,
>>> since tracepoints are global. Say, syscall(bpf, load_prog) returns
>>> a process-local fd. This 'fd' as a string can be written to
>>> debugfs/tracing/events/.../filter which will increment a refcnt of a global
>>> ebpf_program structure and will keep using it. When process exits it will
>>> close all fds which in case of ebpf_prog_fd should be a nop, since
>>> the program is still attached to a global event. Now we have a
>>> program and maps that still alive and dangling, since tracepoint events
>>> keep coming, but no new process can access it. Here we just lost all
>>> benefits of making it 'fd' based. Theoretically we can extend tracing to
>>> be fd-based too and tracepoints will auto-detach upon process exit,
>>> but that's not going to work for all other global events. Like networking
>>> components (bridge, ovs, â) are global and they won't be adding
>>> fd-based interfaces.
>>> I'm still thinking about it, but it looks like that any process-local
>>> ebpf_prog_id scheme is not going to work for global events. Thoughts?
>>
>> Hmm. Maybe these things do need global ids for tracing, or at least
>> there need to be some way to stash them somewhere and find them again.
>> I suppose that debugfs could have symlinks to them, but I don't know
>> how hard that would be to implement or how awkward it would be to use.
>>
>> I imagine there's some awkwardness regardless. For tracing, if I
>> create map 75 and eBPF program 492 that uses map 75, then I still need
>> to remember that map 75 is the map I want (or I need to parse the eBPF
>> program later on).
>>
>> How do you imagine the userspace code working? Maybe it would make
>> sense to add some nlattrs for eBPF programs to map between referenced
>> objects and nicknames for them. Then user code could look at
>> /sys/kernel/debug/whatever/nickname_of_map to resolve the map id or
>> even just open it directly.
>
> I want to avoid string names, since they will force new 'strtab', 'symtab'
> sections in the programs/maps and will uglify the user interface quite a bit.

To be fair, you really need to imitate ELF here. A very simple
relocation-like table should do the trick.

>
> Back in september one loadable unit was: one eBPF program + set of maps,
> but tracing requirements forced a change, since multiple programs need
> to access the same map and maps may need to be pre-populated before
> the programs start executing, so I've split maps and programs into mostly
> independent entities, but programs still need to think of maps as local:
> For example I want to do a skb leak check 'tracing filter':
> - attach this program to kretprobe of __alloc_skb():
> u64 key = (u64) skb;
> u64 value = bpf_get_time();
> bpf_update_map_elem(1/*const_map_id*/, &key, &value);
> - attach this program to consume_skb and kfree_skb tracepoints:
> u64 key = (u64) skb;
> bpf_delete_map_elem(1/*const_map_id*/, &key);
> - and have user space do:
> prior to loading:
> bpf_create_map(1/*map_id*/, 8/*key_size*/, 8/*value*/, 1M /*max_entries*/)
> and then periodically iterate the map to see whether any skb stayed
> in the map for too long.
>
> Programs need to be written with hard coded map_ids otherwise usability
> suffers, so I did global 32-bit id in this RFC
>, but this indeed doesn't work

Really? That will mean that you have to edit the source of your
filter program if the small integer map number you chose conflicts
with another program. That sounds unpleasant.

> for unprivileged chrome browser unless programs are previously loaded
> by root and chrome only does attach to seccomp.
>
> So here is the non-root bpf syscall interface I'm thinking about:
>
> ufd = bpf_create_map(map_id, key_size, value_size, max_entries);
>
> it will create a global map in the system which will be accessible
> in this process via 'ufd'. Internally this 'ufd' will be assigned global map_id
> and process-local map_id that was passed as a 1st argument.
> To do update/lookup the process will use bpf_map_xxx_elem(ufd,â)
>

Erk. Unprivileged programs shouldn't be able to allocate global ids
of their choosing, especially if privileged programs can also do it.
Otherwise unprivileged programs can force a collision and possibly
steal information.

> Then to load eBPF program the process will do:
> ufd = bpf_prog_load(prog_type, ebpf_insn_array, license)
> and instructions will be referring to maps via local map_id that
> was hard coded as part of the program.

I think relocations would be must prettier than per-process map id tables.

>
> Beyond the normal create_map, update/lookup/delete, load_prog
> operations (that are accessible to both root and non-root), the root user
> gains one more operations: bpf_get_global_id(ufd) that returns
> global map_id or prog_id. This id can be attached to global events
> like tracing. Non-root users lose ability to do delete_map and
> unload_prog (they do close(ufd) instead), so this ops are for root
> only and operate on global ids.
> This is the cleanest way I could think of to combine non-root
> security, per-process id and global id all in one API. Thoughts?

I think I'm okay with this part, although an interface to get a map fd
given some reference to the program (in sysfs) that uses it would also
work and maybe be more straightforward.

--Andy
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/