Re: [PATCH RFC net-next 03/14] bpf: introduce syscall(BPF, ...) and BPF maps

From: Andy Lutomirski
Date: Fri Jul 04 2014 - 11:17:53 EST


On Wed, Jul 2, 2014 at 7:29 PM, Alexei Starovoitov <ast@xxxxxxxxxxxx> wrote:
> On Wed, Jul 2, 2014 at 6:43 PM, Andy Lutomirski <luto@xxxxxxxxxxxxxx> wrote:
>> On Tue, Jul 1, 2014 at 10:33 PM, Alexei Starovoitov <ast@xxxxxxxxxxxx> wrote:
>>> I want to avoid string names, since they will force new 'strtab', 'symtab'
>>> sections in the programs/maps and will uglify the user interface quite a bit.
>>
>> To be fair, you really need to imitate ELF here. A very simple
>> relocation-like table should do the trick.
>
> simple.. right :) I do see the amount of struggle you have with
> binutils and vdso.
> I really don't want to add relocation unless this is last resort.
> Especially since it can be solved without it.
> I don't think I explained it enough in my last emailâ trying again:
>
>>> Back in september one loadable unit was: one eBPF program + set of maps,
>>> but tracing requirements forced a change, since multiple programs need
>>> to access the same map and maps may need to be pre-populated before
>>> the programs start executing, so I've split maps and programs into mostly
>>> independent entities, but programs still need to think of maps as local:
>>> For example I want to do a skb leak check 'tracing filter':
>>> - attach this program to kretprobe of __alloc_skb():
>>> u64 key = (u64) skb;
>>> u64 value = bpf_get_time();
>>> bpf_update_map_elem(1/*const_map_id*/, &key, &value);
>>> - attach this program to consume_skb and kfree_skb tracepoints:
>>> u64 key = (u64) skb;
>>> bpf_delete_map_elem(1/*const_map_id*/, &key);
>>> - and have user space do:
>>> prior to loading:
>>> bpf_create_map(1/*map_id*/, 8/*key_size*/, 8/*value*/, 1M /*max_entries*/)
>>> and then periodically iterate the map to see whether any skb stayed
>>> in the map for too long.
>>>
>>> Programs need to be written with hard coded map_ids otherwise usability
>>> suffers, so I did global 32-bit id in this RFC
>>>, but this indeed doesn't work
>>
>> Really? That will mean that you have to edit the source of your
>> filter program if the small integer map number you chose conflicts
>> with another program. That sounds unpleasant.
>
> unpleasant. exactly. that's why I'm proposing per-process local map-id,
> so that programs don't need to be edited.
>
>>> for unprivileged chrome browser unless programs are previously loaded
>>> by root and chrome only does attach to seccomp.
>>>
>>> So here is the non-root bpf syscall interface I'm thinking about:
>>>
>>> ufd = bpf_create_map(map_id, key_size, value_size, max_entries);
>>>
>>> it will create a global map in the system which will be accessible
>>> in this process via 'ufd'. Internally this 'ufd' will be assigned global map_id
>>> and process-local map_id that was passed as a 1st argument.
>>> To do update/lookup the process will use bpf_map_xxx_elem(ufd,â)
>>>
>>
>> Erk. Unprivileged programs shouldn't be able to allocate global ids
>> of their choosing, especially if privileged programs can also do it.
>> Otherwise unprivileged programs can force a collision and possibly
>> steal information.
>
> of course. that's not what said.
>
>>> Then to load eBPF program the process will do:
>>> ufd = bpf_prog_load(prog_type, ebpf_insn_array, license)
>>> and instructions will be referring to maps via local map_id that
>>> was hard coded as part of the program.
>>
>> I think relocations would be must prettier than per-process map id tables.
>
> I think per process map_id are much cleaner.
>
> non-root API:
>
> ufd = bpf_create_map(local_map_id,â )
> bpf_map_update/delete/lookup_elem(ufd,â)
> ufd = bpf_prog_load(insns)
> close(ufd)
>
> root only API:
>
> global_id = bpf_get_id(ufd) // returns either map or prog global id
> bpf_map_delete(global_map_id)
> bpf_prog_unload(global_prog_id)
>
> Details:
>
> ufd = bpf_create_map(local_map_id, ...);
>
> local_map_id - process local map_id
> (this id is used to access maps from eBPF program loaded by this process)
>
> ufd - process local file descriptor
> (used to update/lookup maps from this process)
>
> global_map_id = bpf_get_id(ufd)
> this is root only call to get global_ids and pass them to global events
> like tracing.
>
> global ids will only be seen by root. There is no way for root or non-root
> to influence id ranges.
>
>>> Beyond the normal create_map, update/lookup/delete, load_prog
>>> operations (that are accessible to both root and non-root), the root user
>>> gains one more operations: bpf_get_global_id(ufd) that returns
>>> global map_id or prog_id. This id can be attached to global events
>>> like tracing. Non-root users lose ability to do delete_map and
>>> unload_prog (they do close(ufd) instead), so this ops are for root
>>> only and operate on global ids.
>>> This is the cleanest way I could think of to combine non-root
>>> security, per-process id and global id all in one API. Thoughts?
>>
>> I think I'm okay with this part, although an interface to get a map fd
>> given some reference to the program (in sysfs) that uses it would also
>> work and maybe be more straightforward.
>
> If you meant debugfs, then yes. I'm planning to add a way for root
> to see all loaded programs and maps (similar to /proc as lsmod does),
> and then do bpf_map_delete/bpf_prog_unload (similar to rmmod)
>
> setsockopt and seccomp will be non-root and programs will go
> through additional dont_leak_pointers check in verifier.
> tracing/dtrace will be for root only, since they would need to attach
> to global events.
>
> I think it will be cleaner once I finish fd conversion as a patch.

OK

FWIW, per-process local id maps sound almost equivalent to relocations
-- the latter could be as simple as an extra nlattr giving a list of
pairs of (per-eBPF-program id, fd).

My current binutils mess is mainly just because I'm trying to do
something weird with an old, old file format that needs to support
lots of legacy tools. You won't have that problem.

--Andy
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/