Re: [RFC PATCH bpf-next 00/13] bpf: Introduce BPF namespace

From: Toke Høiland-Jørgensen
Date: Mon Mar 27 2023 - 16:51:22 EST


Yafang Shao <laoar.shao@xxxxxxxxx> writes:

> On Sun, Mar 26, 2023 at 6:49 PM Toke Høiland-Jørgensen <toke@xxxxxxxxxx> wrote:
>>
>> Yafang Shao <laoar.shao@xxxxxxxxx> writes:
>>
>> > Currently only CAP_SYS_ADMIN can iterate BPF object IDs and convert IDs
>> > to FDs, that's intended for BPF's security model[1]. Not only does it
>> > prevent non-privilidged users from getting other users' bpf program, but
>> > also it prevents the user from iterating his own bpf objects.
>> >
>> > In container environment, some users want to run bpf programs in their
>> > containers. These users can run their bpf programs under CAP_BPF and
>> > some other specific CAPs, but they can't inspect their bpf programs in a
>> > generic way. For example, the bpftool can't be used as it requires
>> > CAP_SYS_ADMIN. That is very inconvenient.
>> >
>> > Without CAP_SYS_ADMIN, the only way to get the information of a bpf object
>> > which is not created by the process itself is with SCM_RIGHTS, that
>> > requires each processes which created bpf object has to implement a unix
>> > domain socket to share the fd of a bpf object between different
>> > processes, that is really trivial and troublesome.
>> >
>> > Hence we need a better mechanism to get bpf object info without
>> > CAP_SYS_ADMIN.
>> >
>> > BPF namespace is introduced in this patchset with an attempt to remove
>> > the CAP_SYS_ADMIN requirement. The user can create bpf map, prog and
>> > link in a specific bpf namespace, then these bpf objects will not be
>> > visible to the users in a different bpf namespace. But these bpf
>> > objects are visible to its parent bpf namespace, so the sys admin can
>> > still iterate and inspect them.
>> >
>> > BPF namespace is similar to PID namespace, and the bpf objects are
>> > similar to tasks, so BPF namespace is very easy to understand. These
>> > patchset only implements BPF namespace for bpf map, prog and link. In the
>> > future we may extend it to other bpf objects like btf, bpffs and etc.
>>
>> May? I think we should cover all of the existing BPF objects from the
>> beginning here, or we may miss important interactions that will
>> invalidate the whole idea.
>
> This patchset is intended to address iterating bpf IDs and converting
> IDs to FDs. To be more specific, it covers
> BPF_{PROG,MAP,LINK}_GET_NEXT_ID and BPF_{PROG,MAP,LINK}_GET_FD_BY_ID.
> It should also include BPF_BTF_GET_NEXT_ID and BPF_BTF_GET_FD_BY_ID,
> but I don't implement it because I find we can do more wrt BTF, for
> example, if we can expose a small amount of BTFs in the vmlinux to
> non-root bpf namespace.
> But, yes, I should implement BTF ID in this patchset.

Right, as you can see by my comment on that patch, not including the btf
id is a tad confusing, so yeah, better include that.

>> In particular, I'm a little worried about the
>> interaction between namespaces and bpffs; what happens if you're in a
>> bpf namespace and you try to read a BPF object from a bpffs that belongs
>> to a different namespace? Does the operation fail? Is the object hidden
>> entirely? Something else?
>>
>
> bpffs is a different topic and it can be implemented in later patchsets.
> bpffs has its own specific problem even without the bpf namespace.
> 1. The user can always get the information of a bpf object through its
> corresponding pinned file.
> In our practice, different container users have different bpffs, and
> we allow the container user to bind-mount its bpffs only, so others'
> bpffs are invisible.
> To make it better with the bpf namespace, I think we can fail the
> operation if the pinned file doesn't belong to its bpf namespace. That
> said, we will add pinned bpf files into the bpf namespace in the next
> step.
>
> 2. The user can always iterate bpf objects through progs.debug and maps.debug
> progs.debug and maps.debug are debugging purposes only. So I think we
> can handle it later.

Well, I disagree. Working out these issues with bpffs is an important
aspect to get a consistent API, and handwaving it away risks merging
something that will turn out to not be workable further down the line at
which point we can't change it.

-Toke