Re: [PATCH bpf-next 3/5] libbpf: add low level TC-BPF API

From: Andrii Nakryiko
Date: Mon Apr 05 2021 - 13:21:24 EST


On Fri, Apr 2, 2021 at 8:27 AM Kumar Kartikeya Dwivedi <memxor@xxxxxxxxx> wrote:
>
> On Fri, Apr 02, 2021 at 05:49:29AM IST, Daniel Borkmann wrote:
> > On 3/31/21 11:44 AM, Kumar Kartikeya Dwivedi wrote:
> > > On Wed, Mar 31, 2021 at 02:55:47AM IST, Daniel Borkmann wrote:
> > > > Do we even need the _block variant? I would rather prefer to take the chance
> > > > and make it as simple as possible, and only iff really needed extend with
> > > > other APIs, for example:
> > >
> > > The block variant can be dropped, I'll use the TC_BLOCK/TC_DEV alternative which
> > > sets parent_id/ifindex properly.
> > >
> > > > bpf_tc_attach(prog_fd, ifindex, {INGRESS,EGRESS});
> > > >
> > > > Internally, this will create the sch_clsact qdisc & cls_bpf filter instance
> > > > iff not present yet, and attach to a default prio 1 handle 1, and _always_ in
> > > > direct-action mode. This is /as simple as it gets/ and we don't need to bother
> > > > users with more complex tc/cls_bpf internals unless desired. For example,
> > > > extended APIs could add prio/parent so that multi-prog can be attached to a
> > > > single cls_bpf instance, but even that could be a second step, imho.
> > >
> > > I am not opposed to clsact qdisc setup if INGRESS/EGRESS is supplied (not sure
> > > how others feel about it).
> >
> > What speaks against it? It would be 100% clear from API side where the prog is
> > being attached. Same as with tc cmdline where you specify 'ingress'/'egress'.
> >
>
> Ok, I will add the qdisc setup in the next revision.
>
> > > We could make direct_action mode default, and similarly choose prio
> >
> > To be honest, I wouldn't even support a mode from the lib/API side where direct_action
> > is not set. It should always be forced to true. Everything else is rather broken
> > setup-wise, imho, since it won't scale. We added direct_action a bit later to the
> > kernel than original cls_bpf, but if I would do it again today, I'd make it the
> > only available option. I don't see a reasonable use-case where you have it to false.
> >
>
> I'm all for doing that, but in some sense that also speaks against SCHED_ACT
> support. Currently, you can load SCHED_ACT programs using this series, but not
> really bind them to classifier. I left that option open to a future patch, it
> would just reuse the existing tc_act_add_action helper (also why I kept it in
> its own function). Maybe we need to reconsider that, if direct action is the
> only recommended way going forward (to discourage people from using SCHED_ACT),
> or just add opts to do all the setup in low level API, instead of leaving it
> incomplete.
>
> > > as 1 by default instead of letting the kernel do it. Then you can just pass in
> > > NULL for bpf_tc_cls_opts and be close to what you're proposing. For protocol we
> > > can choose ETH_P_ALL by default too if the user doesn't set it.
> >
> > Same here with ETH_P_ALL, I'm not sure anyone uses anything other than ETH_P_ALL,
> > so yes, that should be default.
>
> Ack.
>
> >
> > > With these modifications, the equivalent would look like
> > > bpf_tc_cls_attach(prog_fd, TC_DEV(ifindex, INGRESS), NULL, &id);
> >
> > Few things compared to bpf_tc_attach(prog_fd, ifindex, {INGRESS,EGRESS}):
> >
> > 1) nit, but why even 'cls' in the name. I think we shouldn't expose such old-days
> > tc semantics to a user. Just bpf_tc_attach() is cleaner/simpler to understand.
>
> Since it would make it clear this is for SCHED_CLS progs, likewise bpf_tc_act_*
> is for SCHED_ACT progs. Not opposed to changing the name.
>
> > 2) What's the 'TC_DEV(ifindex, INGRESS)' macro doing exactly? Looks unnecessary,
> > why not regular args to the API?
>
> It is very easy to support BLOCK (I know it's not really popular here, but I
> think if supporting it just requires adding a macro, then we can go ahead). So
> the user can use TC_BLOCK(block_idx) instead of remembering ifindex is to be set
> to TCM_IFINDEX_MAGIC_BLOCK and parent_id to actual block index. It will just
> expand to:
>
> #define TC_BLOCK(block_idx) TCM_IFINDEX_MAGIC_BLOCK, (block_idx)
>
> TC_DEV macro can be dropped, since user can directly pass ifindex and parent_id.

if _block variant is just a special ifindex value, then it should be
fine for users to know such a detail (we can leave a comment
mentioning this specifically), especially given it's not a very
popular thing. Almost doubling amount of APIs just for this doesn't
make much sense, IMO.

>
> > 3) Exposed bpf_tc_attach() API could internally call a bpf_tc_attach_opts() API
> > with preset defaults, and the latter could have all the custom bits if the user
> > needs to go beyond the simple API, so from your bpf_tc_cls_attach() I'd also
> > drop the NULL.
>
> Ok, this is probably better (but maybe we can do this for the high-level
> bpf_program__attach that returns a bpf_link * instead of introducing yet
> another function).


If we know that we need variant with options, I'd vote for having just
one bpf_tc_attach() API which always takes options. Passing NULL for
opts is simple, no need for two APIs, I think.

>
> > 4) For the simple API I'd likely also drop the id (you could have a query API if
> > needed).
> >
>
> This would be fine, because it's not a fast path or anything, but right now we
> return the id using the netlink response, otherwise for query we have to open
> the socket, prepare the msg, send and recv again. So it's a minor optimization.
>
> However, there's one other problem. In an earlier version of this series, I
> didn't keep the id/index out parameters (to act as handle to the newly attached
> filter/action). This lead to problems on query. Suppose a user doesn't properly
> fill the opts during query (e.g. in case of filters). This means the netlink
> dump includes all filters matching filled in attributes. If the prog_id for all
> of these is same (e.g. all have same bpf classifier prog attached to them), it
> becomes impossible to determine which one is the filter user asked for. It is
> not possible to enforce filling in all kinds of attributes since some can be
> left out and assigned by default in the kernel (priority, chain_index etc.). So
> returning the newly created filter's id turned out to be the best option. This
> is also used to stash filter related information in bpf_link to properly release
> it later.
>
> The same problem happens with actions, where we look up using the prog_id, we
> multiple actions with different index can match on same prog_id. It is not
> possible to determine which index corresponds to last loaded action.
>
> So unless there's a better idea on how to deal with this, a query API won't work
> for the case where same bpf prog is attached more than once. Returning the
> id/index during attach seemed better than all other options we considered.

Which parts of that id struct is the data that caller might not know
or can't know? Is it handle and chain_index? Or just one of them?
Or?... If there is something that has to be returned back, I'd keep
only that, instead of returning 6+ fields, most of which user should
already know.

>
> --
> Kartikeya