Re: [PATCH] bpf: devmap: allow for repeated redirect

From: Toke Høiland-Jørgensen
Date: Wed Sep 11 2024 - 10:14:47 EST


Florian Kauer <florian.kauer@xxxxxxxxxxxxx> writes:

> On 9/11/24 11:25, Toke Høiland-Jørgensen wrote:
>> Florian Kauer <florian.kauer@xxxxxxxxxxxxx> writes:
>>
>>> Hi Toke,
>>>
>>> On 9/10/24 10:34, Toke Høiland-Jørgensen wrote:
>>>> Florian Kauer <florian.kauer@xxxxxxxxxxxxx> writes:
>>>>
>>>>> Currently, returning XDP_REDIRECT from a xdp/devmap program
>>>>> is considered as invalid action and an exception is traced.
>>>>>
>>>>> While it might seem counterintuitive to redirect in a xdp/devmap
>>>>> program (why not just redirect to the correct interface in the
>>>>> first program?), we faced several use cases where supporting
>>>>> this would be very useful.
>>>>>
>>>>> Most importantly, they occur when the first redirect is used
>>>>> with the BPF_F_BROADCAST flag. Using this together with xdp/devmap
>>>>> programs, enables to perform different actions on clones of
>>>>> the same incoming frame. In that case, it is often useful
>>>>> to redirect either to a different CPU or device AFTER the cloning.
>>>>>
>>>>> For example:
>>>>> - Replicate the frame (for redundancy according to IEEE 802.1CB FRER)
>>>>> and then use the second redirect with a cpumap to select
>>>>> the path-specific egress queue.
>>>>>
>>>>> - Also, one of the paths might need an encapsulation that
>>>>> exceeds the MTU. So a second redirect can be used back
>>>>> to the ingress interface to send an ICMP FRAG_NEEDED packet.
>>>>>
>>>>> - For OAM purposes, you might want to send one frame with
>>>>> OAM information back, while the original frame in passed forward.
>>>>>
>>>>> To enable these use cases, add the XDP_REDIRECT case to
>>>>> dev_map_bpf_prog_run. Also, when this is called from inside
>>>>> xdp_do_flush, the redirect might add further entries to the
>>>>> flush lists that are currently processed. Therefore, loop inside
>>>>> xdp_do_flush until no more additional redirects were added.
>>>>>
>>>>> Signed-off-by: Florian Kauer <florian.kauer@xxxxxxxxxxxxx>
>>>>
>>>> This is an interesting use case! However, your implementation makes it
>>>> way to easy to end up in a situation that loops forever, so I think we
>>>> should add some protection against that. Some details below:
>>>>
>>>>> ---
>>>>> include/linux/bpf.h | 4 ++--
>>>>> include/trace/events/xdp.h | 10 ++++++----
>>>>> kernel/bpf/devmap.c | 37 +++++++++++++++++++++++++++--------
>>>>> net/core/filter.c | 48 +++++++++++++++++++++++++++-------------------
>>>>> 4 files changed, 65 insertions(+), 34 deletions(-)
>>>>>
>>>>> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
>>>>> index 3b94ec161e8c..1b57cbabf199 100644
>>>>> --- a/include/linux/bpf.h
>>>>> +++ b/include/linux/bpf.h
>>>>> @@ -2498,7 +2498,7 @@ struct sk_buff;
>>>>> struct bpf_dtab_netdev;
>>>>> struct bpf_cpu_map_entry;
>>>>>
>>>>> -void __dev_flush(struct list_head *flush_list);
>>>>> +void __dev_flush(struct list_head *flush_list, int *redirects);
>>>>> int dev_xdp_enqueue(struct net_device *dev, struct xdp_frame *xdpf,
>>>>> struct net_device *dev_rx);
>>>>> int dev_map_enqueue(struct bpf_dtab_netdev *dst, struct xdp_frame *xdpf,
>>>>> @@ -2740,7 +2740,7 @@ static inline struct bpf_token *bpf_token_get_from_fd(u32 ufd)
>>>>> return ERR_PTR(-EOPNOTSUPP);
>>>>> }
>>>>>
>>>>> -static inline void __dev_flush(struct list_head *flush_list)
>>>>> +static inline void __dev_flush(struct list_head *flush_list, int *redirects)
>>>>> {
>>>>> }
>>>>>
>>>>> diff --git a/include/trace/events/xdp.h b/include/trace/events/xdp.h
>>>>> index a7e5452b5d21..fba2c457e727 100644
>>>>> --- a/include/trace/events/xdp.h
>>>>> +++ b/include/trace/events/xdp.h
>>>>> @@ -269,9 +269,9 @@ TRACE_EVENT(xdp_devmap_xmit,
>>>>>
>>>>> TP_PROTO(const struct net_device *from_dev,
>>>>> const struct net_device *to_dev,
>>>>> - int sent, int drops, int err),
>>>>> + int sent, int drops, int redirects, int err),
>>>>>
>>>>> - TP_ARGS(from_dev, to_dev, sent, drops, err),
>>>>> + TP_ARGS(from_dev, to_dev, sent, drops, redirects, err),
>>>>>
>>>>> TP_STRUCT__entry(
>>>>> __field(int, from_ifindex)
>>>>> @@ -279,6 +279,7 @@ TRACE_EVENT(xdp_devmap_xmit,
>>>>> __field(int, to_ifindex)
>>>>> __field(int, drops)
>>>>> __field(int, sent)
>>>>> + __field(int, redirects)
>>>>> __field(int, err)
>>>>> ),
>>>>>
>>>>> @@ -288,16 +289,17 @@ TRACE_EVENT(xdp_devmap_xmit,
>>>>> __entry->to_ifindex = to_dev->ifindex;
>>>>> __entry->drops = drops;
>>>>> __entry->sent = sent;
>>>>> + __entry->redirects = redirects;
>>>>> __entry->err = err;
>>>>> ),
>>>>>
>>>>> TP_printk("ndo_xdp_xmit"
>>>>> " from_ifindex=%d to_ifindex=%d action=%s"
>>>>> - " sent=%d drops=%d"
>>>>> + " sent=%d drops=%d redirects=%d"
>>>>> " err=%d",
>>>>> __entry->from_ifindex, __entry->to_ifindex,
>>>>> __print_symbolic(__entry->act, __XDP_ACT_SYM_TAB),
>>>>> - __entry->sent, __entry->drops,
>>>>> + __entry->sent, __entry->drops, __entry->redirects,
>>>>> __entry->err)
>>>>> );
>>>>>
>>>>> diff --git a/kernel/bpf/devmap.c b/kernel/bpf/devmap.c
>>>>> index 7878be18e9d2..89bdec49ea40 100644
>>>>> --- a/kernel/bpf/devmap.c
>>>>> +++ b/kernel/bpf/devmap.c
>>>>> @@ -334,7 +334,8 @@ static int dev_map_hash_get_next_key(struct bpf_map *map, void *key,
>>>>> static int dev_map_bpf_prog_run(struct bpf_prog *xdp_prog,
>>>>> struct xdp_frame **frames, int n,
>>>>> struct net_device *tx_dev,
>>>>> - struct net_device *rx_dev)
>>>>> + struct net_device *rx_dev,
>>>>> + int *redirects)
>>>>> {
>>>>> struct xdp_txq_info txq = { .dev = tx_dev };
>>>>> struct xdp_rxq_info rxq = { .dev = rx_dev };
>>>>> @@ -359,6 +360,13 @@ static int dev_map_bpf_prog_run(struct bpf_prog *xdp_prog,
>>>>> else
>>>>> frames[nframes++] = xdpf;
>>>>> break;
>>>>> + case XDP_REDIRECT:
>>>>> + err = xdp_do_redirect(rx_dev, &xdp, xdp_prog);
>>>>> + if (unlikely(err))
>>>>> + xdp_return_frame_rx_napi(xdpf);
>>>>> + else
>>>>> + *redirects += 1;
>>>>> + break;
>>>>
>>>> It's a bit subtle, but dev_map_bpf_prog_run() also filters the list of
>>>> frames in the queue in-place (the frames[nframes++] = xdpf; line above),
>>>> which only works under the assumption that the array in bq->q is not
>>>> modified while this loop is being run. But now you're adding a call in
>>>> the middle that may result in the packet being put back on the same
>>>> queue in the middle, which means that this assumption no longer holds.
>>>>
>>>> So you need to clear the bq->q queue first for this to work.
>>>> Specifically, at the start of bq_xmit_all(), you'll need to first copy
>>>> all the packet pointer onto an on-stack array, then run the rest of the
>>>> function on that array. There's already an initial loop that goes
>>>> through all the frames, so you can just do it there.
>>>>
>>>> So the loop at the start of bq_xmit_all() goes from the current:
>>>>
>>>> for (i = 0; i < cnt; i++) {
>>>> struct xdp_frame *xdpf = bq->q[i];
>>>>
>>>> prefetch(xdpf);
>>>> }
>>>>
>>>>
>>>> to something like:
>>>>
>>>> struct xdp_frame *frames[DEV_MAP_BULK_SIZE];
>>>>
>>>> for (i = 0; i < cnt; i++) {
>>>> struct xdp_frame *xdpf = bq->q[i];
>>>>
>>>> prefetch(xdpf);
>>>> frames[i] = xdpf;
>>>> }
>>>>
>>>> bq->count = 0; /* bq is now empty, use the 'frames' and 'cnt'
>>>> stack variables for the rest of the function */
>>>>
>>>>
>>>>
>>>>> default:
>>>>> bpf_warn_invalid_xdp_action(NULL, xdp_prog, act);
>>>>> fallthrough;
>>>>> @@ -373,7 +381,7 @@ static int dev_map_bpf_prog_run(struct bpf_prog *xdp_prog,
>>>>> return nframes; /* sent frames count */
>>>>> }
>>>>>
>>>>> -static void bq_xmit_all(struct xdp_dev_bulk_queue *bq, u32 flags)
>>>>> +static void bq_xmit_all(struct xdp_dev_bulk_queue *bq, u32 flags, int *redirects)
>>>>> {
>>>>> struct net_device *dev = bq->dev;
>>>>> unsigned int cnt = bq->count;
>>>>> @@ -390,8 +398,10 @@ static void bq_xmit_all(struct xdp_dev_bulk_queue *bq, u32 flags)
>>>>> prefetch(xdpf);
>>>>> }
>>>>>
>>>>> + int new_redirects = 0;
>>>>> if (bq->xdp_prog) {
>>>>> - to_send = dev_map_bpf_prog_run(bq->xdp_prog, bq->q, cnt, dev, bq->dev_rx);
>>>>> + to_send = dev_map_bpf_prog_run(bq->xdp_prog, bq->q, cnt, dev, bq->dev_rx,
>>>>> + &new_redirects);
>>>>> if (!to_send)
>>>>> goto out;
>>>>> }
>>>>> @@ -413,19 +423,21 @@ static void bq_xmit_all(struct xdp_dev_bulk_queue *bq, u32 flags)
>>>>>
>>>>> out:
>>>>> bq->count = 0;
>>>>> - trace_xdp_devmap_xmit(bq->dev_rx, dev, sent, cnt - sent, err);
>>>>> + *redirects += new_redirects;
>>>>> + trace_xdp_devmap_xmit(bq->dev_rx, dev, sent, cnt - sent - new_redirects,
>>>>> + new_redirects, err);
>>>>> }
>>>>>
>>>>> /* __dev_flush is called from xdp_do_flush() which _must_ be signalled from the
>>>>> * driver before returning from its napi->poll() routine. See the comment above
>>>>> * xdp_do_flush() in filter.c.
>>>>> */
>>>>> -void __dev_flush(struct list_head *flush_list)
>>>>> +void __dev_flush(struct list_head *flush_list, int *redirects)
>>>>> {
>>>>> struct xdp_dev_bulk_queue *bq, *tmp;
>>>>>
>>>>> list_for_each_entry_safe(bq, tmp, flush_list, flush_node) {
>>>>> - bq_xmit_all(bq, XDP_XMIT_FLUSH);
>>>>> + bq_xmit_all(bq, XDP_XMIT_FLUSH, redirects);
>>>>> bq->dev_rx = NULL;
>>>>> bq->xdp_prog = NULL;
>>>>> __list_del_clearprev(&bq->flush_node);
>>>>> @@ -458,8 +470,17 @@ static void bq_enqueue(struct net_device *dev, struct xdp_frame *xdpf,
>>>>> {
>>>>> struct xdp_dev_bulk_queue *bq = this_cpu_ptr(dev->xdp_bulkq);
>>>>>
>>>>> - if (unlikely(bq->count == DEV_MAP_BULK_SIZE))
>>>>> - bq_xmit_all(bq, 0);
>>>>> + if (unlikely(bq->count == DEV_MAP_BULK_SIZE)) {
>>>>> + int redirects = 0;
>>>>> +
>>>>> + bq_xmit_all(bq, 0, &redirects);
>>>>> +
>>>>> + /* according to comment above xdp_do_flush() in
>>>>> + * filter.c, xdp_do_flush is always called at
>>>>> + * the end of the NAPI anyway, so no need to act
>>>>> + * on the redirects here
>>>>> + */
>>>>
>>>> While it's true that it will be called again in NAPI, the purpose of
>>>> calling bq_xmit_all() here is to make room space for the packet on the
>>>> bulk queue that we're about to enqueue, and if bq_xmit_all() can just
>>>> put the packet back on the queue, there is no guarantee that this will
>>>> succeed. So you will have to handle that case here.
>>>>
>>>> Since there's also a potential infinite recursion issue in the
>>>> do_flush() functions below, I think it may be better to handle this
>>>> looping issue inside bq_xmit_all().
>>>>
>>>> I.e., structure the code so that bq_xmit_all() guarantees that when it
>>>> returns it has actually done its job; that is, that bq->q is empty.
>>>>
>>>> Given the above "move all frames out of bq->q at the start" change, this
>>>> is not all that hard. Simply add a check after the out: label (in
>>>> bq_xmit_all()) to check if bq->count is actually 0, and if it isn't,
>>>> start over from the beginning of that function. This also makes it
>>>> straight forward to add a recursion limit; after looping a set number of
>>>> times (say, XMIT_RECURSION_LIMIT), simply turn XDP_REDIRECT into drops.
>>>>
>>>> There will need to be some additional protection against looping forever
>>>> in __dev_flush(), to handle the case where a packet is looped between
>>>> two interfaces. This one is a bit trickier, but a similar recursion
>>>> counter could be used, I think.
>>>
>>>
>>> Thanks a lot for the extensive support!
>>> Regarding __dev_flush(), could the following already be sufficient?
>>>
>>> void __dev_flush(struct list_head *flush_list)
>>> {
>>> struct xdp_dev_bulk_queue *bq, *tmp;
>>> int i,j;
>>>
>>> for (i = 0; !list_empty(flush_list) && i < XMIT_RECURSION_LIMIT; i++) {
>>> /* go through list in reverse so that new items
>>> * added to the flush_list will only be handled
>>> * in the next iteration of the outer loop
>>> */
>>> list_for_each_entry_safe_reverse(bq, tmp, flush_list, flush_node) {
>>> bq_xmit_all(bq, XDP_XMIT_FLUSH);
>>> bq->dev_rx = NULL;
>>> bq->xdp_prog = NULL;
>>> __list_del_clearprev(&bq->flush_node);
>>> }
>>> }
>>>
>>> if (i == XMIT_RECURSION_LIMIT) {
>>> pr_warn("XDP recursion limit hit, expect packet loss!\n");
>>>
>>> list_for_each_entry_safe(bq, tmp, flush_list, flush_node) {
>>> for (j = 0; j < bq->count; j++)
>>> xdp_return_frame_rx_napi(bq->q[i]);
>>>
>>> bq->count = 0;
>>> bq->dev_rx = NULL;
>>> bq->xdp_prog = NULL;
>>> __list_del_clearprev(&bq->flush_node);
>>> }
>>> }
>>> }
>>
>> Yeah, this would work, I think (neat trick with iterating the list in
>> reverse!), but instead of the extra loop in the end, I would suggest
>> passing in an extra 'allow_redirect' parameter to bq_xmit_all(). Since
>> you'll already have to handle the recursion limit inside that function,
>> you can just reuse the same functionality by passing in the parameter in
>> the last iteration of the first loop.
>>
>> Also, definitely don't put an unconditional warn_on() in the fast path -
>> that brings down the system really quickly if it's misconfigured :)
>>
>> A warn_on_once() could work, but really I would suggest just reusing the
>> _trace_xdp_redirect_map_err() tracepoint with a new error code (just has
>> to be one we're not currently using and that vaguely resembles what this
>> is about; ELOOP, EOVERFLOW or EDEADLOCK, maybe?).
>
>
> The 'allow_redirect' parameter is a very good idea! If I also forward that
> to dev_map_bpf_prog_run and do something like the following, I can just
> reuse the existing error handling. Or is that too implict/too ugly?
>
> switch (act) {
> case XDP_PASS:
> err = xdp_update_frame_from_buff(&xdp, xdpf);
> if (unlikely(err < 0))
> xdp_return_frame_rx_napi(xdpf);
> else
> frames[nframes++] = xdpf;
> break;
> case XDP_REDIRECT:
> if (allow_redirect) {
> err = xdp_do_redirect(rx_dev, &xdp, xdp_prog);
> if (unlikely(err))
> xdp_return_frame_rx_napi(xdpf);
> else
> *redirects += 1;
> break;
> } else
> fallthrough;
> default:
> bpf_warn_invalid_xdp_action(NULL, xdp_prog, act);
> fallthrough;

Yes, I was imagining something like this. Not sure if we want to turn it
into an "invalid action" this way (it should probably be a separate
tracepoint as I mentioned in the previous email). But otherwise, yeah!

> case XDP_ABORTED:
> trace_xdp_exception(tx_dev, xdp_prog, act);
> fallthrough;
> case XDP_DROP:
> xdp_return_frame_rx_napi(xdpf);
> break;
> }
>
>
>>
>>>> In any case, this needs extensive selftests, including ones with devmap
>>>> programs that loop packets (both redirect from a->a, and from
>>>> a->b->a->b) to make sure the limits work correctly.
>>>
>>>
>>> Good point! I am going to prepare some.
>>>
>>>
>>>>
>>>>> + }
>>>>>
>>>>> /* Ingress dev_rx will be the same for all xdp_frame's in
>>>>> * bulk_queue, because bq stored per-CPU and must be flushed
>>>>> diff --git a/net/core/filter.c b/net/core/filter.c
>>>>> index 8569cd2482ee..b33fc0b1444a 100644
>>>>> --- a/net/core/filter.c
>>>>> +++ b/net/core/filter.c
>>>>> @@ -4287,14 +4287,18 @@ static const struct bpf_func_proto bpf_xdp_adjust_meta_proto = {
>>>>> void xdp_do_flush(void)
>>>>> {
>>>>> struct list_head *lh_map, *lh_dev, *lh_xsk;
>>>>> + int redirect;
>>>>>
>>>>> - bpf_net_ctx_get_all_used_flush_lists(&lh_map, &lh_dev, &lh_xsk);
>>>>> - if (lh_dev)
>>>>> - __dev_flush(lh_dev);
>>>>> - if (lh_map)
>>>>> - __cpu_map_flush(lh_map);
>>>>> - if (lh_xsk)
>>>>> - __xsk_map_flush(lh_xsk);
>>>>> + do {
>>>>> + redirect = 0;
>>>>> + bpf_net_ctx_get_all_used_flush_lists(&lh_map, &lh_dev, &lh_xsk);
>>>>> + if (lh_dev)
>>>>> + __dev_flush(lh_dev, &redirect);
>>>>> + if (lh_map)
>>>>> + __cpu_map_flush(lh_map);
>>>>> + if (lh_xsk)
>>>>> + __xsk_map_flush(lh_xsk);
>>>>> + } while (redirect > 0);
>>>>> }
>>>>> EXPORT_SYMBOL_GPL(xdp_do_flush);
>>>>>
>>>>> @@ -4303,20 +4307,24 @@ void xdp_do_check_flushed(struct napi_struct *napi)
>>>>> {
>>>>> struct list_head *lh_map, *lh_dev, *lh_xsk;
>>>>> bool missed = false;
>>>>> + int redirect;
>>>>>
>>>>> - bpf_net_ctx_get_all_used_flush_lists(&lh_map, &lh_dev, &lh_xsk);
>>>>> - if (lh_dev) {
>>>>> - __dev_flush(lh_dev);
>>>>> - missed = true;
>>>>> - }
>>>>> - if (lh_map) {
>>>>> - __cpu_map_flush(lh_map);
>>>>> - missed = true;
>>>>> - }
>>>>> - if (lh_xsk) {
>>>>> - __xsk_map_flush(lh_xsk);
>>>>> - missed = true;
>>>>> - }
>>>>> + do {
>>>>> + redirect = 0;
>>>>> + bpf_net_ctx_get_all_used_flush_lists(&lh_map, &lh_dev, &lh_xsk);
>>>>> + if (lh_dev) {
>>>>> + __dev_flush(lh_dev, &redirect);
>>>>> + missed = true;
>>>>> + }
>>>>> + if (lh_map) {
>>>>> + __cpu_map_flush(lh_map);
>>>>> + missed = true;
>>>>> + }
>>>>> + if (lh_xsk) {
>>>>> + __xsk_map_flush(lh_xsk);
>>>>> + missed = true;
>>>>> + }
>>>>> + } while (redirect > 0);
>>>>
>>>> With the change suggested above (so that bq_xmit_all() guarantees the
>>>> flush is completely done), this looping is not needed anymore. However,
>>>> it becomes important in which *order* the flushing is done
>>>> (__dev_flush() should always happen first), so adding a comment to note
>>>> this would be good.
>>>
>>>
>>> I guess, if we remove the loop here and we still want to cover the case of
>>> redirecting from devmap program via cpumap, we need to fetch the lh_map again
>>> after calling __dev_flush, right?
>>> So I think we should no longer use bpf_net_ctx_get_all_used_flush_lists then:
>>>
>>> lh_dev = bpf_net_ctx_get_dev_flush_list();
>>> if (lh_dev)
>>> __dev_flush(lh_dev);
>>> lh_map = bpf_net_ctx_get_cpu_map_flush_list();
>>> if (lh_map)
>>> __cpu_map_flush(lh_map);
>>> lh_xsk = bpf_net_ctx_get_xskmap_flush_list();
>>> if (lh_xsk)
>>> __xsk_map_flush(lh_xsk);
>>>
>>> But bpf_net_ctx_get_all_used_flush_lists also includes additional checks
>>> for IS_ENABLED(CONFIG_BPF_SYSCALL) and IS_ENABLED(CONFIG_XDP_SOCKETS),
>>> so I guess they should be directly included in the xdp_do_flush and
>>> xdp_do_check_flushed?
>>> Then we could remove the bpf_net_ctx_get_all_used_flush_lists.
>>> Or do you have an idea for a more elegant solution?
>>
>> I think cpumap is fine because that doesn't immediately redirect back to
>> the bulk queue; instead __cpu_map_flush() will put the frames on a
>> per-CPU ring buffer and wake the kworker on that CPU. Which will in turn
>> do another xdp_do_flush() if the cpumap program does a redirect. So in
>> other words, we only need to handle recursion of devmap redirects :)
>
> I likely miss something here, but the scenario I am thinking about is the following:
>
> 1. cpu_map and xsk_map flush list are empty, while the dev flush map has
> a single frame pending, i.e. in the current implementation after
> executing bpf_net_ctx_get_all_used_flush_lists:
> lh_dev = something
> lh_map = lh_xsk = NULL
>
> 2. __dev_flush gets called and executes a bpf program on the frame
> that returns with "return bpf_redirect_map(&cpu_map, 0, 0);"
> and that adds an item to the cpumap flush list.
>
> 3. Since __dev_flush is only able to handle devmap redirects itself,
> the item is still on the cpumap flush list after __dev_flush
> has returned.
>
> 4. lh_map, however, is still NULL, so __cpu_map_flush does not
> get called and thus the kworker is never woken up.
>
> That is at least what I see on the running system that the kworker
> is never woken up, but I might have done something else wrong...

Ah, yes, I see what you mean. Yup, you're right,
bpf_net_ctx_get_all_used_flush_lists() will no longer work, we'll have
to get the others after __dev_flush()

-Toke