Re: [RFC 2/2] xdp: Delegate fast path return decision to page_pool

From: Dragos Tatulea

Date: Tue Nov 11 2025 - 13:26:01 EST


On Tue, Nov 11, 2025 at 08:54:37AM +0100, Jesper Dangaard Brouer wrote:
>
>
> On 10/11/2025 19.51, Dragos Tatulea wrote:
> > On Mon, Nov 10, 2025 at 12:06:08PM +0100, Jesper Dangaard Brouer wrote:
> > >
> > >
> > > On 07/11/2025 11.28, Dragos Tatulea wrote:
> > > > XDP uses the BPF_RI_F_RF_NO_DIRECT flag to mark contexts where it is not
> > > > allowed to do direct recycling, even though the direct flag was set by
> > > > the caller. This is confusing and can lead to races which are hard to
> > > > detect [1].
> > > >
> > > > Furthermore, the page_pool already contains an internal
> > > > mechanism which checks if it is safe to switch the direct
> > > > flag from off to on.
> > > >
> > > > This patch drops the use of the BPF_RI_F_RF_NO_DIRECT flag and always
> > > > calls the page_pool release with the direct flag set to false. The
> > > > page_pool will decide if it is safe to do direct recycling. This
> > > > is not free but it is worth it to make the XDP code safer. The
> > > > next paragraphs discuss the performance impact.
> > > >
> > > > Performance wise, there are 3 cases to consider. Looking from
> > > > __xdp_return() for MEM_TYPE_PAGE_POOL case:
> > > >
> > > > 1) napi_direct == false:
> > > > - Before: 1 comparison in __xdp_return() + call of
> > > > page_pool_napi_local() from page_pool_put_unrefed_netmem().
> > > > - After: Only one call to page_pool_napi_local().
> > > >
> > > > 2) napi_direct == true && BPF_RI_F_RF_NO_DIRECT
> > > > - Before: 2 comparisons in __xdp_return() + call of
> > > > page_pool_napi_local() from page_pool_put_unrefed_netmem().
> > > > - After: Only one call to page_pool_napi_local().
> > > >
> > > > 3) napi_direct == true && !BPF_RI_F_RF_NO_DIRECT
> > > > - Before: 2 comparisons in __xdp_return().
> > > > - After: One call to page_pool_napi_local()
> > > >
> > > > Case 1 & 2 are the slower paths and they only have to gain.
> > > > But they are slow anyway so the gain is small.
> > > >
> > > > Case 3 is the fast path and is the one that has to be considered more
> > > > closely. The 2 comparisons from __xdp_return() are swapped for the more
> > > > expensive page_pool_napi_local() call.
> > > >
> > > > Using the page_pool benchmark between the fast-path and the
> > > > newly-added NAPI aware mode to measure [2] how expensive
> > > > page_pool_napi_local() is:
> > > >
> > > > bench_page_pool: time_bench_page_pool01_fast_path(): in_serving_softirq fast-path
> > > > bench_page_pool: Type:tasklet_page_pool01_fast_path Per elem: 15 cycles(tsc) 7.537 ns (step:0)
> > > >
> > > > bench_page_pool: time_bench_page_pool04_napi_aware(): in_serving_softirq fast-path
> > > > bench_page_pool: Type:tasklet_page_pool04_napi_aware Per elem: 20 cycles(tsc) 10.490 ns (step:0)
> > > >
> > >
> > > IMHO fast-path slowdown is significant. This fast-path is used for the
> > > XDP_DROP use-case in drivers. The fast-path is competing with the speed
> > > of updating an (per-cpu) array and a function-call overhead. The
> > > performance target for XDP_DROP is NIC *wirespeed* which at 100Gbit/s is
> > > 148Mpps (or 6.72ns between packets).
> > >
> > > I still want to seriously entertain this idea, because (1) because the
> > > bug[1] was hard to find, and (2) this is mostly an XDP API optimization
> > > that isn't used by drivers (they call page_pool APIs directly for
> > > XDP_DROP case).
> > > Drivers can do this because they have access to the page_pool instance.
> > >
> > > Thus, this isn't an XDP_DROP use-case.
> > > - This is either XDP_REDIRECT or XDP_TX use-case.
> > >
> > > The primary change in this patch is, changing the XDP API call
> > > xdp_return_frame_rx_napi() effectively to xdp_return_frame().
> > >
> > > Looking at code users of this call:
> > > (A) Seeing a number of drivers using this to speed up XDP_TX when
> > > *completing* packets from TX-ring.
> > > (B) drivers/net/xen-netfront.c use looks incorrect.
> > > (C) drivers/net/virtio_net.c use can easily be removed.
> > > (D) cpumap.c and drivers/net/tun.c should not be using this call.
> > > (E) devmap.c is the main user (with multiple calls)
> > >
> > > The (A) user will see a performance drop for XDP_TX, but these driver
> > > should be able to instead call the page_pool APIs directly as they
> > > should have access to the page_pool instance.
> > >
> > > Users (B)+(C)+(D) simply need cleanup.
> > >
> > > User (E): devmap is the most important+problematic user (IIRC this was
> > > the cause of bug[1]). XDP redirecting into devmap and running a new
> > > XDP-prog (per target device) was a prime user of this call
> > > xdp_return_frame_rx_napi() as it gave us excellent (e.g. XDP_DROP)
> > > performance.
> > >
> > Thanks for the analysis Jesper.
>
> Thanks for working on this! It is long overdue that we clean this up.
> I think I spotted another bug in veth related to
> xdp_clear_return_frame_no_direct() and when NAPI exits.
>
What is the issue? Beyond the fact that the code is using
xdp_return_frame(), which doesn't require
xdp_clear_return_frame_no_direct() anyway.

> > > Perhaps we should simply measure the impact on devmap + 2nd XDP-prog
> > > doing XDP_DROP. Then, we can see if overhead is acceptable... ?
> > >
> > Will try. Just to make sure we are on the same page, AFAIU the setup
> > would be:
> > XDP_REDIRECT NIC1 -> veth ingress side and XDP_DROP veth egress side?
>
> No, this isn't exactly what I meant. But the people who wrote this
> blogpost ([1] https://loopholelabs.io/blog/xdp-for-egress-traffic )
> depend on the performance of that scenario with veth pairs.
>
> When doing redirect-map, then you can attach a 2nd XDP-prog per map
> target "egress" device. That 2nd XDP-prog should do a XDP_DROP as that
> will allow us to measure the code path we are talking about. I want the
> test to hit this code line [2].
> [2] https://elixir.bootlin.com/linux/v6.17.7/source/kernel/bpf/
> devmap.c#L368.
>
> The xdp-bench[3] tool unfortunately doesn't support a program mode for
> the 2nd XDP-prog, so I did this code change:
>
> diff --git a/xdp-bench/xdp_redirect_devmap.bpf.c
> b/xdp-bench/xdp_redirect_devmap.bpf.c
> index 0212e824e2fa..39a24f8834e8 100644
> --- a/xdp-bench/xdp_redirect_devmap.bpf.c
> +++ b/xdp-bench/xdp_redirect_devmap.bpf.c
> @@ -76,6 +76,8 @@ int xdp_redirect_devmap_egress(struct xdp_md *ctx)
> struct ethhdr *eth = data;
> __u64 nh_off;
>
> + return XDP_DROP;
> +
> nh_off = sizeof(*eth);
> if (data + nh_off > data_end)
> return XDP_DROP;
>
> [3] https://github.com/xdp-project/xdp-tools/tree/main/xdp-bench
>
> And then you can run this command:
> sudo ./xdp-bench redirect-map --load-egress mlx5p1 mlx5p1
>
Ah, yes! I was not aware of the egress part of the program.
That did the trick. The drop happens before reaching the TX
queue of the second netdev and the mentioned code in devmap.c
is reached.

Sender is xdp-trafficgen with 3 threads pushing enough on one RX queue
to saturate the CPU.

Here's what I got:

* before:

eth2->eth3 16,153,328 rx/s 16,153,329 err,drop/s 0 xmit/s
xmit eth2->eth3 0 xmit/s 16,153,329 drop/s 0 drv_err/s 16.00 bulk-avg
eth2->eth3 16,152,538 rx/s 16,152,546 err,drop/s 0 xmit/s
xmit eth2->eth3 0 xmit/s 16,152,546 drop/s 0 drv_err/s 16.00 bulk-avg
eth2->eth3 16,156,331 rx/s 16,156,337 err,drop/s 0 xmit/s
xmit eth2->eth3 0 xmit/s 16,156,337 drop/s 0 drv_err/s 16.00 bulk-avg

* after:

eth2->eth3 16,105,461 rx/s 16,105,469 err,drop/s 0 xmit/s
xmit eth2->eth3 0 xmit/s 16,105,469 drop/s 0 drv_err/s 16.00 bulk-avg
eth2->eth3 16,119,550 rx/s 16,119,541 err,drop/s 0 xmit/s
xmit eth2->eth3 0 xmit/s 16,119,541 drop/s 0 drv_err/s 16.00 bulk-avg
eth2->eth3 16,092,145 rx/s 16,092,154 err,drop/s 0 xmit/s
xmit eth2->eth3 0 xmit/s 16,092,154 drop/s 0 drv_err/s 16.00 bulk-avg

So slightly worse... I don't fully trust the measurements though, as I
saw the inverse in other tests as well: a higher rate after the patch.

Perf top:

* before:
13.15% [kernel] [k] __xdp_return
11.36% bpf_prog_3f68498fa592198e_redir_devmap_native [k] bpf_prog_3f68498fa592198e_redir_devmap_native
9.60% [mlx5_core] [k] mlx5e_skb_from_cqe_mpwrq_linear
8.19% [mlx5_core] [k] mlx5e_handle_rx_cqe_mpwrq
7.54% [mlx5_core] [k] mlx5e_poll_rx_cq
6.23% [kernel] [k] xdp_do_redirect
5.10% [kernel] [k] page_pool_put_unrefed_netmem
4.86% [mlx5_core] [k] mlx5e_post_rx_mpwqes
4.78% [mlx5_core] [k] mlx5e_xdp_handle
3.87% [kernel] [k] dev_map_bpf_prog_run
2.74% [mlx5_core] [k] mlx5e_page_release_fragmented.isra.0
2.51% [kernel] [k] dev_map_enqueue
2.33% [kernel] [k] dev_map_redirect
2.19% [kernel] [k] page_pool_alloc_netmems
2.18% [kernel] [k] xdp_return_frame_rx_napi
1.75% [kernel] [k] bq_enqueue
1.64% [kernel] [k] bpf_dispatcher_xdp_func
1.37% [kernel] [k] bq_xmit_all
1.34% [kernel] [k] htab_map_update_elem_in_place
1.20% [mlx5_core] [k] mlx5e_poll_ico_cq
1.10% [mlx5_core] [k] mlx5e_free_rx_mpwqe
0.66% bpf_prog_07d302889c674206_tp_xdp_devmap_xmit_multi [k] bpf_prog_07d302889c674206_tp_xdp_devmap_xmit_multi
0.55% bpf_prog_b30cf65b7e0fa9c7_xdp_redirect_devmap_egress [k] bpf_prog_b30cf65b7e0fa9c7_xdp_redirect_devmap_egress
0.40% [kernel] [k] htab_map_hash
0.35% [kernel] [k] __dev_flush

* after:
12.42% [kernel] [k] __xdp_return
10.22% bpf_prog_3f68498fa592198e_redir_devmap_native [k] bpf_prog_3f68498fa592198e_redir_devmap_native
9.04% [mlx5_core] [k] mlx5e_skb_from_cqe_mpwrq_linear
8.34% [mlx5_core] [k] mlx5e_handle_rx_cqe_mpwrq
7.93% [mlx5_core] [k] mlx5e_poll_rx_cq
6.51% [kernel] [k] xdp_do_redirect
5.24% [mlx5_core] [k] mlx5e_post_rx_mpwqes
5.01% [kernel] [k] page_pool_put_unrefed_netmem
5.01% [mlx5_core] [k] mlx5e_xdp_handle
3.76% [kernel] [k] dev_map_bpf_prog_run
2.92% [mlx5_core] [k] mlx5e_page_release_fragmented.isra.0
2.56% [kernel] [k] dev_map_enqueue
2.38% [kernel] [k] dev_map_redirect
2.09% [kernel] [k] page_pool_alloc_netmems
1.70% [kernel] [k] xdp_return_frame
1.67% [kernel] [k] bq_xmit_all
1.66% [kernel] [k] bq_enqueue
1.63% [kernel] [k] bpf_dispatcher_xdp_func
1.27% [kernel] [k] htab_map_update_elem_in_place
1.20% [mlx5_core] [k] mlx5e_free_rx_mpwqe
1.08% [mlx5_core] [k] mlx5e_poll_ico_cq
0.67% bpf_prog_07d302889c674206_tp_xdp_devmap_xmit_multi [k] bpf_prog_07d302889c674206_tp_xdp_devmap_xmit_multi
0.59% [kernel] [k] xdp_return_frame_rx_napi
0.54% bpf_prog_b30cf65b7e0fa9c7_xdp_redirect_devmap_egress [k] bpf_prog_b30cf65b7e0fa9c7_xdp_redirect_devmap_egress
0.46% [kernel] [k] htab_map_hash
0.38% [kernel] [k] __dev_flush
0.35% [kernel] [k] net_rx_action

In both cases pp_alloc_fast == pp_recycle_cached.
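(For the record, the counters came from the driver's ethtool stats;
something along these lines, where the interface name is hypothetical and
the pp_* counters only show up when the kernel is built with
CONFIG_PAGE_POOL_STATS:)

```shell
# Dump the page_pool counters mlx5 exposes through ethtool; interface
# name and exact counter prefixes depend on the driver and kernel config.
ethtool -S eth2 | grep -E 'pp_alloc_fast|pp_recycle_cached'
```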

> Toke (and I) would appreciate it if you added code for this to
> xdp-bench, supporting a --program-mode like 'redirect-cpu' does.
>
>
Ok. I will add it.

Thanks,
Dragos