Re: [RFC 2/2] xdp: Delegate fast path return decision to page_pool
From: Jesper Dangaard Brouer
Date: Tue Nov 11 2025 - 02:55:09 EST
On 10/11/2025 19.51, Dragos Tatulea wrote:
On Mon, Nov 10, 2025 at 12:06:08PM +0100, Jesper Dangaard Brouer wrote:
Thanks for the analysis Jesper.
On 07/11/2025 11.28, Dragos Tatulea wrote:
XDP uses the BPF_RI_F_RF_NO_DIRECT flag to mark contexts where it is not
allowed to do direct recycling, even though the direct flag was set by
the caller. This is confusing and can lead to races which are hard to
detect [1].
Furthermore, the page_pool already contains an internal
mechanism which checks if it is safe to switch the direct
flag from off to on.
This patch drops the use of the BPF_RI_F_RF_NO_DIRECT flag and always
calls the page_pool release with the direct flag set to false. The
page_pool will decide if it is safe to do direct recycling. This
is not free, but it is worth it to make the XDP code safer. The
next paragraphs discuss the performance impact.
Performance wise, there are 3 cases to consider. Looking from
__xdp_return() for MEM_TYPE_PAGE_POOL case:
1) napi_direct == false:
- Before: 1 comparison in __xdp_return() + call of
page_pool_napi_local() from page_pool_put_unrefed_netmem().
- After: Only one call to page_pool_napi_local().
2) napi_direct == true && BPF_RI_F_RF_NO_DIRECT
- Before: 2 comparisons in __xdp_return() + call of
page_pool_napi_local() from page_pool_put_unrefed_netmem().
- After: Only one call to page_pool_napi_local().
3) napi_direct == true && !BPF_RI_F_RF_NO_DIRECT
- Before: 2 comparisons in __xdp_return().
- After: One call to page_pool_napi_local()
Case 1 & 2 are the slower paths and they only have to gain.
But they are slow anyway so the gain is small.
Case 3 is the fast path and is the one that has to be considered more
closely. The 2 comparisons from __xdp_return() are swapped for the more
expensive page_pool_napi_local() call.
Using the page_pool benchmark between the fast-path and the
newly-added NAPI aware mode to measure [2] how expensive
page_pool_napi_local() is:
bench_page_pool: time_bench_page_pool01_fast_path(): in_serving_softirq fast-path
bench_page_pool: Type:tasklet_page_pool01_fast_path Per elem: 15 cycles(tsc) 7.537 ns (step:0)
bench_page_pool: time_bench_page_pool04_napi_aware(): in_serving_softirq fast-path
bench_page_pool: Type:tasklet_page_pool04_napi_aware Per elem: 20 cycles(tsc) 10.490 ns (step:0)
IMHO fast-path slowdown is significant. This fast-path is used for the
XDP_DROP use-case in drivers. The fast-path is competing with the speed
of updating an (per-cpu) array and a function-call overhead. The
performance target for XDP_DROP is NIC *wirespeed* which at 100Gbit/s is
148Mpps (or 6.72ns between packets).
I still want to seriously entertain this idea, because (1) the
bug[1] was hard to find, and (2) this is mostly an XDP API optimization
that isn't used by drivers (they call page_pool APIs directly for
XDP_DROP case).
Drivers can do this because they have access to the page_pool instance.
Thus, this isn't an XDP_DROP use-case.
- This is either the XDP_REDIRECT or the XDP_TX use-case.
The primary change in this patch is effectively turning the XDP API
call xdp_return_frame_rx_napi() into xdp_return_frame().
Looking at code users of this call:
(A) Seeing a number of drivers using this to speed up XDP_TX when
*completing* packets from TX-ring.
(B) drivers/net/xen-netfront.c use looks incorrect.
(C) drivers/net/virtio_net.c use can easily be removed.
(D) cpumap.c and drivers/net/tun.c should not be using this call.
(E) devmap.c is the main user (with multiple calls)
The (A) users will see a performance drop for XDP_TX, but these
drivers should be able to instead call the page_pool APIs directly,
as they should have access to the page_pool instance.
Users (B)+(C)+(D) simply need cleanup.
User (E): devmap is the most important+problematic user (IIRC this was
the cause of bug[1]). XDP redirecting into devmap and running a new
XDP-prog (per target device) was a prime user of this call
xdp_return_frame_rx_napi() as it gave us excellent (e.g. XDP_DROP)
performance.
Thanks for working on this! It is long overdue that we clean this up.
I think I spotted another bug in veth related to
xdp_clear_return_frame_no_direct() and when NAPI exits.
Perhaps we should simply measure the impact on devmap + 2nd XDP-prog
doing XDP_DROP. Then, we can see if overhead is acceptable... ?

Will try. Just to make sure we are on the same page, AFAIU the setup
would be:
XDP_REDIRECT NIC1 -> veth ingress side and XDP_DROP veth egress side?
No, this isn't exactly what I meant. But the people that wrote this
blogpost ([1] https://loopholelabs.io/blog/xdp-for-egress-traffic ) are
dependent on the performance of that scenario with veth pairs.
When doing redirect-map, then you can attach a 2nd XDP-prog per map
target "egress" device. That 2nd XDP-prog should do a XDP_DROP as that
will allow us to measure the code path we are talking about. I want
the test to hit this code line [2].
[2] https://elixir.bootlin.com/linux/v6.17.7/source/kernel/bpf/devmap.c#L368
The xdp-bench[3] tool unfortunately doesn't support a program-mode for
the 2nd XDP-prog, so I did this code change:
diff --git a/xdp-bench/xdp_redirect_devmap.bpf.c b/xdp-bench/xdp_redirect_devmap.bpf.c
index 0212e824e2fa..39a24f8834e8 100644
--- a/xdp-bench/xdp_redirect_devmap.bpf.c
+++ b/xdp-bench/xdp_redirect_devmap.bpf.c
@@ -76,6 +76,8 @@ int xdp_redirect_devmap_egress(struct xdp_md *ctx)
struct ethhdr *eth = data;
__u64 nh_off;
+ return XDP_DROP;
+
nh_off = sizeof(*eth);
if (data + nh_off > data_end)
return XDP_DROP;
[3] https://github.com/xdp-project/xdp-tools/tree/main/xdp-bench
And then you can run this command:
sudo ./xdp-bench redirect-map --load-egress mlx5p1 mlx5p1
Toke (and I) would appreciate it if you added support for this to
xdp-bench, i.e. a --program-mode like the one 'redirect-cpu' has.
Yes, if the page_pool is not NAPI aware. And the slow path, for reference:
bench_page_pool: time_bench_page_pool02_ptr_ring(): in_serving_softirq fast-path
bench_page_pool: Type:tasklet_page_pool02_ptr_ring Per elem: 30 cycles(tsc) 15.395 ns (step:0)
The devmap user will basically fall back to using this code path.
Thanks,
Dragos