Re: [PATCH net-next v2 5/9] tun: use bulk NAPI cache allocation in tun_xdp_one

From: Jesper Dangaard Brouer
Date: Tue Dec 02 2025 - 12:33:00 EST




On 02/12/2025 17.49, Jon Kohler wrote:


On Nov 27, 2025, at 10:02 PM, Jason Wang <jasowang@xxxxxxxxxx> wrote:

On Wed, Nov 26, 2025 at 3:19 AM Jon Kohler <jon@xxxxxxxxxxx> wrote:

Optimize TUN_MSG_PTR batch processing by allocating sk_buff structures
in bulk from the per-CPU NAPI cache using napi_skb_cache_get_bulk.
This reduces allocation overhead and improves efficiency, especially
when IFF_NAPI is enabled and GRO is feeding entries back to the cache.

Does this mean we should only enable this when NAPI is used?

No, it does not mean that at all, but I see what that would be confusing.
I can clean up the commit msg on the next go around


If bulk allocation cannot fully satisfy the batch, gracefully drop only
the uncovered portion, allowing the rest of the batch to proceed, which
is what already happens in the previous case where build_skb() would
fail and return -ENOMEM.

Signed-off-by: Jon Kohler <jon@xxxxxxxxxxx>

Do we have any benchmark result for this?

Yes, it is in the cover letter:
https://patchwork.kernel.org/project/netdevbpf/cover/20251125200041.1565663-1-jon@xxxxxxxxxxx/

---
drivers/net/tun.c | 30 ++++++++++++++++++++++++------
1 file changed, 24 insertions(+), 6 deletions(-)

diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index 97f130bc5fed..64f944cce517 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
[...]
@@ -2454,6 +2455,7 @@ static int tun_xdp_one(struct tun_struct *tun,
ret = tun_xdp_act(tun, xdp_prog, xdp, act);
if (ret < 0) {
/* tun_xdp_act already handles drop statistics */
+ kfree_skb_reason(skb, SKB_DROP_REASON_XDP);

This should belong to previous patches?

Well, not really, as we did not even have an SKB to free at this point
in the previous code

put_page(virt_to_head_page(xdp->data));

This calling put_page() directly also looks dubious.

return ret;
}
@@ -2463,6 +2465,7 @@ static int tun_xdp_one(struct tun_struct *tun,
*flush = true;
fallthrough;
case XDP_TX:
+ napi_consume_skb(skb, 1);
return 0;
case XDP_PASS:
break;
@@ -2475,13 +2478,15 @@ static int tun_xdp_one(struct tun_struct *tun,
tpage->page = page;
tpage->count = 1;
}
+ napi_consume_skb(skb, 1);

I wonder if this would have any side effects since tun_xdp_one() is
not called by a NAPI.

As far as I can tell, this napi_consume_skb is really just an artifact of
how it was named and how it was traditionally used.

Now this is really just a napi_consume_skb within a bh disable/enable
section, which should meet the requirements of how that interface
should be used (again, AFAICT)


Yicks - this sounds super ugly. Just wrapping napi_consume_skb() in bh
disable/enable section and then assuming you get the same protection as
NAPI is really dubious.

Cc Sebastian as he is trying to cleanup these kind of use-case, to make
kernel preemption work.



return 0;
}
}

build:
- skb = build_skb(xdp->data_hard_start, buflen);
+ skb = build_skb_around(skb, xdp->data_hard_start, buflen);
if (!skb) {
+ kfree_skb_reason(skb, SKB_DROP_REASON_NOMEM);

Though to your point, I dont think this actually does anything given
that if the skb was somehow nuked as part of build_skb_around, there
would not be an skb to free. Doesn’t hurt though, from a self documenting
code perspective tho?

dev_core_stats_rx_dropped_inc(tun->dev);
return -ENOMEM;
}
@@ -2566,9 +2571,11 @@ static int tun_sendmsg(struct socket *sock, struct msghdr *m, size_t total_len)
if (m->msg_controllen == sizeof(struct tun_msg_ctl) &&
ctl && ctl->type == TUN_MSG_PTR) {
struct bpf_net_context __bpf_net_ctx, *bpf_net_ctx;
+ int flush = 0, queued = 0, num_skbs = 0;
struct tun_page tpage;
int n = ctl->num;
- int flush = 0, queued = 0;
+ /* Max size of VHOST_NET_BATCH */
+ void *skbs[64];

I think we need some tweaks

1) TUN is decoupled from vhost, so it should have its own value (a
macro is better)

Sure, I can make another constant that does a similar thing

2) Provide a way to fail or handle the case when more than 64

What if we simply assert that the maximum here is 64, which I think
is what it actually is in practice?



memset(&tpage, 0, sizeof(tpage));

@@ -2576,13 +2583,24 @@ static int tun_sendmsg(struct socket *sock, struct msghdr *m, size_t total_len)
rcu_read_lock();
bpf_net_ctx = bpf_net_ctx_set(&__bpf_net_ctx);

- for (i = 0; i < n; i++) {
+ num_skbs = napi_skb_cache_get_bulk(skbs, n);

Its document said:

"""
* Must be called *only* from the BH context.
“"”
We’re in a bh_disable section here, is that not good enough?

Again this feels very ugly and prone to painting ourselves into a
corner, assuming BH-disabled sections have same protection as NAPI.

(The napi_skb_cache_get/put function are operating on per CPU arrays
without any locking.)

--Jesper