Re: [PATCH net-next v6 5/5] veth: time-based BQL completion coalescing via ethtool tx-usecs

From: Simon Schippers

Date: Mon Jun 08 2026 - 07:08:54 EST

On 5/27/26 15:54, hawk@xxxxxxxxxx wrote:
> From: Simon Schippers <simon.schippers@xxxxxxxxxxxxxx>
>
> Per-packet BQL completion forces DQL to converge on limit=2, causing
> excessive NAPI scheduling overhead and qdisc requeues.
>
> Accumulate BQL completions and flush them when a configurable time
> threshold is exceeded, letting DQL discover a limit that bounds actual
> queuing delay to the configured interval. Coalescing state persists
> across NAPI polls in struct veth_rq so completions can accumulate
> beyond a single budget=64 cycle.
>
> Add ethtool tx-usecs support for runtime tuning. Default is 100 us;
> setting tx-usecs to 0 disables coalescing and falls back to per-packet
> completion.
>
> ethtool -C <veth-dev> tx-usecs 500 # 500us coalescing
> ethtool -C <veth-dev> tx-usecs 0 # per-packet (no coalescing)
>
> Co-developed-by: Jesper Dangaard Brouer <hawk@xxxxxxxxxx>
> Signed-off-by: Jesper Dangaard Brouer <hawk@xxxxxxxxxx>
> Signed-off-by: Simon Schippers <simon.schippers@xxxxxxxxxxxxxx>
> ---

I found an issue: n_bql can grow without bound if producer and
consumer run at the same speed (and tx_usecs is large). That could
trigger a BUG_ON if n_bql grows beyond INT_MAX...
I also noticed that no hardware BQL driver ever completes more
elements than the BQL limit.

Therefore, I propose a simpler logic (see attachment) that completes
either when the usual bql_flush_ns interval expires or when
n_bql > dql.limit. If n_bql > dql.limit, then either we are in the
case above, where the producer is as fast as the consumer, or we have
BQL starvation.

if (state->time + bql_flush_ns <= current_time ||
state->n_bql > peer_txq->dql.limit) {

The comparison must be strictly greater-than (n_bql > dql.limit)
because the producer will always exceed the limit before it stops,
see netdev_tx_sent_queue().
The check is cheap because peer_txq->dql.limit is in the cacheline of
the completion path, see dynamic_queue_limits.h.
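
To illustrate why the extra n_bql > dql.limit term bounds the batch, here
is a minimal user-space sketch of the flush condition. It is not the
driver code: struct bql_state, maybe_complete() and peak_batch() are
hypothetical stand-ins, the limit is treated as a plain packet count, the
clock is a simple counter (one packet per ns), and the producer side and
netdev_tx_completed_queue() are not modelled at all.

```c
#include <assert.h>
#include <limits.h>
#include <stdint.h>

/* Hypothetical user-space stand-in for struct veth_bql_state. */
struct bql_state {
	uint64_t time;      /* start of the current coalescing window, in ns */
	unsigned int n_bql; /* completions batched in the current window */
};

/* Same condition as in the patch; returns how many completions were
 * flushed (0 if the batch is kept).
 */
static unsigned int maybe_complete(struct bql_state *s, uint64_t now,
				   uint64_t flush_ns, unsigned int limit)
{
	unsigned int flushed;

	if (!s->n_bql)
		return 0;

	if (s->time + flush_ns <= now || s->n_bql > limit) {
		flushed = s->n_bql;
		s->time = now;
		s->n_bql = 0;
		return flushed;
	}
	return 0;
}

/* Consume `packets` packets, one per simulated ns, and return the
 * largest n_bql observed right before a flush check.
 */
static unsigned int peak_batch(unsigned int packets, uint64_t flush_ns,
			       unsigned int limit)
{
	struct bql_state s = { .time = 0, .n_bql = 0 };
	unsigned int peak = 0;
	uint64_t now;

	assert(packets > 0);

	for (now = 1; now <= packets; now++) {
		s.n_bql++;
		if (s.n_bql > peak)
			peak = s.n_bql;
		maybe_complete(&s, now, flush_ns, limit);
	}
	return peak;
}
```

With a flush interval that never expires during the run, passing
limit = 8 keeps the peak batch at limit + 1, while passing UINT_MAX
(which effectively disables the second term) lets the batch grow with
the number of packets consumed.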

Another advantage is that we avoid the snippet checking for empty
and BQL stopped, which requires an smp_rmb() and a test_bit().

Apart from that I:
- Now always call veth_bql_maybe_complete() in the for loop to have
  more accurate completion intervals when having mixed XDP and
  non-XDP packets.
- Made it so tx_usecs = 0 is now also a normal case.
- Changed the type of n_bql to uint instead of int.
- Added READ_ONCE()/WRITE_ONCE() for tx_coal_usecs as suggested by Paolo.
- Moved the bql_state init in __veth_napi_enable_range() in front
  of napi_enable() to avoid a race (Sashiko).
- Moved the bql_state reset in veth_napi_del_range() after the
  ptr_ring_cleanup() (probably does not matter but makes sense to me).
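
As a rough illustration of why tx_usecs = 0 needs no special handling,
the sketch below isolates the time half of the flush condition. The
helper names usecs_to_flush_ns() and window_expired() are hypothetical
and not part of the patch; the sketch assumes the window start is never
ahead of the current clock reading, which the patch tries to ensure by
clamping state->time against sched_clock().

```c
#include <assert.h>
#include <stdint.h>

/* Mirrors the conversion in the patch: widen to 64 bits before
 * multiplying so that large tx-usecs values cannot wrap.
 */
static uint64_t usecs_to_flush_ns(uint32_t tx_usecs)
{
	return (uint64_t)tx_usecs * 1000;
}

/* Time half of the flush condition only. */
static int window_expired(uint64_t window_start, uint64_t flush_ns,
			  uint64_t now)
{
	/* Assumption: the stored timestamp was clamped to the clock. */
	assert(window_start <= now);

	return window_start + flush_ns <= now;
}
```

With flush_ns == 0 the window counts as expired on every check, so every
call with n_bql > 0 flushes, which is the per-packet behaviour. With the
default of 100 us, the window stays open until 100000 ns have passed.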

Benchmarks look just fine, see commit message.

WDYT?

Thanks,
Simon
From 59844f703988805ff7913989ed4dcd427ae882af Mon Sep 17 00:00:00 2001
From: Simon Schippers <simon.schippers@xxxxxxxxxxxxxx>
Date: Wed, 27 May 2026 15:54:16 +0200
Subject: [PATCH net-next v6 5/5] veth: time-based BQL completion coalescing
via ethtool tx-usecs

Per-packet BQL completion forces DQL to converge on limit=2, causing
excessive NAPI scheduling overhead and qdisc requeues.

Accumulate BQL completions and flush them when a configurable time
threshold (tx-usecs) is exceeded, letting DQL discover a limit that
bounds actual queuing delay to the configured interval. Coalescing
state persists across NAPI polls in struct veth_rq so completions can
accumulate beyond a single budget=64 cycle.

The flush condition is:

state->time + bql_flush_ns <= current_time || state->n_bql > dql.limit

Flushing when n_bql exceeds dql.limit handles two cases:
- BQL starvation
- The steady-state case where the producer and consumer run at the
same speed with a large tx-usecs, which would otherwise allow n_bql
to grow without bound (and potentially overflow int).

The comparison is strictly greater-than because netdev_tx_sent_queue()
always lets the producer exceed the limit by one before it stops, so
n_bql == dql.limit is a normal in-flight state. dql.limit lives in
the same cacheline as the completion path, so the check is cheap.

Add ethtool tx-usecs support for runtime tuning. Default is 100 us;
setting tx-usecs to 0 disables coalescing and falls back to per-packet
completion.

ethtool -C <veth-dev> tx-usecs 500 # 500us coalescing
ethtool -C <veth-dev> tx-usecs 0 # per-packet (no coalescing)

Benchmarks (10 runs, Ryzen 5 5600X @ 4.3 GHz, SMT off, 3200 MHz RAM):

Throughput (pps)
===========================================================================
nrules | 0us | 50us | 100us | 500us | 1000us | 5000us | 10000us || stock
-------+-------+-------+-------+-------+--------+--------+---------++------
0 | 1.62M | 1.89M | 1.75M | 1.73M | 1.73M | 1.73M | 1.73M || 1.76M
1 | 1.51M | 1.72M | 1.63M | 1.60M | 1.60M | 1.59M | 1.59M || 1.64M
10 | 1.33M | 1.52M | 1.47M | 1.41M | 1.41M | 1.41M | 1.41M || 1.45M
100 | 675K | 748K | 757K | 722K | 722K | 724K | 729K || 737K
1000 | 117K | 125K | 125K | 126K | 124K | 124K | 124K || 126K
10000 | 13K | 13K | 13K | 13K | 13K | 13K | 13K || 13K

Ping RTT ms (avg)
===========================================================================
nrules | 0us | 50us | 100us | 500us | 1000us | 5000us | 10000us || stock
-------+-------+-------+-------+-------+--------+--------+---------++------
0 | 0.017 | 0.090 | 0.137 | 0.138 | 0.138 | 0.138 | 0.138 || 0.133
1 | 0.017 | 0.097 | 0.145 | 0.146 | 0.144 | 0.148 | 0.147 || 0.143
10 | 0.018 | 0.092 | 0.158 | 0.165 | 0.165 | 0.162 | 0.167 || 0.159
100 | 0.031 | 0.104 | 0.181 | 0.317 | 0.317 | 0.317 | 0.311 || 0.305
1000 | 0.142 | 0.198 | 0.314 | 0.991 | 1.69 | 1.82 | 1.82 || 1.76
10000 | 1.12 | 1.72 | 1.74 | 1.76 | 2.88 | 9.27 | 15.9 || 17.4

Ping RTT ms (p99)
===========================================================================
nrules | 0us | 50us | 100us | 500us | 1000us | 5000us | 10000us || stock
-------+-------+-------+-------+-------+--------+--------+---------++------
0 | 0.028 | 0.115 | 0.159 | 0.161 | 0.163 | 0.161 | 0.163 || 0.154
1 | 0.027 | 0.123 | 0.170 | 0.172 | 0.169 | 0.173 | 0.172 || 0.169
10 | 0.030 | 0.117 | 0.190 | 0.193 | 0.195 | 0.192 | 0.196 || 0.186
100 | 0.045 | 0.134 | 0.231 | 0.368 | 0.365 | 0.370 | 0.361 || 0.358
1000 | 0.230 | 0.300 | 0.408 | 0.989 | 2.11 | 2.12 | 2.13 || 2.07
10000 | 0.979 | 1.59 | 1.26 | 2.06 | 3.77 | 9.87 | 20.1 || 20.3

Co-developed-by: Jesper Dangaard Brouer <hawk@xxxxxxxxxx>
Signed-off-by: Jesper Dangaard Brouer <hawk@xxxxxxxxxx>
Signed-off-by: Simon Schippers <simon.schippers@xxxxxxxxxxxxxx>
---
drivers/net/veth.c | 93 ++++++++++++++++++++++++++++++++++++++++++++--
1 file changed, 90 insertions(+), 3 deletions(-)

diff --git a/drivers/net/veth.c b/drivers/net/veth.c
index d5675d9d5236..b9179de628a6 100644
--- a/drivers/net/veth.c
+++ b/drivers/net/veth.c
@@ -28,6 +28,7 @@
#include <linux/bpf_trace.h>
#include <linux/net_tstamp.h>
#include <linux/skbuff_ref.h>
+#include <linux/sched/clock.h>
#include <net/page_pool/helpers.h>

#define DRV_NAME "veth"
@@ -44,6 +45,8 @@
#define VETH_XDP_TX_BULK_SIZE 16
#define VETH_XDP_BATCH 16

+#define VETH_BQL_COAL_TX_USECS 100 /* default tx-usecs for BQL batching */
+
struct veth_stats {
u64 rx_drops;
/* xdp */
@@ -62,6 +65,11 @@ struct veth_rq_stats {
struct u64_stats_sync syncp;
};

+struct veth_bql_state {
+ u64 time; /* sched_clock() when current coalescing window started */
+ uint n_bql; /* BQL completions batched in the current window */
+};
+
struct veth_rq {
struct napi_struct xdp_napi;
struct napi_struct __rcu *napi; /* points to xdp_napi when the latter is initialized */
@@ -69,6 +77,7 @@ struct veth_rq {
struct bpf_prog __rcu *xdp_prog;
struct xdp_mem_info xdp_mem;
struct veth_rq_stats stats;
+ struct veth_bql_state bql_state;
bool rx_notify_masked;
struct ptr_ring xdp_ring;
struct xdp_rxq_info xdp_rxq;
@@ -81,6 +90,7 @@ struct veth_priv {
struct bpf_prog *_xdp_prog;
struct veth_rq *rq;
unsigned int requested_headroom;
+ unsigned int tx_coal_usecs; /* BQL completion coalescing */
};

struct veth_xdp_tx_bq {
@@ -265,7 +275,31 @@ static void veth_get_channels(struct net_device *dev,
static int veth_set_channels(struct net_device *dev,
struct ethtool_channels *ch);

+static int veth_get_coalesce(struct net_device *dev,
+ struct ethtool_coalesce *ec,
+ struct kernel_ethtool_coalesce *kernel_coal,
+ struct netlink_ext_ack *extack)
+{
+ struct veth_priv *priv = netdev_priv(dev);
+
+ ec->tx_coalesce_usecs = priv->tx_coal_usecs;
+ return 0;
+}
+
+static int veth_set_coalesce(struct net_device *dev,
+ struct ethtool_coalesce *ec,
+ struct kernel_ethtool_coalesce *kernel_coal,
+ struct netlink_ext_ack *extack)
+{
+ struct veth_priv *priv = netdev_priv(dev);
+
+ /* Paired with READ_ONCE in veth_xdp_rcv(). */
+ WRITE_ONCE(priv->tx_coal_usecs, ec->tx_coalesce_usecs);
+ return 0;
+}
+
static const struct ethtool_ops veth_ethtool_ops = {
+ .supported_coalesce_params = ETHTOOL_COALESCE_TX_USECS,
.get_drvinfo = veth_get_drvinfo,
.get_link = ethtool_op_get_link,
.get_strings = veth_get_strings,
@@ -275,6 +309,8 @@ static const struct ethtool_ops veth_ethtool_ops = {
.get_ts_info = ethtool_op_get_ts_info,
.get_channels = veth_get_channels,
.set_channels = veth_set_channels,
+ .get_coalesce = veth_get_coalesce,
+ .set_coalesce = veth_set_coalesce,
};

/* general routines */
@@ -937,13 +973,56 @@ static struct sk_buff *veth_xdp_rcv_skb(struct veth_rq *rq,
return NULL;
}

+static void veth_bql_maybe_complete(struct veth_bql_state *state,
+ struct netdev_queue *peer_txq,
+ u64 bql_flush_ns)
+{
+ u64 current_time;
+
+ /* There is no reason to complete with 0 and
+ * peer_txq could go away.
+ */
+ if (!state->n_bql || !peer_txq)
+ return;
+
+ current_time = sched_clock();
+
+ /* We complete if:
+ * 1. We reach bql_flush_ns.
+ * 2. We potentially have BQL starvation.
+ */
+ if (state->time + bql_flush_ns <= current_time ||
+ state->n_bql > peer_txq->dql.limit) {
+ netdev_tx_completed_queue(peer_txq, state->n_bql,
+ state->n_bql * VETH_BQL_UNIT);
+ state->time = current_time;
+ state->n_bql = 0;
+ }
+}
+
static int veth_xdp_rcv(struct veth_rq *rq, int budget,
struct veth_xdp_tx_bq *bq,
struct veth_stats *stats,
struct netdev_queue *peer_txq)
{
+ struct veth_bql_state *state = &rq->bql_state;
int i, done = 0, n_xdpf = 0;
void *xdpf[VETH_XDP_BATCH];
+ struct veth_priv *priv;
+ u64 bql_flush_ns;
+
+ priv = netdev_priv(rq->dev);
+
+ /* Paired with WRITE_ONCE() in veth_set_coalesce(). */
+ bql_flush_ns = (u64)READ_ONCE(priv->tx_coal_usecs) * 1000;
+
+ /* Clamp stored timestamp in case we migrated to a CPU with a behind
+ * sched_clock(); tries to reduce late BQL flushes.
+ */
+ state->time = min(state->time, sched_clock());
+
+ /* Flush completions that timed out since the previous NAPI poll. */
+ veth_bql_maybe_complete(state, peer_txq, bql_flush_ns);

for (i = 0; i < budget; i++) {
void *ptr = __ptr_ring_consume(&rq->xdp_ring);
@@ -968,12 +1047,10 @@ static int veth_xdp_rcv(struct veth_rq *rq, int budget,
}
} else {
/* ndo_start_xmit */
- bool bql_charged = veth_ptr_is_bql(ptr);
struct sk_buff *skb = veth_ptr_to_skb(ptr);

+ state->n_bql += veth_ptr_is_bql(ptr);
stats->xdp_bytes += skb->len;
- if (peer_txq && bql_charged)
- netdev_tx_completed_queue(peer_txq, 1, VETH_BQL_UNIT);

skb = veth_xdp_rcv_skb(rq, skb, bq, stats);
if (skb) {
@@ -983,6 +1060,7 @@ static int veth_xdp_rcv(struct veth_rq *rq, int budget,
napi_gro_receive(&rq->xdp_napi, skb);
}
}
+ veth_bql_maybe_complete(state, peer_txq, bql_flush_ns);
done++;
}

@@ -1091,6 +1169,9 @@ static int __veth_napi_enable_range(struct net_device *dev, int start, int end)
for (i = start; i < end; i++) {
struct veth_rq *rq = &priv->rq[i];

+ rq->bql_state.time = sched_clock();
+ rq->bql_state.n_bql = 0;
+
napi_enable(&rq->xdp_napi);
rcu_assign_pointer(priv->rq[i].napi, &priv->rq[i].xdp_napi);
}
@@ -1135,6 +1216,8 @@ static void veth_napi_del_range(struct net_device *dev, int start, int end)

rq->rx_notify_masked = false;
ptr_ring_cleanup(&rq->xdp_ring, veth_ptr_free);
+ rq->bql_state.n_bql = 0;
+ rq->bql_state.time = 0;
}

/* Reset BQL and wake stopped peer txqs. A concurrent veth_xmit()
@@ -1813,6 +1896,8 @@ static const struct xdp_metadata_ops veth_xdp_metadata_ops = {

static void veth_setup(struct net_device *dev)
{
+ struct veth_priv *priv = netdev_priv(dev);
+
ether_setup(dev);

dev->priv_flags &= ~IFF_TX_SKB_SHARING;
@@ -1838,6 +1923,8 @@ static void veth_setup(struct net_device *dev)
dev->max_mtu = ETH_MAX_MTU;
dev->watchdog_timeo = msecs_to_jiffies(16000);

+ priv->tx_coal_usecs = VETH_BQL_COAL_TX_USECS;
+
dev->hw_features = VETH_FEATURES;
dev->hw_enc_features = VETH_FEATURES;
dev->mpls_features = NETIF_F_HW_CSUM | NETIF_F_GSO_SOFTWARE;
--
2.43.0