Re: [PATCH net-next v6 5/5] veth: time-based BQL completion coalescing via ethtool tx-usecs

From: Jonas Köppeler

Date: Mon Jun 08 2026 - 09:15:20 EST

On 6/8/26 3:04 PM, Jesper Dangaard Brouer wrote:

On 08/06/2026 12.38, Simon Schippers wrote:

On 5/27/26 15:54, hawk@xxxxxxxxxx wrote:

From: Simon Schippers <simon.schippers@xxxxxxxxxxxxxx>

Per-packet BQL completion forces DQL to converge on limit=2, causing
excessive NAPI scheduling overhead and qdisc requeues.

Accumulate BQL completions and flush them when a configurable time
threshold is exceeded, letting DQL discover a limit that bounds actual
queuing delay to the configured interval. Coalescing state persists
across NAPI polls in struct veth_rq so completions can accumulate
beyond a single budget=64 cycle.

Add ethtool tx-usecs support for runtime tuning. Default is 100 us;
setting tx-usecs to 0 disables coalescing and falls back to per-packet
completion.

   ethtool -C <veth-dev> tx-usecs 500 # 500us coalescing
   ethtool -C <veth-dev> tx-usecs 0    # per-packet (no coalescing)

Co-developed-by: Jesper Dangaard Brouer <hawk@xxxxxxxxxx>
Signed-off-by: Jesper Dangaard Brouer <hawk@xxxxxxxxxx>
Signed-off-by: Simon Schippers <simon.schippers@xxxxxxxxxxxxxx>
---

I found the issue that n_bql may become infinitly large if producer
and consumer have the same speed (and tx_usecs is large). It could
cause a potential BUG_ON if n_bql grows beyond INT_MAX...
Also I figured that no hardware BQL driver ever completes more than
BQL limit many elements.

Therefore, I propose a simpler logic (see attachment) that completes
either on the usual bql_flush_ns or if n_bql > dql.limit.
If n_bql > dql.limit then we either have the case above that the
producer is as fast as the consumer or we have BQL starvation.

if (state->time + bql_flush_ns <= current_time ||
    state->n_bql > peer_txq->dql.limit) {

It must be n_bql *bigger than* dql.limit because the producer will
always exceed the limit before it stops, see netdev_tx_sent_queue().
It is fast because peer_txq->dql.limit is in the cacheline of the
completion path, see dynamic_queue_limits.h.

Another advantage is that we avoid the snippet checking for empty
and BQL stopped which requires an smp_rmb() and an test_bit().

Apart from that I:
- Always call veth_bql_maybe_complete() in the for loop to have
   more accurate completion intervals when having mixed XDP and
   non-XDP packets.
- Made it so tx_usecs = 0 is now also a normal case.
- Change the type of n_bql to uint instead of int.
- Added _ONCE() for tx_coalesce_usecs as suggested by Paolo.
- Moved the bql_state init in __veth_napi_enable_range() in front
   of napi_enable() to avoid a race (Sashiko).
- Moved the bql_state reset in veth_napi_del_range() after the
   ptr_ring_cleanup() (probably does not matter but makes sense to me)

Benchmarks look just fine, see commit message.

WDYT?

Looks good to me, I will use this in my V7 patchset.

A bike-shedding issue: We change the coalescing parameters for the veth
net_device, but should this be a TX or RX parameter?

Hi,

The results look a little bit suspicious, that in some cases the p99 is smaller than the average, which can happen if you have really large max values. It feels something is off with the methodology. I think we should drop the avg, and include the max value. This would give a better picture.

Is state->n_bql += veth_ptr_is_bql(ptr) something that is commonly done in the kernel? It relies on bool normalization which feels a bit implicit to me.

For physical NICs adjusting TX coalescing will affect the BQL as this
affect the TX completion of the transmitted packets. For veth, it is the
veth-peer's RX NAPI that is "completing" or emptying the ptr_ring, which
is where this patch adds netdev_tx_completed_queue calls for BQL.
Any opinions on the "TX" or "RX" color?

I think I would prefer to configure it on the tx dev, and the recv side gets the value from the peer. Maybe something like this.

@@ -997,11 +998,12 @@ static int veth_xdp_rcv(struct veth_rq *rq, int budget,
struct veth_bql_state *state = &rq->bql_state;
int i, done = 0, n_xdpf = 0;
void *xdpf[VETH_XDP_BATCH];
- struct veth_priv *priv;
- u64 bql_flush_ns;
+ u64 bql_flush_ns = 0;

- priv = netdev_priv(rq->dev);
- bql_flush_ns = (u64)priv->tx_coal_usecs * 1000;
+ if (peer_txq) {
+ struct veth_priv *peer_priv = netdev_priv(peer_txq->dev);
+ bql_flush_ns = (u64)READ_ONCE(peer_priv->tx_coal_usecs) * 1000;
+ }

/* Clamp stored timestamp in case we migrated to a CPU with a behind
* sched_clock(); prevents the deadline from never firing.

- Jonas