Re: [PATCH net-next v6 5/5] veth: time-based BQL completion coalescing via ethtool tx-usecs

From: Jesper Dangaard Brouer

Date: Tue Jun 09 2026 - 10:06:41 EST

On 08/06/2026 16.21, Simon Schippers wrote:

On 6/8/26 15:13, Jonas Köppeler wrote:

On 6/8/26 3:04 PM, Jesper Dangaard Brouer wrote:

On 08/06/2026 12.38, Simon Schippers wrote:

On 5/27/26 15:54, hawk@xxxxxxxxxx wrote:

From: Simon Schippers <simon.schippers@xxxxxxxxxxxxxx>

Per-packet BQL completion forces DQL to converge on limit=2, causing
excessive NAPI scheduling overhead and qdisc requeues.

Accumulate BQL completions and flush them when a configurable time
threshold is exceeded, letting DQL discover a limit that bounds actual
queuing delay to the configured interval. Coalescing state persists
across NAPI polls in struct veth_rq so completions can accumulate
beyond a single budget=64 cycle.

Add ethtool tx-usecs support for runtime tuning. Default is 100 us;
setting tx-usecs to 0 disables coalescing and falls back to per-packet
completion.

ethtool -C <veth-dev> tx-usecs 500 # 500us coalescing
ethtool -C <veth-dev> tx-usecs 0 # per-packet (no coalescing)

Co-developed-by: Jesper Dangaard Brouer <hawk@xxxxxxxxxx>
Signed-off-by: Jesper Dangaard Brouer <hawk@xxxxxxxxxx>
Signed-off-by: Simon Schippers <simon.schippers@xxxxxxxxxxxxxx>
---

I found the issue that n_bql may become infinitly large if producer
and consumer have the same speed (and tx_usecs is large). It could
cause a potential BUG_ON if n_bql grows beyond INT_MAX...
Also I figured that no hardware BQL driver ever completes more than
BQL limit many elements.

Therefore, I propose a simpler logic (see attachment) that completes
either on the usual bql_flush_ns or if n_bql > dql.limit.
If n_bql > dql.limit then we either have the case above that the
producer is as fast as the consumer or we have BQL starvation.

if (state->time + bql_flush_ns <= current_time ||
state->n_bql > peer_txq->dql.limit) {

It must be n_bql *bigger than* dql.limit because the producer will
always exceed the limit before it stops, see netdev_tx_sent_queue().
It is fast because peer_txq->dql.limit is in the cacheline of the
completion path, see dynamic_queue_limits.h.

Another advantage is that we avoid the snippet checking for empty
and BQL stopped which requires an smp_rmb() and an test_bit().

Apart from that I:
- Always call veth_bql_maybe_complete() in the for loop to have
more accurate completion intervals when having mixed XDP and
non-XDP packets.
- Made it so tx_usecs = 0 is now also a normal case.
- Change the type of n_bql to uint instead of int.
- Added _ONCE() for tx_coalesce_usecs as suggested by Paolo.
- Moved the bql_state init in __veth_napi_enable_range() in front
of napi_enable() to avoid a race (Sashiko).
- Moved the bql_state reset in veth_napi_del_range() after the
ptr_ring_cleanup() (probably does not matter but makes sense to me)

Benchmarks look just fine, see commit message.

WDYT?

Looks good to me, I will use this in my V7 patchset.

A bike-shedding issue: We change the coalescing parameters for the veth
net_device, but should this be a TX or RX parameter?

Hi,

The results look a little bit suspicious, that in some cases the p99 is smaller than the average, which can happen if you have really large max values. It feels something is off with the methodology. I think we should drop the avg, and include the max value. This would give a better picture.

Luckily I saved the raw test result from /tmp (the raw ping output
is included).

extract_ping_p99() seems to be broken. It has issues ordering the
numbers. So if there are 2 or 3 numbers after the point I think.
avg ping is fine.

Claude Sonnet wrote a script for me to recalculate the p99 pings
from the raw ping output).

For programming switch to Claude Opus, as Sonnet cannot code IMHO.
(Claude Opus wrote the selftest we placed in [1])

Is your script still based on our github[1] selftest ?
If so, then please make some PR(s) against my repo.
I want this to be easy reproducible for others.

[1] https://github.com/netoptimizer/veth-backpressure-performance-testing

Below is the new result. Not 100% sure but this seems right now.
Most pings are the same.

Ping RTT ms (p99)
===========================================================================
nrules | 0us | 50us | 100us | 500us | 1000us | 5000us | 10000us || stock
-------+-------+-------+-------+-------+--------+--------+---------++------
0 | 0.028 | 0.115 | 0.159 | 0.161 | 0.163 | 0.161 | 0.163 || 0.154
1 | 0.027 | 0.123 | 0.169 | 0.172 | 0.169 | 0.173 | 0.172 || 0.169
10 | 0.030 | 0.117 | 0.189 | 0.193 | 0.195 | 0.192 | 0.196 || 0.186
100 | 0.045 | 0.134 | 0.233 | 0.368 | 0.365 | 0.369 | 0.361 || 0.358
1000 | 0.230 | 0.300 | 0.412 | 1.26 | 2.11 | 2.12 | 2.13 || 2.07
10000 | 2.04 | 2.55 | 2.54 | 2.72 | 3.77 | 12.1 | 20.1 || 20.3

Is state->n_bql += veth_ptr_is_bql(ptr) something that is commonly done in the kernel? It relies on bool normalization which feels a bit implicit to me.

Ok, we could swap for something like

if (veth_ptr_is_bql(ptr))
state->n_bql+;

I'll see if I can adjust the V7 patch to this feedback.

For physical NICs adjusting TX coalescing will affect the BQL as this
affect the TX completion of the transmitted packets. For veth, it is the
veth-peer's RX NAPI that is "completing" or emptying the ptr_ring, which
is where this patch adds netdev_tx_completed_queue calls for BQL.
Any opinions on the "TX" or "RX" color?

Personally I would also say TX.

Okay, we seem to have more votes for TX :-)

I think I would prefer to configure it on the tx dev, and the recv
side gets the value from the peer. Maybe something like this.

I'm considering if we should simplify the config by having veth pairs have the same tx-coal value. WDYT?

@@ -997,11 +998,12 @@ static int veth_xdp_rcv(struct veth_rq *rq, int budget,
struct veth_bql_state *state = &rq->bql_state;
int i, done = 0, n_xdpf = 0;
void *xdpf[VETH_XDP_BATCH];

Why this deletion?

- struct veth_priv *priv;
- u64 bql_flush_ns;
+ u64 bql_flush_ns = 0;

- priv = netdev_priv(rq->dev);
- bql_flush_ns = (u64)priv->tx_coal_usecs * 1000;

I see we should check if peer_txq exists.

+ if (peer_txq) {
+ struct veth_priv *peer_priv = netdev_priv(peer_txq->dev);
+ bql_flush_ns = (u64)READ_ONCE(peer_priv->tx_coal_usecs) * 1000;
+ }

/* Clamp stored timestamp in case we migrated to a CPU with a behind
* sched_clock(); prevents the deadline from never firing.

- Jonas

--Jesper