Re: [PATCH net-next v5 3/5] veth: implement Byte Queue Limits (BQL) for latency reduction

From: Simon Schippers

Date: Tue May 19 2026 - 17:00:48 EST

On 5/12/26 23:55, Simon Schippers wrote:
> On 5/12/26 15:54, Jesper Dangaard Brouer wrote:
>>>> Nope, I'm using a bpftrace program to keep track of the inflight/limit
>>>> in a BPF hashmap. Reading from /sys will not be accurate.
>>>
>>> Ah nice.
>>
>> Add the option --hist to have both NAPI and BQL histograms printed when
>> the script ends. This will give you an accurate pattern of how inflight
>> and limit evolve.
>>
>>>>
>>>> I moved the selftests into a github repo [1] to allow us to collaborate
>>>> and evaluate the changes more easily. I explicitly kept the new BPF
>>>> based BQL tracking as a commit[2] for your benefit.
>>>>
>>>> [1] https://github.com/netoptimizer/veth-backpressure-performance-testing/tree/main/selftests
>>>>
>>>> [2] https://github.com/netoptimizer/veth-backpressure-performance-testing/commit/f25c5dc92977
>>>
>>> Thanks for sharing. After minor issues I was able to set it up
>>> (currently I am just using plain v5, will look at the coalescing patch
>>> when I find the time):
>>>
>>> Can confirm the latency reduction with the default settings, in my case
>>> 4.888ms to 0.241ms.
>>>
>>> With the same script I was also able to see a performance slow down:
>>> veth_bql_test_virtme.sh --qdisc fq_codel --nrules 0
>>> --> ~510 Kpps
>>> Same with --bql-disable
>>> --> ~570 Kpps
>>> --> 12% faster
>>>
>>
>> Thanks for running these benchmarks.
>>
>> Notice that --nrules 0 can easily result in no queuing (on average),
>> because the veth NAPI consumer is faster than the producer. You will
>> likely see BQL inflight=1 and a very low average latency reported by the
>> sink (remember, it is okay for the sink to get a high latency penalty as
>> long as ping latency remains low, as that shows AQM is working).
>
> I ran the benchmarks with --hist and I see what you mean.
> I have very similar results.
>
> Is Jonas' way [1] of modifying pktgen maybe the best option to ensure
> that the producer is faster than the consumer?
>
> [1] Link: https://lore.kernel.org/netdev/e8cdba04-aa9a-45c6-9807-8274b62920df@xxxxxxxxxxxxxx/
>
>> Hi, so what I found is that pktgen does not respect
>> __QUEUE_STATE_STACK_XOFF. So the test data above is invalid, since it
>> just sent packets even if BQL "stopped" the queue. So I patched
>> pktgen with the following:
>>
>> - if (unlikely(netif_xmit_frozen_or_drv_stopped(txq))) {
>> + if (unlikely(netif_xmit_frozen_or_stopped(txq))) {
>
>
> After thinking more about the implementation I see possible issues:
>
> 1. netdev_tx_completed_queue() never reports more than burst=64 packets:
>
> BQL only increments the limit if the queue was starved. That means:
> "The queue was over-limit in the last interval (the last time completion
> processing ran), and there is no more data in the queue (i.e. it’s
> empty)" [2]
> But as at most 64 packets are reported per call, the limit can only grow
> while it is <= 64 packets. After that it can only stay at a limit >64
> until the next decrease of the limit.
>
>
> 2. netdev_tx_completed_queue() is called at irregular intervals:
>
> If the consumer is slow, it is called approximately every tx_coal_usecs.
> But if the consumer is fast, it is called far more frequently, probably
> at irregular intervals depending on the scheduling.
> However, "BQL depends on periodic completion interrupts" [2].
>
> --> How about adding something like an interrupt that triggers every
> 10us and calls netdev_tx_completed_queue() with n_bql collected from
> (multiple) veth_xdp_rcv runs? That could solve 1. and 2.

Hi,

I worked on a new version (see attachment) that addresses both issues.

The major change is that instead of tracking the timestamp and packet
count as local variables in veth_xdp_rcv(), they are now stored
persistently in veth_rq as struct veth_bql_state. This allows completions
to accumulate across multiple NAPI poll calls, so
netdev_tx_completed_queue() can report more than 64 packets at once
(see point 1). For timestamps I use the (fast) sched_clock(), with a
min() clamp to avoid issues when NAPI switches between CPUs.
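
The clock handling mentioned above can be illustrated in plain userspace
C. This is only a sketch under assumptions: the nanosecond values stand in
for sched_clock() readings taken on two different CPUs, and
deadline_reached() and clamp_start() are hypothetical helper names (the
patch does the clamp inline with min() at veth_xdp_rcv() entry).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: has the coalescing window elapsed? */
static bool deadline_reached(uint64_t start_ns, uint64_t now_ns,
			     uint64_t window_ns)
{
	assert(window_ns > 0);	/* a zero window means coalescing is off */
	return now_ns >= start_ns + window_ns;
}

/* Hypothetical helper: never let the stored window start lie in the
 * future relative to the clock of the CPU we are running on now.
 */
static uint64_t clamp_start(uint64_t start_ns, uint64_t now_ns)
{
	return start_ns < now_ns ? start_ns : now_ns;
}
```

For example, with a 100 us window, a start stamped at 5000000 ns on one
CPU and a second CPU whose clock reads 4000000 ns, the unclamped deadline
is still not reached 150 us later (4150000 < 5100000). After
clamp_start() the window restarts at 4000000 ns and the same check
succeeds, so the extra delay is bounded by one window.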

For point 2, the coalescing deadline is now checked both before the
receive loop (to flush completions that timed out since the previous
poll) and after each consumed packet, making completion intervals more
regular. The intervals can still be shorter than
VETH_BQL_COAL_TX_USECS, but I guess this is fine.
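
To make the accumulation across polls easier to follow, here is a small
userspace model of it. It is a sketch, not the driver code: struct
coal_state, coal_flush(), coal_maybe_flush() and coal_poll() are
hypothetical names, time is passed in explicitly instead of calling
sched_clock(), and the "reported" field merely records what the patch
would hand to netdev_tx_completed_queue().

```c
#include <assert.h>
#include <stdint.h>

/* Userspace stand-in for the per-queue coalescing state. */
struct coal_state {
	uint64_t start_ns;	/* when the current window started */
	unsigned int pending;	/* completions batched in this window */
	unsigned int reported;	/* size of the most recently flushed batch */
};

/* Flush the whole batch and start a new window. */
static void coal_flush(struct coal_state *s, uint64_t now_ns)
{
	s->reported = s->pending;
	s->pending = 0;
	s->start_ns = now_ns;
}

/* Flush only if something is pending and the window has elapsed.
 * Returns 1 if a flush happened, 0 otherwise.
 */
static int coal_maybe_flush(struct coal_state *s, uint64_t now_ns,
			    uint64_t window_ns)
{
	if (s->pending && now_ns >= s->start_ns + window_ns) {
		coal_flush(s, now_ns);
		return 1;
	}
	return 0;
}

/* One simulated poll: check the deadline on entry, then again after
 * every consumed packet. Returns the number of flushes in this poll.
 */
static int coal_poll(struct coal_state *s, uint64_t *now_ns, int npkts,
		     uint64_t per_pkt_ns, uint64_t window_ns)
{
	int flushes;

	assert(npkts >= 0);
	flushes = coal_maybe_flush(s, *now_ns, window_ns);

	for (int i = 0; i < npkts; i++) {
		*now_ns += per_pkt_ns;	/* time spent on one packet */
		s->pending++;
		flushes += coal_maybe_flush(s, *now_ns, window_ns);
	}
	return flushes;
}
```

With a 100 us window and an assumed 1 us per packet, a first poll of 64
packets flushes nothing and leaves 64 pending. A second poll of 64
packets flushes once, when the window expires, and that single batch
contains 100 completions, i.e. more than one poll's budget.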

I also found out that the BQL limit correlates closely with
VETH_BQL_COAL_TX_USECS. It essentially reflects the latency we are
targeting. I raised the default to 100 µs to allow DQL to converge to a
higher limit (for reaching 255 in the testing below).
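
A rough back-of-the-envelope way to look at this correlation, assuming
the consumer drains at a steady rate and that the limit roughly tracks
the number of packets completed within one window (DQL's own dynamics
are ignored here). pkts_per_window() is a hypothetical helper, and the
570 Kpps input is just the --bql-disable figure quoted earlier in this
thread, used as an example value.

```c
#include <assert.h>
#include <stdint.h>

/* Packets a consumer running at rate_pps completes within one
 * coalescing window of window_usecs (integer division, rounds down).
 */
static uint64_t pkts_per_window(uint64_t rate_pps, uint64_t window_usecs)
{
	assert(rate_pps <= UINT64_MAX / (window_usecs ? window_usecs : 1));
	return rate_pps * window_usecs / 1000000;
}
```

At 570 Kpps this gives about 57 packets for a 100 us window and about 5
for a 10 us window, which is the direction of the effect described above.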

With the patched pktgen (respecting __QUEUE_STATE_STACK_XOFF), testing
shows:
- --nrules 0: DQL limit reaches (up to) ~255
- --nrules 10000: DQL limit converges to ~0 (with --gro-disable)

These results are what I would expect from a BQL algorithm, but more
testing is needed of course.

What do you think?

Thanks!

BTW: I think that this implementation could also work for other
software interfaces.

>
> [2] Link: https://medium.com/@tom_84912/byte-queue-limits-the-unauthorized-biography-61adc5730b83
>
>>
>> There is an important gotcha. We actually have micro-bursts of queuing
>> (likely due to scheduling noise). Reading BQL stats from /sys will show
>> BQL inflight=1, but when using the option --hist it is visible that
>> @inflight has a long tail (see signature below). The "qdisc" output
>> line also shows this happening via requeues increasing (approx. 17/sec
>> in a test with 567 Kpps). (This was with the time-based BQL impl.)
>
> I understand..
>
From 4af80d3db4b828c6ffdb81a44c8b6227b318e7da Mon Sep 17 00:00:00 2001
From: Simon Schippers <simon.schippers@xxxxxxxxxxxxxx>
Date: Tue, 12 May 2026 17:34:54 +0200
Subject: [PATCH] veth: time-based BQL completion coalescing via ethtool
 tx-usecs

Bufferbloat is fundamentally a latency problem -- what matters is the
time packets spend waiting in queues, as perceived by users and
applications. Base BQL completion coalescing on elapsed time rather
than packet counts to directly control queuing delay.

Add ethtool tx-usecs support to veth for tuning BQL completion
coalescing. Instead of completing BQL per-packet (which forces DQL to
limit=2 with high NAPI scheduling overhead) or per-NAPI-poll (which
over-buffers at budget=64), accumulate completions and flush them when
a configurable time threshold is exceeded. This lets DQL discover a
limit that bounds the actual queuing delay to the configured interval.

Coalescing state (timestamp + pending count) is stored in struct
veth_bql_state embedded in veth_rq, persisting across NAPI polls so
that completions whose deadline expired in a previous poll are released
at the start of the next one.

Two helpers centralise the logic:
  veth_bql_complete()       - flush + reset state
  veth_bql_maybe_complete() - flush only if deadline elapsed

A min(state->time, sched_clock()) guard at veth_xdp_rcv() entry
prevents the coalescing deadline from stalling if NAPI migrates to a
CPU whose sched_clock() is slightly behind the stored timestamp.

At end-of-function, pending completions are released immediately only
when the ring has drained and __QUEUE_STATE_STACK_XOFF is set, ensuring
DQL backpressure is lifted without bypassing coalescing in the common
case. The smp_rmb() is paired with smp_wmb() in __ptr_ring_produce().

bql_state is initialised at NAPI enable (sched_clock() timestamp) and
cleared at NAPI teardown, where netdev_tx_reset_queue() already resets
DQL on the peer side.

Default tx-usecs raised to 100 us (was 10 us) to allow DQL to converge
to a more useful limit. Setting tx-usecs to 0 disables coalescing and
falls back to per-packet completion (limit=2, lowest latency).

Usage:
  ethtool -C <veth-dev> tx-usecs 500   # 500us coalescing
  ethtool -C <veth-dev> tx-usecs 0     # per-packet (no coalescing)

Developed-by: Jesper Dangaard Brouer <hawk@xxxxxxxxxx>
Developed-by: Simon Schippers <simon.schippers@xxxxxxxxxxxxxx>
Signed-off-by: Simon Schippers <simon.schippers@xxxxxxxxxxxxxx>
---
drivers/net/veth.c | 95 +++++++++++++++++++++++++++++++++++++++++++++-
1 file changed, 93 insertions(+), 2 deletions(-)

diff --git a/drivers/net/veth.c b/drivers/net/veth.c
index 4103d298aa9b..c9d53ea9598b 100644
--- a/drivers/net/veth.c
+++ b/drivers/net/veth.c
@@ -43,6 +43,7 @@

 #define VETH_XDP_TX_BULK_SIZE 16
 #define VETH_XDP_BATCH 16
+#define VETH_BQL_COAL_TX_USECS 100 /* default tx-usecs for BQL batching */

 struct veth_stats {
 	u64 rx_drops;
@@ -62,6 +63,11 @@ struct veth_rq_stats {
 	struct u64_stats_sync syncp;
 };

+struct veth_bql_state {
+	u64 time; /* sched_clock() when current coalescing window started */
+	int n_bql; /* BQL completions batched in the current window */
+};
+
 struct veth_rq {
 	struct napi_struct xdp_napi;
 	struct napi_struct __rcu *napi; /* points to xdp_napi when the latter is initialized */
@@ -69,6 +75,7 @@ struct veth_rq {
 	struct bpf_prog __rcu *xdp_prog;
 	struct xdp_mem_info xdp_mem;
 	struct veth_rq_stats stats;
+	struct veth_bql_state bql_state;
 	bool rx_notify_masked;
 	struct ptr_ring xdp_ring;
 	struct xdp_rxq_info xdp_rxq;
@@ -81,6 +88,7 @@ struct veth_priv {
 	struct bpf_prog *_xdp_prog;
 	struct veth_rq *rq;
 	unsigned int requested_headroom;
+	unsigned int tx_coal_usecs; /* BQL completion coalescing */
 };

 struct veth_xdp_tx_bq {
@@ -265,7 +273,30 @@ static void veth_get_channels(struct net_device *dev,
 static int veth_set_channels(struct net_device *dev,
 			     struct ethtool_channels *ch);

+static int veth_get_coalesce(struct net_device *dev,
+			     struct ethtool_coalesce *ec,
+			     struct kernel_ethtool_coalesce *kernel_coal,
+			     struct netlink_ext_ack *extack)
+{
+	struct veth_priv *priv = netdev_priv(dev);
+
+	ec->tx_coalesce_usecs = priv->tx_coal_usecs;
+	return 0;
+}
+
+static int veth_set_coalesce(struct net_device *dev,
+			     struct ethtool_coalesce *ec,
+			     struct kernel_ethtool_coalesce *kernel_coal,
+			     struct netlink_ext_ack *extack)
+{
+	struct veth_priv *priv = netdev_priv(dev);
+
+	priv->tx_coal_usecs = ec->tx_coalesce_usecs;
+	return 0;
+}
+
 static const struct ethtool_ops veth_ethtool_ops = {
+	.supported_coalesce_params = ETHTOOL_COALESCE_TX_USECS,
 	.get_drvinfo = veth_get_drvinfo,
 	.get_link = ethtool_op_get_link,
 	.get_strings = veth_get_strings,
@@ -275,6 +306,8 @@ static const struct ethtool_ops veth_ethtool_ops = {
 	.get_ts_info = ethtool_op_get_ts_info,
 	.get_channels = veth_get_channels,
 	.set_channels = veth_set_channels,
+	.get_coalesce = veth_get_coalesce,
+	.set_coalesce = veth_set_coalesce,
 };

 /* general routines */
@@ -937,13 +970,45 @@ static struct sk_buff *veth_xdp_rcv_skb(struct veth_rq *rq,
 	return NULL;
 }

+static void veth_bql_complete(struct veth_bql_state *state,
+			      struct netdev_queue *peer_txq)
+{
+	netdev_tx_completed_queue(peer_txq, state->n_bql,
+				  state->n_bql * VETH_BQL_UNIT);
+	state->n_bql = 0;
+	state->time = sched_clock();
+}
+
+static void veth_bql_maybe_complete(struct veth_bql_state *state,
+				    struct netdev_queue *peer_txq,
+				    u64 coalescing_ns)
+{
+	if (state->n_bql && sched_clock() >= state->time + coalescing_ns)
+		veth_bql_complete(state, peer_txq);
+}
+
 static int veth_xdp_rcv(struct veth_rq *rq, int budget,
 			struct veth_xdp_tx_bq *bq,
 			struct veth_stats *stats,
 			struct netdev_queue *peer_txq)
 {
+	struct veth_bql_state *state = &rq->bql_state;
 	int i, done = 0, n_xdpf = 0;
 	void *xdpf[VETH_XDP_BATCH];
+	struct veth_priv *priv;
+	u64 bql_flush_ns;
+
+	priv = netdev_priv(rq->dev);
+	bql_flush_ns = (u64)priv->tx_coal_usecs * 1000;
+
+	/* Clamp the stored timestamp in case we migrated to a CPU whose
+	 * sched_clock() is behind; prevents the deadline from never firing.
+	 */
+	state->time = min(state->time, sched_clock());
+
+	/* Flush completions that timed out since the previous NAPI poll. */
+	if (peer_txq && bql_flush_ns)
+		veth_bql_maybe_complete(state, peer_txq, bql_flush_ns);

 	for (i = 0; i < budget; i++) {
 		void *ptr = __ptr_ring_consume(&rq->xdp_ring);
@@ -972,8 +1037,16 @@ static int veth_xdp_rcv(struct veth_rq *rq, int budget,
 			struct sk_buff *skb = veth_ptr_to_skb(ptr);

 			stats->xdp_bytes += skb->len;
-			if (peer_txq && bql_charged)
-				netdev_tx_completed_queue(peer_txq, 1, VETH_BQL_UNIT);
+			if (peer_txq && bql_charged) {
+				if (!bql_flush_ns) {
+					netdev_tx_completed_queue(peer_txq, 1,
+								  VETH_BQL_UNIT);
+				} else {
+					state->n_bql++;
+					veth_bql_maybe_complete(state, peer_txq,
+								bql_flush_ns);
+				}
+			}

 			skb = veth_xdp_rcv_skb(rq, skb, bq, stats);
 			if (skb) {
@@ -989,6 +1062,16 @@ static int veth_xdp_rcv(struct veth_rq *rq, int budget,
 	if (n_xdpf)
 		veth_xdp_rcv_bulk_skb(rq, xdpf, n_xdpf, bq, stats);

+	/* If the ring is now empty and the peer TX queue is stalled by DQL
+	 * backpressure, release completions immediately to unblock it.
+	 * The smp_rmb() is paired with smp_wmb() in __ptr_ring_produce().
+	 */
+	if (peer_txq && state->n_bql && ptr_ring_empty(&rq->xdp_ring)) {
+		smp_rmb();
+		if (test_bit(__QUEUE_STATE_STACK_XOFF, &peer_txq->state))
+			veth_bql_complete(state, peer_txq);
+	}
+
 	u64_stats_update_begin(&rq->stats.syncp);
 	rq->stats.vs.xdp_redirect += stats->xdp_redirect;
 	rq->stats.vs.xdp_bytes += stats->xdp_bytes;
@@ -1093,6 +1176,8 @@ static int __veth_napi_enable_range(struct net_device *dev, int start, int end)

 		napi_enable(&rq->xdp_napi);
 		rcu_assign_pointer(priv->rq[i].napi, &priv->rq[i].xdp_napi);
+		rq->bql_state.time = sched_clock();
+		rq->bql_state.n_bql = 0;
 	}

 	return 0;
@@ -1134,6 +1219,8 @@ static void veth_napi_del_range(struct net_device *dev, int start, int end)
 		struct veth_rq *rq = &priv->rq[i];

 		rq->rx_notify_masked = false;
+		rq->bql_state.n_bql = 0;
+		rq->bql_state.time = 0;
 		ptr_ring_cleanup(&rq->xdp_ring, veth_ptr_free);
 	}

@@ -1813,6 +1900,8 @@ static const struct xdp_metadata_ops veth_xdp_metadata_ops = {

 static void veth_setup(struct net_device *dev)
 {
+	struct veth_priv *priv = netdev_priv(dev);
+
 	ether_setup(dev);

 	dev->priv_flags &= ~IFF_TX_SKB_SHARING;
@@ -1838,6 +1927,8 @@ static void veth_setup(struct net_device *dev)
 	dev->max_mtu = ETH_MAX_MTU;
 	dev->watchdog_timeo = msecs_to_jiffies(16000);

+	priv->tx_coal_usecs = VETH_BQL_COAL_TX_USECS;
+
 	dev->hw_features = VETH_FEATURES;
 	dev->hw_enc_features = VETH_FEATURES;
 	dev->mpls_features = NETIF_F_HW_CSUM | NETIF_F_GSO_SOFTWARE;
--
2.43.0