Re: [PATCH net-next v5 3/5] veth: implement Byte Queue Limits (BQL) for latency reduction

From: Jesper Dangaard Brouer

Date: Thu May 07 2026 - 16:47:51 EST




On 07/05/2026 22.12, Simon Schippers wrote:
On 5/7/26 21:09, Jesper Dangaard Brouer wrote:


On 07/05/2026 16.46, Simon Schippers wrote:


On 5/7/26 16:34, Paolo Abeni wrote:
On 5/7/26 8:54 AM, Simon Schippers wrote:
On 5/5/26 15:21, hawk@xxxxxxxxxx wrote:
@@ -928,9 +968,13 @@ static int veth_xdp_rcv(struct veth_rq *rq, int budget,
}
} else {
/* ndo_start_xmit */
- struct sk_buff *skb = ptr;
+ bool bql_charged = veth_ptr_is_bql(ptr);
+ struct sk_buff *skb = veth_ptr_to_skb(ptr);
stats->xdp_bytes += skb->len;
+ if (peer_txq && bql_charged)
+ netdev_tx_completed_queue(peer_txq, 1, VETH_BQL_UNIT);

In the discussion with Jonas [1], I left a comment explaining why I think
this doesn’t work.


I've experimented with doing the "completion" at NAPI-end in
veth_poll(), but that resulted in BQL limit being 128 packets, which
leads to bad latency results (not acceptable).
(See detailed report later)


I still think that adding an option to modify the hard-coded
VETH_RING_SIZE is the way to go.


Not against being able to modify VETH_RING_SIZE, but I don't think it is
the solution here.

The simple solution is to configure the BQL limit_min:
`/sys/class/net/<dev>/queues/tx-N/byte_queue_limits/limit_min`

My experiments (below) find that limit_min=8 gives good performance.
We can simply set the default to 8, as this still allows userspace to
change it later if lower latency is preferred.
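For reference, adjusting this from userspace looks like the below
(device name veth0 and queue tx-0 are placeholders; limit_min acts as
a floor on the dynamically computed BQL limit, in the same units the
driver charges per packet):

```shell
# Placeholder names: substitute your veth device and tx queue.
# Set a BQL floor of 8 (matching the limit_min=8 discussed above):
echo 8 > /sys/class/net/veth0/queues/tx-0/byte_queue_limits/limit_min
# Read back the floor and the current dynamic limit:
cat /sys/class/net/veth0/queues/tx-0/byte_queue_limits/limit_min
cat /sys/class/net/veth0/queues/tx-0/byte_queue_limits/limit
```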

Thanks!

[1] Link: https://lore.kernel.org/netdev/e8cdba04-aa9a-45c6-9807-8274b62920df@xxxxxxxxxxxxxx/

In the above discussion a 20% regression is reported, which IMHO can't
be ignored. Still, the throughput figures in the data are extremely
low; something is possibly off?!? I would expect a few Mpps with
pktgen on top of veth, while the reported data is ~20-30 Kpps.

/P


The ~20-30 Kpps occur when thousands of iptables rules are applied and
a UDP userspace application is sending.

And there is a 20% pktgen regression (no iptables rules applied).


The pktgen test is a little dubious/weird, and Jonas had to modify
pktgen to test this. John Fastabend added a config option to pktgen
that allows benchmarking the egress qdisc path; it might be better to
use that. The script samples/pktgen/pktgen_bench_xmit_mode_queue_xmit.sh
demonstrates its usage.

If redoing the tests, can you adjust limit_min to see the effect?
/sys/class/net/<dev>/queues/tx-N/byte_queue_limits/limit_min

A 20% throughput regression is of course too much, but I will remind
us that adding a qdisc will "cost" some overhead; that is a
configuration choice. Our purpose here is to reduce bufferbloat and
latency, not to optimize for throughput.


I am pretty sure the reason is that the BQL limit is stuck at 2
packets (because netdev_tx_completed_queue() is always called with 1
packet, and not from an interrupt/timer with multiple packets...).


I've run a lot of experiments, which I had AI write up as a report; see the attachment. The TL;DR is that the best performance vs. latency tradeoff is defaulting the BQL/DQL limit_min to 8 packets.

I fear this patchset will stall forever if we keep searching for a perfect solution without any overhead. The qdisc layer will be a baseline overhead. The limit of 2 packets is actually the optimal darkbuffer queue size, but I acknowledge that this causes too many qdisc requeue events (leading to overhead). I suggest adding another patch in V6 that defaults limit_min to 8 (a separate patch, to make it easier to revert/adjust later).

I've talked with Jonas, and we want to experiment with different solutions to make BQL/DQL work better with virtual devices.

This patchset helps our (production) use-case reduce mice-flow latency
under load from approx 22ms to 1.3ms. Since the consumer namespace is
the bottleneck, the requeue overhead is negligible in comparison.

-Jesper

First of all, thanks for your work; I really see the advantages of
avoiding bufferbloat :)

But the key of the BQL algorithm, which is the *dynamic* adaptation of
the limit, is not working. Always calling netdev_tx_completed_queue()
with 1 packet results in a static limit of 2 packets (as seen in
Jonas' measurements), which you force up to 8 packets.

So in the end this patchset has the same effect as just setting
VETH_RING_SIZE to 8 (and providing an option to change this value).


I've coded up a time-based BQL implementation, see attachment.
WDYT?

--Jesper
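[Editor's note: for illustration only, a user-space Python sketch of
the flush logic in the attached patch. The class and method names are
invented for this sketch; the kernel code below uses local_clock()
timestamps and netdev_tx_completed_queue() instead.]

```python
class BqlCoalescer:
    """User-space model of time-based BQL completion coalescing.

    flush_ns == 0 models per-packet completion (no coalescing);
    otherwise completions accumulate until the elapsed time since the
    first pending completion reaches flush_ns, then flush as a batch.
    """

    def __init__(self, flush_ns):
        self.flush_ns = flush_ns   # models tx_coal_usecs * 1000
        self.n_pending = 0         # models n_bql
        self.first_ts = None       # models bql_flush_time
        self.flushed = []          # batch sizes reported so far

    def complete_one(self, now_ns):
        if self.flush_ns == 0:
            # Per-packet mode: report each completion immediately.
            self.flushed.append(1)
            return
        self.n_pending += 1
        if self.first_ts is None:
            self.first_ts = now_ns          # start the time window
        elif now_ns - self.first_ts >= self.flush_ns:
            self.flushed.append(self.n_pending)  # flush the batch
            self.n_pending = 0
            self.first_ts = None

    def flush_remaining(self):
        # Models the flush at the end of the NAPI poll loop.
        if self.n_pending:
            self.flushed.append(self.n_pending)
            self.n_pending = 0
            self.first_ts = None
```

With a 10us window, three packets arriving at t=0, 5us, and 12us are
reported as one batch of 3, because the third arrival crosses the
threshold; a trailing packet left at poll-end is flushed separately.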




veth: time-based BQL completion coalescing via ethtool tx-usecs

From: Jesper Dangaard Brouer <hawk@xxxxxxxxxx>

Bufferbloat is fundamentally a latency problem -- what matters is the
time packets spend waiting in queues, as perceived by users and
applications. Base BQL completion coalescing on elapsed time rather
than packet counts to directly control queuing delay.

Add ethtool tx-usecs support to veth for tuning BQL completion
coalescing. Instead of completing BQL per-packet (which forces DQL to
limit=2 with high NAPI scheduling overhead) or per-NAPI-poll (which
over-buffers at budget=64), accumulate completions and flush them when
a configurable time threshold is exceeded. This lets DQL discover a
limit that bounds the actual queuing delay to the configured interval.

The default of 10 usecs (VETH_BQL_COAL_TX_USECS) provides a good
balance: DQL converges to a small limit that keeps queuing delay
bounded while allowing efficient NAPI batch processing. Setting
tx-usecs to 0 disables coalescing and falls back to per-packet
completion (limit=2, lowest latency, highest NAPI overhead).

Usage:
ethtool -C <veth-dev> tx-usecs 500 # 500us coalescing
ethtool -C <veth-dev> tx-usecs 0 # per-packet (no coalescing)

Uses local_clock() (rdtsc on x86, ~20ns) for sub-jiffy resolution.

Signed-off-by: Jesper Dangaard Brouer <hawk@xxxxxxxxxx>
---
drivers/net/veth.c | 59 ++++++++++++++++++++++++++++++++++++++++++++++++++--
1 file changed, 57 insertions(+), 2 deletions(-)

diff --git a/drivers/net/veth.c b/drivers/net/veth.c
index 4103d298aa9b..6035f7ec92b4 100644
--- a/drivers/net/veth.c
+++ b/drivers/net/veth.c
@@ -43,6 +43,7 @@

#define VETH_XDP_TX_BULK_SIZE 16
#define VETH_XDP_BATCH 16
+#define VETH_BQL_COAL_TX_USECS 10 /* default tx-usecs for BQL batching */

struct veth_stats {
u64 rx_drops;
@@ -81,6 +82,7 @@ struct veth_priv {
struct bpf_prog *_xdp_prog;
struct veth_rq *rq;
unsigned int requested_headroom;
+ unsigned int tx_coal_usecs; /* BQL completion coalescing */
};

struct veth_xdp_tx_bq {
@@ -265,7 +267,30 @@ static void veth_get_channels(struct net_device *dev,
static int veth_set_channels(struct net_device *dev,
struct ethtool_channels *ch);

+static int veth_get_coalesce(struct net_device *dev,
+ struct ethtool_coalesce *ec,
+ struct kernel_ethtool_coalesce *kernel_coal,
+ struct netlink_ext_ack *extack)
+{
+ struct veth_priv *priv = netdev_priv(dev);
+
+ ec->tx_coalesce_usecs = priv->tx_coal_usecs;
+ return 0;
+}
+
+static int veth_set_coalesce(struct net_device *dev,
+ struct ethtool_coalesce *ec,
+ struct kernel_ethtool_coalesce *kernel_coal,
+ struct netlink_ext_ack *extack)
+{
+ struct veth_priv *priv = netdev_priv(dev);
+
+ priv->tx_coal_usecs = ec->tx_coalesce_usecs;
+ return 0;
+}
+
static const struct ethtool_ops veth_ethtool_ops = {
+ .supported_coalesce_params = ETHTOOL_COALESCE_TX_USECS,
.get_drvinfo = veth_get_drvinfo,
.get_link = ethtool_op_get_link,
.get_strings = veth_get_strings,
@@ -275,6 +300,8 @@ static const struct ethtool_ops veth_ethtool_ops = {
.get_ts_info = ethtool_op_get_ts_info,
.get_channels = veth_get_channels,
.set_channels = veth_set_channels,
+ .get_coalesce = veth_get_coalesce,
+ .set_coalesce = veth_set_coalesce,
};

/* general routines */
@@ -942,8 +969,14 @@ static int veth_xdp_rcv(struct veth_rq *rq, int budget,
struct veth_stats *stats,
struct netdev_queue *peer_txq)
{
+ u64 bql_flush_ns, bql_flush_time = 0;
int i, done = 0, n_xdpf = 0;
void *xdpf[VETH_XDP_BATCH];
+ struct veth_priv *priv;
+ int n_bql = 0;
+
+ priv = netdev_priv(rq->dev);
+ bql_flush_ns = (u64)priv->tx_coal_usecs * 1000;

for (i = 0; i < budget; i++) {
void *ptr = __ptr_ring_consume(&rq->xdp_ring);
@@ -972,8 +1005,23 @@ static int veth_xdp_rcv(struct veth_rq *rq, int budget,
struct sk_buff *skb = veth_ptr_to_skb(ptr);

stats->xdp_bytes += skb->len;
- if (peer_txq && bql_charged)
- netdev_tx_completed_queue(peer_txq, 1, VETH_BQL_UNIT);
+ if (peer_txq && bql_charged) {
+ if (!bql_flush_ns) {
+ netdev_tx_completed_queue(peer_txq, 1,
+ VETH_BQL_UNIT);
+ } else {
+ u64 now = local_clock();
+ n_bql++;
+ if (!bql_flush_time) {
+ bql_flush_time = now;
+ } else if (now - bql_flush_time >= bql_flush_ns) {
+ netdev_tx_completed_queue(peer_txq, n_bql,
+ n_bql * VETH_BQL_UNIT);
+ n_bql = 0;
+ bql_flush_time = 0;
+ }
+ }
+ }

skb = veth_xdp_rcv_skb(rq, skb, bq, stats);
if (skb) {
@@ -989,6 +1037,9 @@ static int veth_xdp_rcv(struct veth_rq *rq, int budget,
if (n_xdpf)
veth_xdp_rcv_bulk_skb(rq, xdpf, n_xdpf, bq, stats);

+ if (peer_txq && n_bql)
+ netdev_tx_completed_queue(peer_txq, n_bql, n_bql * VETH_BQL_UNIT);
+
u64_stats_update_begin(&rq->stats.syncp);
rq->stats.vs.xdp_redirect += stats->xdp_redirect;
rq->stats.vs.xdp_bytes += stats->xdp_bytes;
@@ -1813,6 +1864,8 @@ static const struct xdp_metadata_ops veth_xdp_metadata_ops = {

static void veth_setup(struct net_device *dev)
{
+ struct veth_priv *priv = netdev_priv(dev);
+
ether_setup(dev);

dev->priv_flags &= ~IFF_TX_SKB_SHARING;
@@ -1838,6 +1891,8 @@ static void veth_setup(struct net_device *dev)
dev->max_mtu = ETH_MAX_MTU;
dev->watchdog_timeo = msecs_to_jiffies(16000);

+ priv->tx_coal_usecs = VETH_BQL_COAL_TX_USECS;
+
dev->hw_features = VETH_FEATURES;
dev->hw_enc_features = VETH_FEATURES;
dev->mpls_features = NETIF_F_HW_CSUM | NETIF_F_GSO_SOFTWARE;