Re: [PATCH net-next v5 3/5] veth: implement Byte Queue Limits (BQL) for latency reduction

From: Jonas Köppeler

Date: Fri May 22 2026 - 12:31:26 EST

On 5/22/26 10:41, Simon Schippers wrote:

On 5/22/26 09:14, Jonas Köppeler wrote:

On 5/19/26 10:51 PM, Simon Schippers wrote:

On 5/12/26 23:55, Simon Schippers wrote:

On 5/12/26 15:54, Jesper Dangaard Brouer wrote:

Nope, I'm using a bpftrace program to keep track of the inflight/limit
in a BPF hashmap. Reading from /sys will not be accurate.

Ah nice.

Add the option --hist to have both NAPI and BQL histograms printed when
the script ends. This will give you an accurate pattern of how inflight
and limit evolve.

I moved the selftests into a github repo [1] to allow us to collaborate
and evaluate the changes more easily. I explicitly kept the new BPF
based BQL tracking as a commit[2] for your benefit.

[1] https://github.com/netoptimizer/veth-backpressure-performance-testing/tree/main/selftests

[2] https://github.com/netoptimizer/veth-backpressure-performance-testing/commit/f25c5dc92977

Thanks for sharing. After minor issues I was able to set it up
(currently I am just using plain v5, will look at the coalescing patch
when I find the time):

Can confirm the latency reduction with the default settings, in my case
4.888ms to 0.241ms.

With the same script I was also able to see a performance slowdown:
veth_bql_test_virtme.sh --qdisc fq_codel --nrules 0
--> ~510 Kpps
Same with --bql-disable
--> ~570 Kpps
--> 12% faster

Thanks for running these benchmarks.

Notice that --nrules 0 can easily result in no-queuing (on average),
because the veth NAPI consumer is faster than the producer. You will
likely see BQL inflight=1 and sink reported avg latency very low
(remember it is okay that the sink gets a high latency penalty as long as
ping latency remains low, as that shows AQM is working).

I ran the benchmarks with --hist and I see what you mean.
I have very similar results.

Is Jonas's way [1] of modifying pktgen maybe the best option to ensure
that the producer is faster than the consumer?

[1] Link: https://lore.kernel.org/netdev/e8cdba04-aa9a-45c6-9807-8274b62920df@xxxxxxxxxxxxxx/

Hi, so what I found is that pktgen does not respect
__QUEUE_STATE_STACK_XOFF. So the test data above is invalid, since it
just kept sending packets even when BQL had "stopped" the queue. So I
patched pktgen with the following:

- if (unlikely(netif_xmit_frozen_or_drv_stopped(txq))) {
+ if (unlikely(netif_xmit_frozen_or_stopped(txq))) {
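For illustration, here is a minimal userspace sketch of why that one-line change matters. It assumes the usual split of the TX queue state into a driver stop bit, a stack stop bit (the one BQL sets) and a frozen bit. The enum values and function names below are simplified stand-ins chosen for this example, not the kernel definitions themselves.

```c
#include <assert.h>
#include <stdbool.h>

/* Userspace model of the TX queue state bits. The names loosely mirror the
 * kernel's queue state flags, but the values here are illustrative only. */
enum {
	QUEUE_STATE_DRV_XOFF   = 1u << 0, /* driver stopped the queue */
	QUEUE_STATE_STACK_XOFF = 1u << 1, /* stack (BQL) stopped the queue */
	QUEUE_STATE_FROZEN     = 1u << 2, /* queue is frozen */
};

/* Models the check pktgen used before the patch: a BQL stop is ignored. */
bool frozen_or_drv_stopped(unsigned int state)
{
	return (state & (QUEUE_STATE_DRV_XOFF | QUEUE_STATE_FROZEN)) != 0;
}

/* Models the patched check: a stack-level (BQL) stop is honoured as well. */
bool frozen_or_stopped(unsigned int state)
{
	return (state & (QUEUE_STATE_DRV_XOFF | QUEUE_STATE_STACK_XOFF |
			 QUEUE_STATE_FROZEN)) != 0;
}
```

With only the stack bit set, the first check reports the queue as open and the second reports it as stopped, which would explain why an unpatched pktgen keeps transmitting past the BQL limit.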

After thinking more about the implementation I see possible issues:

1. netdev_tx_completed_queue() never reports more than burst=64 packets:

BQL only increments the limit if the queue was starved. That means:
"The queue was over-limit in the last interval (the last time completion
processing ran), and there is no more data in the queue (i.e. it’s
empty)" [2]
But as only 64 packets are reported at max, the queue can only grow when
it is <= 64 packets. And then it can only stay at a limit >64 until the
next decrease of the limit.

2. netdev_tx_completed_queue() is called in irregular intervals:

If the consumer is slow it is called approximately every tx_coal_usecs.
But if the consumer is fast it is called far more frequently, probably
at irregular intervals depending on the scheduling.
However, "BQL depends on periodic completion interrupts" [2].

--> How about adding something like an interrupt that triggers every
10us and calls netdev_tx_completed_queue() with n_bql collected from
(multiple) veth_xdp_rcv runs? That could solve 1. and 2.

Hi,

I worked on a new version (see attachment) that addresses both issues.

The major change is that instead of tracking the timestamp and packet
count as local variables in veth_xdp_rcv(), they are now stored
persistently in veth_rq as struct veth_bql_state. This allows completions
to accumulate across multiple NAPI poll calls, so
netdev_tx_completed_queue() can report more than 64 packets at once
(see point 1). To get the time I am using (the fast) sched_clock() with
a trick to avoid issues when switching between CPUs.

For point 2, the coalescing deadline is now checked both before the
receive loop (to flush completions that timed out since the previous
poll) and after each consumed packet, making completion intervals more
regular. Still the intervals can be smaller than
VETH_BQL_COAL_TX_USECS, but I guess this is fine.
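The accumulation described above can be sketched roughly as follows. This is a userspace model under stated assumptions, not the attached patch: struct bql_coal_state, bql_coal_add(), bql_coal_flush() and COAL_NS are hypothetical names, and the real code would pass the flushed count to netdev_tx_completed_queue() instead of returning it.

```c
#include <assert.h>
#include <stdint.h>

#define COAL_NS 100000ULL /* hypothetical 100 us coalescing window */

/* Hypothetical per-queue state that persists across poll calls. */
struct bql_coal_state {
	uint64_t start_ns; /* time at which the current window opened */
	unsigned int pkts; /* completions accumulated in this window */
};

/* Record one consumed packet; the first packet opens a new window. */
void bql_coal_add(struct bql_coal_state *s, uint64_t now_ns)
{
	assert(s);
	if (s->pkts == 0)
		s->start_ns = now_ns;
	s->pkts++;
}

/* Return the number of completions to report now, or 0 while the window
 * is still open or nothing is pending. Intended to be called before the
 * receive loop and after each consumed packet. */
unsigned int bql_coal_flush(struct bql_coal_state *s, uint64_t now_ns)
{
	unsigned int pkts;

	assert(s);
	if (s->pkts == 0 || now_ns - s->start_ns < COAL_NS)
		return 0;

	pkts = s->pkts;
	s->pkts = 0;
	return pkts;
}
```

Because the state outlives a single poll, two polls of 64 packets each that fall into the same window are reported as one completion of 128 packets.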

I also found out that the BQL limit correlates closely with
VETH_BQL_COAL_TX_USECS. It essentially reflects the latency we are
targeting. I raised the default to 100 µs to allow DQL to converge to a
higher limit (for reaching 255 in the testing below).

With the patched pktgen (respecting __QUEUE_STATE_STACK_XOFF), testing
shows:
- --nrules 0: DQL limit reaches (up to) ~255
- --nrules 10000: DQL limit converges to ~0 (with --gro-disable)

These results are what I would expect from a BQL algorithm, but more
testing is needed of course.

What do you think?

Hi,

This is exactly what I had in mind for implementing the BQL algorithm
in this case. I did some testing with pktgen of this patch and also
compared it to the v5 version.

You can find an extension of the benchmark script with pktgen here [1],
as well as a wrapper script (veth_bql_bench.sh) to run the test script
with and without --bql-disable to report the difference. I also
configured pktgen to use the qdisc as suggested by Jesper.

Great, I will use your pktgen solution from now on.

Didn't know about the qdisc option, is there a performance difference
with/without it? Or is it to have ping working next to pktgen?

Did not see any big difference performance-wise, but as you say, ping
works better alongside pktgen then.

Consider opening a pull request :)

Note: bpftrace needs to be disabled, otherwise it becomes the
bottleneck (at least on my machine) and pktgen throughput is halved
when enabled.

Good to know.

Here are the results:

v5 (not time-based):
--nrules 0 --pktgen --no-bpftrace
========================================
Results (average over 10 runs):
========================================
                      BQL on     BQL off
---                   ------     -------
Throughput (pps)      1980871    2169898
Ping RTT avg (ms)     0.065      0.162
Throughput diff       -8.7%      // BQL 8.7% lower throughput
RTT diff              -59.9%     // BQL 60% lower latency
========================================

Simon's time-based version:

Test args: --nrules 0 --pktgen --no-bpftrace
========================================
Results (average over 10 runs):
========================================
                      BQL on     BQL off
---                   ------     -------
Throughput (pps)      2166335    2153398
Ping RTT avg (ms)     0.165      0.165
Throughput diff       0.6%
RTT diff              0.0%

--pktgen --no-bpftrace --nrules 3500
========================================
Results (average over 10 runs):
========================================
                      BQL on     BQL off
---                   ------     -------
Throughput (pps)      28569      28696
Ping RTT avg (ms)     1.327      8.409
Throughput diff       -0.4%
RTT diff              -84.2%

I think we should run benchmarks against the stock net-next to
be safe.

--nrules 0 --pktgen --no-bpftrace
========================================
Results (average over 10 runs):
========================================
                      net-next
---                   --------
Throughput (pps)      2285421
Ping RTT avg (ms)     0.161

(slightly adjusted the output to better communicate the results)

So in my case this means the time-based BQL implementation has ~5% lower
throughput compared to net-next. But please double check.
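As a quick check of the percentages quoted in this thread, the relative difference can be recomputed from the reported averages. The sketch below assumes the diff is defined as (value - baseline) / baseline; the benchmark wrapper may compute it differently, and rel_diff_pct() is a name made up for this example.

```c
#include <assert.h>

/* Relative difference of value against baseline, in percent.
 * Assumed definition: (value - baseline) / baseline * 100. */
double rel_diff_pct(double value, double baseline)
{
	assert(baseline != 0.0);
	return (value - baseline) / baseline * 100.0;
}
```

Under that assumption, rel_diff_pct(2166335, 2285421) comes out at roughly -5.2, and rel_diff_pct(1980871, 2169898) at roughly -8.7, in line with the figures in the tables.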

Seems to work now as expected.

Yes, but I think we have to keep these points in mind:

1. Limit/Inflight can be bigger than VETH_RING_SIZE, because
packets can be enqueued at the same time as they are read out,
so netdev_tx_completed_queue() can theoretically be called with
a large number of packets.
I do not think it is a deal-breaker though.
I could see such high limits/inflights when looking at the /sys
BQL statistics.

For me this makes sense: inflight just means the number of
packets not yet 'completed', or the number of packets that you
can send between two completion calls. I think this is not specific
to this implementation. But for long intervals this might cause
problems, because you can quickly fill the veth ring to its capacity,
which increases latency if the receiver is slow.
To illustrate this have a look at [1]. There are some plots that
show the rtt vs. tx-usec config depending on nrules.

[1] https://github.com/jkoeppeler/veth-backpressure-performance-testing/tree/pktgen-and-benchmark/results/tx-usecs

2. sched_clock() is only valid on the same CPU. When a different
CPU starts executing, its sched_clock() can be in the past compared
to the sched_clock() value saved by the previous CPU.
My trick...
min(s->time, sched_clock())
... avoids potentially extremely long intervals between
netdev_tx_completed_queue() calls but is not perfect of course.
I think CPU hopping happens rarely enough for this not to matter.
And also we have to keep this in mind [1]:
"An architecture may or may not provide an implementation of
sched_clock() on its own. If a local implementation is not provided,
the system jiffy counter will be used as sched_clock()."
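One possible reading of the min(s->time, sched_clock()) clamp in point 2 is sketched below as a userspace model. The function name window_expired() and its parameters are hypothetical. The assumption is that s->time holds the start of the current coalescing window, and that a clock value older than that start should restart the window at the current time instead of leaving a deadline far in the future.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Returns true once coal_ns has elapsed since *start_ns.
 * If the clock appears to have gone backwards (for example after the
 * poll moved to another CPU), the window start is pulled back to now_ns,
 * which is equivalent to start = min(start, now). */
bool window_expired(uint64_t *start_ns, uint64_t now_ns, uint64_t coal_ns)
{
	assert(start_ns);
	if (now_ns < *start_ns)
		*start_ns = now_ns;

	return now_ns - *start_ns >= coal_ns;
}
```

With this clamp, the wait after a backwards jump is bounded by one coalescing window measured on the new CPU's clock. The window can still run longer than intended in wall-clock terms, which matches the "not perfect" caveat above.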

So the problem with this is that with jiffies you only get roughly
millisecond granularity for the interval, which might be too coarse to
work properly. Given the receiver completes 128 packets in 1ms
(queue_completed interval), BQL will set the limit at 256. Then the tx
thread can quickly fill the ring, and it is then basically stopped until
the 1ms interval is over.

I don't know how much this matters in practice. Not sure which
architectures still hit the jiffies fallback. When in doubt we could
disable BQL for those, or pin it to a "virtual queue size" via
limit_min / limit_max similar to the v5 behavior and call
queue_completed for every packet. Wdyt?

3. Inflight can be stuck at a value >0 for a long time when packet
enqueueing stops. Only when packets are enqueued again (on the next
veth_xdp_rcv() call) is netdev_tx_completed_queue() executed and
inflight set to 0 again.
This can also be seen when looking at the /sys BQL statistics.

This happens if the ring empties before the next queue_completed fires,
right?

Another thing: isn't it counterintuitive to set tx_usec_coal on the
receive device, given that this is a BQL-related setting that is
normally configured on the TX side?

BTW: Yesterday, I worked on and refactored the code into its own .h
file as a library and it also works fine for TUN/TAP (+vhost-net)
for me :)

Nice! Thank you.
Jonas

Thanks for your work!
Simon

[1] Link: https://docs.kernel.org/timers/timekeeping.html

[1] https://github.com/jkoeppeler/veth-backpressure-performance-testing/tree/pktgen-and-benchmark

Thanks,
Jonas

Thanks!

BTW: I think that this implementation could also work for other
software interfaces.

[2] Link: https://medium.com/@tom_84912/byte-queue-limits-the-unauthorized-biography-61adc5730b83

There is an important gotcha. We actually have micro-bursts of queuing
(likely due to scheduling noise). Reading BQL stats from /sys will show
BQL inflight=1, but when using the option --hist it is visible that
@inflight has a long tail (see below signature). The "qdisc" output
line also shows this happening via requeues increasing (approx 17/sec in
a test with 567Kpps). (This was with the time-based BQL impl.)

I understand..