Re: [PATCH net-next v5 3/5] veth: implement Byte Queue Limits (BQL) for latency reduction
From: Jonas Köppeler
Date: Fri May 22 2026 - 03:17:42 EST
On 5/19/26 10:51 PM, Simon Schippers wrote:
On 5/12/26 23:55, Simon Schippers wrote:
On 5/12/26 15:54, Jesper Dangaard Brouer wrote:
Hi,
Nope, I'm using a bpftrace program to keep track of the inflight/limit
in a BPF hashmap. Reading from /sys will not be accurate.
Ah nice.
Add the option --hist to have both NAPI and BQL histograms printed when
script ends. This will give you an accurate pattern of how inflight and
limit evolves.
I ran the benchmarks with --hist and I see what you mean.
Thanks for running these benchmarks.
I moved the selftests into a github repo [1] to allow us to collaborate
and evaluate the changes more easily. I explicitly kept the new BPF
based BQL tracking as a commit [2] for your benefit.
[1] https://github.com/netoptimizer/veth-backpressure-performance-testing/tree/main/selftests
[2] https://github.com/netoptimizer/veth-backpressure-performance-testing/commit/f25c5dc92977
Thanks for sharing. After minor issues I was able to set it up
(currently I am just using plain v5, will look at the coalescing patch
when I find the time):
Can confirm the latency reduction with the default settings, in my case
4.888ms to 0.241ms.
With the same script I was also able to see a performance slowdown:
veth_bql_test_virtme.sh --qdisc fq_codel --nrules 0
--> ~510 Kpps
Same with --bql-disable
--> ~570 Kpps
--> 12% faster
Notice that --nrules 0 can easily result in no queuing (on average),
because the veth NAPI consumer is faster than the producer. You will
likely see BQL inflight=1 and a very low avg latency reported by the sink
(remember it is okay that the sink gets a high latency penalty as long as
ping latency remains low, as that shows AQM is working).
I have very similar results.
Is Jonas' way [1] of modifying pktgen maybe the best option to ensure
that the producer is faster than the consumer?
[1] Link: https://lore.kernel.org/netdev/e8cdba04-aa9a-45c6-9807-8274b62920df@xxxxxxxxxxxxxx/
Hi, so what I found is that pktgen does not respect
__QUEUE_STATE_STACK_XOFF. So the test data above is invalid, since it
just sent packets even if BQL had "stopped" the queue. So I patched
pktgen with the following:
- if (unlikely(netif_xmit_frozen_or_drv_stopped(txq))) {
+ if (unlikely(netif_xmit_frozen_or_stopped(txq))) {
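A minimal userspace sketch of why this one-line change matters. It assumes
the queue state is a bitmask with separate "driver stopped", "stack (BQL)
stopped" and "frozen" bits, as in include/linux/netdevice.h; all model_*
names are made up for illustration and this is not kernel code.

```c
#include <stdbool.h>

/* Bit positions modelled on enum netdev_queue_state_t (assumed layout). */
enum model_queue_state {
	MODEL_STATE_DRV_XOFF,   /* driver stopped the queue */
	MODEL_STATE_STACK_XOFF, /* stack (BQL) stopped the queue */
	MODEL_STATE_FROZEN,     /* queue is frozen */
};

#define MODEL_DRV_XOFF   (1UL << MODEL_STATE_DRV_XOFF)
#define MODEL_STACK_XOFF (1UL << MODEL_STATE_STACK_XOFF)
#define MODEL_FROZEN     (1UL << MODEL_STATE_FROZEN)

/* Models netif_xmit_frozen_or_drv_stopped(): ignores the stack/BQL bit. */
static bool model_frozen_or_drv_stopped(unsigned long state)
{
	return state & (MODEL_DRV_XOFF | MODEL_FROZEN);
}

/* Models netif_xmit_frozen_or_stopped(): also honours the stack/BQL bit. */
static bool model_frozen_or_stopped(unsigned long state)
{
	return state & (MODEL_DRV_XOFF | MODEL_STACK_XOFF | MODEL_FROZEN);
}
```

With only MODEL_STACK_XOFF set (the state BQL puts the queue in), the first
check reports "not stopped" and the second reports "stopped", which matches
the behaviour described above: unpatched pktgen keeps sending.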
After thinking more about the implementation I see possible issues:
1. netdev_tx_completed_queue() never reports more than burst=64 packets:
BQL only increments the limit if the queue was starved. That means:
"The queue was over-limit in the last interval (the last time completion
processing ran), and there is no more data in the queue (i.e. it’s
empty)" [2]
But as only 64 packets are reported at max, the queue can only grow when
it is <= 64 packets. And then it can only stay at a limit >64 until the
next decrease of the limit.
2. netdev_tx_completed_queue() is called at irregular intervals:
If the consumer is slow it is called approximately every tx_coal_usecs.
But if the consumer is fast it is called far more frequently, probably
at irregular intervals depending on the scheduling.
However, "BQL depends on periodic completion interrupts" [2].
--> How about adding something like an interrupt that triggers every
10us and calls netdev_tx_completed_queue() with n_bql collected from
(multiple) veth_xdp_rcv runs? That could solve 1. and 2.
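Point 1 can be illustrated with a deliberately simplified model. It encodes
only the rule quoted from [2] ("over-limit in the last interval and empty
now") plus the cap of 64 completions per report. The real DQL code in
lib/dynamic_queue_limits.c has further conditions and works on running
counters, so treat this as a sketch; the model_* names are hypothetical.

```c
#include <stdbool.h>

#define MODEL_BURST 64u /* max completions reported per call (point 1) */

/*
 * Returns true if a single completion report could be seen as "starved",
 * i.e. the queue was over the limit and is empty after the report.
 * Simplified: only the rule quoted from [2] is modelled.
 */
static bool model_limit_may_grow(unsigned int inflight, unsigned int limit)
{
	/* At most MODEL_BURST packets can be completed by one report. */
	unsigned int completed = inflight < MODEL_BURST ? inflight : MODEL_BURST;
	bool over_limit = inflight > limit;
	bool empty_after = (inflight - completed) == 0;

	return over_limit && empty_after;
}
```

In this model, inflight=40 with limit=32 allows growth, while inflight=100
with limit=90 does not, because 36 packets remain after the capped report.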
I worked on a new version (see attachment) that addresses both issues.
The major change is that instead of tracking the timestamp and packet
count as local variables in veth_xdp_rcv(), they are now stored
persistently in veth_rq as struct veth_bql_state. This allows completions
to accumulate across multiple NAPI poll calls, so
netdev_tx_completed_queue() can report more than 64 packets at once
(see point 1). To get the time I am using (the fast) sched_clock() with
a trick to avoid issues when switching between CPUs.
For point 2, the coalescing deadline is now checked both before the
receive loop (to flush completions that timed out since the previous
poll) and after each consumed packet, making completion intervals more
regular. Still the intervals can be smaller than
VETH_BQL_COAL_TX_USECS, but I guess this is fine.
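The attachment is not reproduced here, so the following is only a userspace
sketch of the idea as described: completion state that persists across
polls, with the deadline checked before the loop and after each packet.
The model_* names, the nanosecond clock passed in by the caller and the
100 us value are assumptions for illustration, not the patch itself.

```c
#include <stdint.h>

#define MODEL_COAL_NS (100u * 1000u) /* assumed 100 us coalescing window */

struct model_bql_state {
	uint64_t deadline_ns;    /* next time pending completions are reported */
	unsigned int pending;    /* completions accumulated across polls */
	unsigned int flushes;    /* number of non-empty reports issued */
	unsigned int max_report; /* largest single report seen */
};

/* Stand-in for netdev_tx_completed_queue(): only records the report. */
static void model_report(struct model_bql_state *s, uint64_t now_ns)
{
	if (s->pending > s->max_report)
		s->max_report = s->pending;
	if (s->pending)
		s->flushes++;
	s->pending = 0;
	s->deadline_ns = now_ns + MODEL_COAL_NS;
}

static void model_flush_if_due(struct model_bql_state *s, uint64_t now_ns)
{
	if (now_ns >= s->deadline_ns)
		model_report(s, now_ns);
}

/* One poll: consume `budget` packets costing `pkt_ns` each; returns new time. */
static uint64_t model_poll(struct model_bql_state *s, uint64_t now_ns,
			   unsigned int budget, uint64_t pkt_ns)
{
	unsigned int i;

	model_flush_if_due(s, now_ns);         /* before the receive loop */
	for (i = 0; i < budget; i++) {
		now_ns += pkt_ns;
		s->pending++;
		model_flush_if_due(s, now_ns); /* after each consumed packet */
	}
	return now_ns;
}
```

Starting with deadline_ns = MODEL_COAL_NS and running four polls of 64
packets at 500 ns per packet, this model issues a single report of 200
completions, i.e. more than one poll's budget of 64.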
I also found out that the BQL limit correlates closely with
VETH_BQL_COAL_TX_USECS. It essentially reflects the latency we are
targeting. I raised the default to 100 µs to allow DQL to converge to a
higher limit (for reaching 255 in the testing below).
With the patched pktgen (respecting __QUEUE_STATE_STACK_XOFF), testing
shows:
- --nrules 0: DQL limit reaches (up to) ~255
- --nrules 10000: DQL limit converges to ~0 (with --gro-disable)
These results are what I would expect from a BQL algorithm, but more
testing is needed of course.
What do you think?
Hi,
This is exactly what I had in mind for implementing the BQL algorithm
in this case. I did some testing with pktgen of this patch and also
compared it to the v5 version.
You can find an extension of the benchmark script with pktgen here [1],
as well as a wrapper script (veth_bql_bench.sh) to run the test script
with and without --bql-disable to report the difference. I also
configured pktgen to use the qdisc as suggested by Jesper.
Note: bpftrace needs to be disabled, otherwise it becomes the
bottleneck (at least on my machine) and pktgen throughput is halved
when enabled.
Here are the results:
v5 (not time-based):
--nrules 0 --pktgen --no-bpftrace
========================================
Results (average over 10 runs):
========================================
                     BQL on     BQL off
---                  ------     -------
Throughput (pps)     1980871    2169898
Ping RTT avg (ms)    0.065      0.162
Throughput diff      -8.7%      // BQL 8.7% lower throughput
RTT diff             -59.9%     // BQL 60% lower latency
========================================
Simon's time-based version:
Test args: --nrules 0 --pktgen --no-bpftrace
========================================
Results (average over 10 runs):
========================================
                     BQL on     BQL off
---                  ------     -------
Throughput (pps)     2166335    2153398
Ping RTT avg (ms)    0.165      0.165
Throughput diff      0.6%
RTT diff             0.0%
--pktgen --no-bpftrace --nrules 3500
========================================
Results (average over 10 runs):
========================================
                     BQL on     BQL off
---                  ------     -------
Throughput (pps)     28569      28696
Ping RTT avg (ms)    1.327      8.409
Throughput diff      -0.4%
RTT diff             -84.2%
Seems to work now as expected.
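For reference, the "diff" rows are consistent with a relative difference
against the BQL-off column. That is inferred from the numbers; how the
wrapper script actually computes it is an assumption, and rel_diff_pct is
a hypothetical helper name.

```c
/* Relative difference of `on` versus the `off` baseline, in percent. */
static double rel_diff_pct(double on, double off)
{
	return (on - off) / off * 100.0;
}
```

For example, rel_diff_pct(1980871, 2169898) is about -8.7 and
rel_diff_pct(1.327, 8.409) is about -84.2.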
[1] https://github.com/jkoeppeler/veth-backpressure-performance-testing/tree/pktgen-and-benchmark
Thanks,
Jonas
Thanks!
BTW: I think that this implementation could also work for other
software interfaces.
[2] Link: https://medium.com/@tom_84912/byte-queue-limits-the-unauthorized-biography-61adc5730b83
There is an important gotcha. We actually have micro-bursts of queuing
(likely due to scheduling noise). Reading BQL stats from /sys will show
BQL inflight=1, but when using the option --hist it is visible that
@inflight has a long tail (see below signature). The "qdisc" output
line also shows this happening via requeues increasing (approx 17/sec in
a test with 567Kpps). (this was with the time-based BQL impl).
I understand..