Re: [PATCH net-next v5 3/5] veth: implement Byte Queue Limits (BQL) for latency reduction

From: Simon Schippers

Date: Tue May 26 2026 - 11:46:52 EST

On 5/26/26 16:55, Jonas Köppeler wrote:
> On 5/26/26 4:35 PM, Simon Schippers wrote:
>> On 5/26/26 11:54, Jonas Köppeler wrote:
>>> On 5/23/26 6:09 PM, Simon Schippers wrote:
>>>> On 5/22/26 18:26, Jonas Köppeler wrote:
>>>>> On 5/22/26 10:41, Simon Schippers wrote:
>>>>>> On 5/22/26 09:14, Jonas Köppeler wrote:
>>>>>>> On 5/19/26 10:51 PM, Simon Schippers wrote:
>>>>>>>> On 5/12/26 23:55, Simon Schippers wrote:
>>>>>>>>> On 5/12/26 15:54, Jesper Dangaard Brouer wrote:
>>>>>>>>>>>> Nope, I'm using a bpftrace program to keep track of the inflight/limit
>>>>>>>>>>>> in a BPF hashmap. Reading from /sys will not be accurate.
>>>>>>>>>>> Ah nice.
>>>>>>>>>> Add the option --hist to have both NAPI and BQL histograms printed when
>>>>>>>>>> script ends. This will give you an accurate pattern of how inflight and
>>>>>>>>>> limit evolves.
>>>>>>>>>>
>>>>>>>>>>>> I moved the selftests into a github repo [1] to allow us to collaborate
>>>>>>>>>>>> and evaluate the changes more easily. I explicitly kept the new BPF
>>>>>>>>>>>> based BQL tracking as a commit[2] for your benefit.
>>>>>>>>>>>>
>>>>>>>>>>>> [1]https://github.com/netoptimizer/veth-backpressure-performance-testing/tree/main/selftests
>>>>>>>>>>>>
>>>>>>>>>>>> [2]https://github.com/netoptimizer/veth-backpressure-performance-testing/commit/f25c5dc92977
>>>>>>>>>>> Thanks for sharing. After minor issues I was able to set it up
>>>>>>>>>>> (currently I am just using plain v5, will look at the coalescing patch
>>>>>>>>>>> when I find the time):
>>>>>>>>>>>
>>>>>>>>>>> Can confirm the latency reduction with the default settings, in my case
>>>>>>>>>>> 4.888ms to 0.241ms.
>>>>>>>>>>>
>>>>>>>>>>> With the same script I was also able to see a performance slow down:
>>>>>>>>>>> veth_bql_test_virtme.sh --qdisc fq_codel --nrules 0
>>>>>>>>>>> --> ~510 Kpps
>>>>>>>>>>> Same with --bql-disable
>>>>>>>>>>> --> ~570 Kpps
>>>>>>>>>>> --> 12% faster
>>>>>>>>>>>
>>>>>>>>>> Thanks for running these benchmarks.
>>>>>>>>>>
>>>>>>>>>> Notice that --nrules 0 can easily result in no-queuing (on average),
>>>>>>>>>> because the veth NAPI consumer is faster than the producer. You will
>>>>>>>>>> likely see BQL inflight=1 and sink reported avg latency very low
>>>>>>>>>> (remember it is okay that the sink gets a high latency penalty as long as
>>>>>>>>>> ping latency remains low, as that shows AQM is working).
>>>>>>>>> I ran the benchmarks with --hist and I see what you mean.
>>>>>>>>> I have very similar results.
>>>>>>>>>
>>>>>>>>> Is Jonas' way [1] of modifying pktgen maybe the best option to ensure
>>>>>>>>> that the producer is faster than the consumer?
>>>>>>>>>
>>>>>>>>> [1] Link:https://lore.kernel.org/netdev/e8cdba04-aa9a-45c6-9807-8274b62920df@xxxxxxxxxxxxxx/
>>>>>>>>>
>>>>>>>>>> Hi, so what I found is that pktgen does not respect
>>>>>>>>>> __QUEUE_STATE_STACK_XOFF. So the test data above is invalid, since it
>>>>>>>>>> just sent packets even if the BQL "stopped" the queue. So I patched
>>>>>>>>>> pktgen with the following:
>>>>>>>>>>
>>>>>>>>>> - if (unlikely(netif_xmit_frozen_or_drv_stopped(txq))) {
>>>>>>>>>> + if (unlikely(netif_xmit_frozen_or_stopped(txq))) {
>>>>>>>>> After thinking more about the implementation I see possible issues:
>>>>>>>>>
>>>>>>>>> 1. netdev_tx_completed_queue() never reports more than burst=64 packets:
>>>>>>>>>
>>>>>>>>> BQL only increments the limit if the queue was starved. That means:
>>>>>>>>> "The queue was over-limit in the last interval (the last time completion
>>>>>>>>> processing ran), and there is no more data in the queue (i.e. it’s
>>>>>>>>> empty)" [2]
>>>>>>>>> But as only 64 packets are reported at max, the queue can only grow when
>>>>>>>>> it is <= 64 packets. And then it can only stay at a limit >64 until the
>>>>>>>>> next decrease of the limit.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> 2. netdev_tx_completed_queue() is called in irregular intervals:
>>>>>>>>>
>>>>>>>>> If the consumer is slow it is called approx each tx_coal_usecs.
>>>>>>>>> But if the consumer is fast it is called far more frequently, probably
>>>>>>>>> in irregular intervals depending on the scheduling.
>>>>>>>>> However, "BQL depends on periodic completion interrupts" [2].
>>>>>>>>>
>>>>>>>>> --> How about adding something like an interrupt that triggers every
>>>>>>>>> 10us and calls netdev_tx_completed_queue() with n_bql collected from
>>>>>>>>> (multiple) veth_xdp_rcv runs? That could solve 1. and 2.
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> I worked on a new version (see attachment) that addresses both issues.
>>>>>>>>
>>>>>>>> The major change is that instead of tracking the timestamp and packet
>>>>>>>> count as local variables in veth_xdp_rcv(), they are now stored
>>>>>>>> persistently in veth_rq as struct veth_bql_state. This allows completions
>>>>>>>> to accumulate across multiple NAPI poll calls, so
>>>>>>>> netdev_tx_completed_queue() can report more than 64 packets at once
>>>>>>>> (see point 1). To get the time I am using (the fast) sched_clock() with
>>>>>>>> a trick to avoid issues when switching between CPUs.
>>>>>>>>
>>>>>>>> For point 2, the coalescing deadline is now checked both before the
>>>>>>>> receive loop (to flush completions that timed out since the previous
>>>>>>>> poll) and after each consumed packet, making completion intervals more
>>>>>>>> regular. Still the intervals can be smaller than
>>>>>>>> VETH_BQL_COAL_TX_USECS, but I guess this is fine.
>>>>>>>>
>>>>>>>> I also found out that the BQL limit correlates closely with
>>>>>>>> VETH_BQL_COAL_TX_USECS. It essentially reflects the latency we are
>>>>>>>> targeting. I raised the default to 100 µs to allow DQL to converge to a
>>>>>>>> higher limit (for reaching 255 in the testing below).
>>>>>>>>
>>>>>>>> With the patched pktgen (respecting __QUEUE_STATE_STACK_XOFF), testing
>>>>>>>> shows:
>>>>>>>> - --nrules 0: DQL limit reaches (up to) ~255
>>>>>>>> - --nrules 10000: DQL limit converges to ~0 (with --gro-disable)
>>>>>>>>
>>>>>>>> These results are what I would expect from a BQL algorithm, but more
>>>>>>>> testing is needed of course.
>>>>>>>>
>>>>>>>> What do you think?
>>>>>>> Hi,
>>>>>>>
>>>>>>> This is exactly what I had in mind for implementing the BQL algorithm
>>>>>>> in this case. I did some testing with pktgen of this patch and also
>>>>>>> compared it to the v5 version.
>>>>>>>
>>>>>>> You can find an extension of the benchmark script with pktgen here [1],
>>>>>>> as well as a wrapper script (veth_bql_bench.sh) to run the test script
>>>>>>> with and without --bql-disable to report the difference. I also
>>>>>>> configured pktgen to use the qdisc as suggested by Jesper.
>>>>>> Great, I will use your pktgen solution from now on.
>>>>>>
>>>>>> Didn't know about the qdisc option, is there a performance difference
>>>>>> with/without it? Or is it to have ping working next to pktgen?
>>>>> Did not see any big difference performance-wise, but as you say, ping
>>>>> works better with pktgen then.
>>>>>
>>>>>> Consider to do a pull request :)
>>>>>>
>>>>>>> Note: bpftrace needs to be disabled, otherwise it becomes the
>>>>>>> bottleneck (at least on my machine) and pktgen throughput is halved
>>>>>>> when enabled.
>>>>>> Good to know.
>>>>>>
>>>>>>> Here are the results:
>>>>>>>
>>>>>>> v5 (not time-based):
>>>>>>> --nrules 0 --pktgen --no-bpftrace
>>>>>>> ========================================
>>>>>>> Results (average over 10 runs):
>>>>>>> ========================================
>>>>>>>                     BQL on    BQL off
>>>>>>> ---                 ------    -------
>>>>>>> Throughput (pps)    1980871   2169898
>>>>>>> Ping RTT avg (ms)   0.065     0.162
>>>>>>> Throughput diff     -8.7%     // BQL 8.7% lower throughput
>>>>>>> RTT diff            -59.9%    // BQL 60% lower latency
>>>>>>> ========================================
>>>>>>>
>>>>>>> Simon's time-based version:
>>>>>>>
>>>>>>> Test args: --nrules 0 --pktgen --no-bpftrace
>>>>>>> ========================================
>>>>>>> Results (average over 10 runs):
>>>>>>> ========================================
>>>>>>>                     BQL on    BQL off
>>>>>>> ---                 ------    -------
>>>>>>> Throughput (pps)    2166335   2153398
>>>>>>> Ping RTT avg (ms)   0.165     0.165
>>>>>>> Throughput diff     0.6%
>>>>>>> RTT diff            0.0%
>>>>>>>
>>>>>>> --pktgen --no-bpftrace --nrules 3500
>>>>>>> ========================================
>>>>>>> Results (average over 10 runs):
>>>>>>> ========================================
>>>>>>>                     BQL on    BQL off
>>>>>>> ---                 ------    -------
>>>>>>> Throughput (pps)    28569     28696
>>>>>>> Ping RTT avg (ms)   1.327     8.409
>>>>>>> Throughput diff     -0.4%
>>>>>>> RTT diff            -84.2%
>>>>>>>
>>>>>> I think we should run benchmarks against the stock net-next to
>>>>>> be safe.
>>>>> --nrules 0 --pktgen --no-bpftrace
>>>>> ========================================
>>>>> Results (average over 10 runs):
>>>>> ========================================
>>>>>                     net-next
>>>>> ---                 -------
>>>>> Throughput (pps)    2285421
>>>>> Ping RTT avg (ms)   0.161
>>>>>
>>>>> (slightly adjusted the output to better communicate the results)
>>>>>
>>>>> So in my case this means BQL implementation has ~5% lower throughput
>>>>> compared to net-next. But please double check.
>>>>>
>>>> Yes, I will benchmark myself.
>>>>
>>>> There are probably some places where we can gain performance.
>>>> For example, I see ptr_ring_empty() which could be swapped for
>>>> __ptr_ring_empty() which would save a spinlock and unlock.
>>>>
>>>>>>> Seems to work now as expected.
>>>>>> Yes, but I think we have to keep these points in mind:
>>>>>>
>>>>>> 1. Limit/Inflight can be bigger than VETH_RING_SIZE, because
>>>>>> packets can be enqueued at the same time as they are read out,
>>>>>> so netdev_tx_completed_queue() can theoretically be called with
>>>>>> a large number of packets.
>>>>>> I do not think it is deal-breaking though.
>>>>>> I could see such high limits/inflights when looking at the /sys
>>>>>> BQL statistics.
>>>>> For me this makes sense, that inflight just means the number of
>>>>> packets not yet 'completed' or the number of packets that you
>>>>> can send between two completion calls. I think this is not specific
>>>> From my understanding, I do think that this behavior is
>>>> pretty specific.
>>>>
>>>> Typically BQL-enabled NIC drivers clear packets out of some
>>>> internal buffer in their completion interrupt (or something
>>>> similar). And after that they call netdev_tx_completed_queue().
>>> You are right, because descriptors are freed at the same time when they are accounted for.
>>>
>>>>> to this implementation. But for long intervals this might result in
>>>>> some problems because you can just fill the veth_ring to its capacity
>>>>> quickly, and increasing latency if the receiver is slow.
>>>> Yes, but I think the latency can only be approx. as big as the
>>>> interval (can be higher with GRO enabled).
>>> I think it's more like ~2.x times added latency in the worst case,
>>> which is normal BQL behavior, but for large intervals 2x is a lot.
>>> Of course this value is also capped by the time it takes to process
>>> the 256 packets that fit in the ring buffer.
>>>
>>> For example, for a test run with fq_codel and 20K rules, and a ping
>>> baseline of 1.7 ms without pktgen load, the results are:
>>> tx-usecs   p99_ping_rtt (ms)
>>> --------   -----------------
>>> 0          5.223
>>> 100        6.910
>>> 500        7.197
>>> 1000       6.967
>>> 5000       16.233
>>> 10000      25.033
>>> 20000      44.133
>>>
>>> Also, the interval is sometimes dominated by the processing time of
>>> a packet batch (and other effects). The default is 8 packets/batch
>>> (see gro_normal_batch). This means that for 7 packets you see
>>> processing times on the order of a few hundred nanoseconds, but for
>>> the 8th you see the processing time of the whole batch (see table below).
>>> With a ruleset > 1000 entries, this exceeds the default tx-usecs value,
>>> which is why the p99 RTT increase is even larger than 2x tx-usecs
>>> -- as you can see in the table for tx-usecs values 0-5000.
>>>
>>> tx-usecs 0, nrules 0:
>>> Percentile   per-packet processing time (µs)
>>> ----------   -------------
>>> p5           0.316
>>> p25          0.382
>>> p50          0.442
>>> p75          0.508    -- fast enqueue to skb_list
>>> p95          68.304   -- batch processing time
>>>
>>> Setting gro_normal_batch to 1 makes the interval times more
>>> accurate and improves p99 RTT slightly, though it's still > 2x the
>>> tx-usecs latency increase.
>>>
>>> However, using smaller values > 0 means that you can benefit
>>> from bql, but the actual interval might be larger than the
>>> configured value. I do not think this is a deal breaker, but
>>> something worth noting where the latencies come from.
>> I understand and agree with your points.
>> I also do not see anything deal-breaking.
>> Yes, there is no guarantee for the latency, but BQL
>> is best-effort anyway.
>>
>> I will wait for your new measurements, but there is no argument
>> against a default tx-usecs of ~100us for now, right?
>
> Yes, I think 100us is perfectly fine. I guess most of it was
> just my curiosity why the latency values are as they are :)

Which is great, because I was wondering the same :)

> But it feels like this will need some documentation, because
> as we have seen, some values are a little different
> from what you expect from bql. Inflight > veth_ring_size,
> tx-usecs not necessarily achieving the configured value,
> inflight can get stuck > 0. Wdyt?
> But I think it works nicely overall.
>

Exactly, we should get ready for a v6 soon.

And I think we should move the BQL logic into a separate .h file
as a library. Then it is also usable for TUN/TAP in the future.
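
To make that a bit more concrete, below is a rough userspace sketch of
the state such a helper could keep. It is only a model under
assumptions: the names (bql_coal_state, bql_coal_account(),
bql_coal_flush_due()) are made up for illustration and are not the
ones from the attached patch, the caller passes the clock value in
instead of reading sched_clock(), and the actual
netdev_tx_completed_queue() call is left to the caller.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical userspace model of the coalescing state; not kernel code. */
struct bql_coal_state {
	uint64_t time;        /* start of the current coalescing interval (ns) */
	uint64_t coal_ns;     /* target interval between completion reports (ns) */
	unsigned int n_pkts;  /* consumed packets not yet reported */
	unsigned int n_bytes; /* consumed bytes not yet reported */
};

static void bql_coal_init(struct bql_coal_state *s, uint64_t now,
			  uint64_t coal_ns)
{
	assert(coal_ns > 0);
	s->time = now;
	s->coal_ns = coal_ns;
	s->n_pkts = 0;
	s->n_bytes = 0;
}

/* Account one consumed packet. The counters persist across poll runs,
 * so a single report can cover more than one NAPI budget.
 */
static void bql_coal_account(struct bql_coal_state *s, unsigned int bytes)
{
	s->n_pkts++;
	s->n_bytes += bytes;
}

/* Return true when the accumulated completions are due. On true,
 * *pkts and *bytes hold what the caller would pass on to
 * netdev_tx_completed_queue(), and a new interval starts at 'now'.
 */
static bool bql_coal_flush_due(struct bql_coal_state *s, uint64_t now,
			       unsigned int *pkts, unsigned int *bytes)
{
	/* Clock appears to have gone backwards (e.g. another CPU):
	 * clamp the interval start, i.e. time = min(time, now).
	 */
	if (now < s->time)
		s->time = now;

	if (!s->n_pkts || now - s->time < s->coal_ns)
		return false;

	*pkts = s->n_pkts;
	*bytes = s->n_bytes;
	s->n_pkts = 0;
	s->n_bytes = 0;
	s->time = now;
	return true;
}
```

A driver would call the account helper for every consumed packet and
the flush check both before its receive loop and after each packet,
the same way it is done for veth above.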

Let's amend the commits. Should we do this on GitHub?

>>> Also worth noting: fq_codel is the better choice in this
>>> case because it can fast-track sparse flows. With sfq, the
>>> ping packet can get head-of-line blocked by the current quantum
>>> of sfq, if my understanding is correct.
>> Why don't we use a prio qdisc where ping gets prioritized?
>> Then we do not have to worry about head of line blocking on
>> the qdisc layer and instead can concentrate on veth.
>>
>>>>> To illustrate this have a look at [1]. There are some plots that
>>>>> show the rtt vs. tx-usec config depending on nrules.
>>>>>
>>>>> [1] https://github.com/jkoeppeler/veth-backpressure-performance-testing/tree/pktgen-and-benchmark/results/tx-usecs
>>>>>
>>>> Nice plots! Great for finding a sane default value.
>>> I did some more measurements to better understand what's
>>> happening. I will upload the results by tomorrow at the latest
>>> to the github repository mentioned above.
>>>
>>>>>> 2. sched_clock() is only valid on the same CPU. When a different
>>>>>> CPU starts executing its sched_clock() can be in the past compared
>>>>>> to the sched_clock() value saved by the previous CPU.
>>>>>> My trick...
>>>>>> min(s->time, sched_clock())
>>>>>> ... avoids potentially extremely long intervals between
>>>>>> netdev_tx_completed_queue() calls but is not perfect of course.
>>>>>> I think CPU hopping happens rarely enough for this to matter..
>>>>>> And also we have to keep this in mind [1]:
>>>>>> "An architecture may or may not provide an implementation of
>>>>>> sched_clock() on its own. If a local implementation is not provided,
>>>>>> the system jiffy counter will be used as sched_clock()."
>>>>> So the problem with this is that with jiffies you have like millisecond
>>>>> interval granularity, which might be too long in order to work properly.
>>>>> Given the receiver completes 128 packets in 1ms (queue_completed interval),
>>>>> the bql will set the limit at 256. Then the tx thread can quickly fill
>>>>> the ring, and it is then basically stopped until the 1ms interval is over.
>>>>>
>>>> Linux Mint only has 100Hz jiffies :^)
>>> ok so 10ms granularity. Effectively disables BQL for the most part.
>>> However, you can still benefit if your receive side is very slow.
>>>
>>>> I think you misunderstand something.
>>>>
>>>> + if (peer_txq && state->n_bql && ptr_ring_empty(&rq->xdp_ring)) {
>> // Pairs with smp_wmb() in __ptr_ring_produce()
>>
>>>> + smp_rmb();
>>>> + if (test_bit(__QUEUE_STATE_STACK_XOFF, &peer_txq->state))
>>>> + veth_bql_complete(state, peer_txq);
>>>> + }
>>>>
>>>> This snippet completes whenever the queue is empty and the
>>>> netdev queue is stopped due to BQL.
>>>> So then the interval is smaller than 1ms.
>>>> I think, that is the reason why the pps is fine for 1ms in
>>>> your benchmarks.
>>> Without that snippet the queue stalls during ramp-up and
>>> throughput never gets off the ground. But once the BQL limit has
>>> stabilized, the snippet is no longer invoked -- with --nrules 0
>>> the queue is never stopped, and with a slow receiver the queue is
>>> not empty. So the snippet is needed for correct ramp-up, but it
>>> doesn't affect steady-state throughput in these tests. This is also
>>> confirmed by a test where I measure the time between intervals.
>>> For 1000us across all nrules configurations the time ranges between
>>> 1000-1400us.
>> Yes, I agree. It only acts in the case of starvation (see [1])
>> and I think it is an advantage that we react faster compared
>> to other BQL implementations.
>>
>> Additionally the snippet is *required* because if the ring
>> is empty and the txq stopped due to BQL, there are no more
>> veth_xdp_rcv() calls, permanently stalling everything.
>> That is the reason for the memory barrier pairing in the snippet.
>>
>> [1] Link: https://medium.com/@tom_84912/byte-queue-limits-the-unauthorized-biography-61adc5730b83
>>
>>> The variance of the time between intervals is also quite good in
>>> general, but not always exactly the configured value. For most cases
>>> +-5% from p50 value of the real interval.
>>> Migrations between CPUs can also shorten an interval: if the new
>>> CPU's sched_clock() is ahead of the previous one (say by 15 ms),
>>> the apparent elapsed time within the current interval is inflated,
>>> so only ~5 ms of real time remains before the deadline fires (with
>>> the configured tx-usecs at 20 ms). Without migrations the interval
>>> stays consistently close to the configured value. I don't think
>>> this is a problem in practice -- it just adds some variance to the
>>> p5 of the measured interval.
>> Yes, and then we mistakenly decrease the BQL limit once.
>> But we react fast on that starvation. So it's fine.
>>
>>>>> I don't know how much this matters in practice. Not sure which
>>>>> architectures still hit the jiffies fallback. When in doubt we could
>>>>> disable BQL for those, or pin it to a "virtual queue size" via
>>>>> limit_min / limit_max similar to the v5 behavior and call
>>>>> queue_completed for every packet. Wdyt?
>>>> Yes, probably not relevant, I think I would just disable it for those
>>>> architectures, avoiding possible regressions.
>>>>
>>>>>> 3. Inflight can be stuck at a value>0 for a long time when packet
>>>>>> enqueueing stops. Only when packets are enqueued again,
>>>>>> (on the next veth_xdp_rcv() call,) netdev_tx_completed_queue() is
>>>>>> executed and inflight is set to 0 again.
>>>>>> Can also be seen when looking at the /sys BQL statistics.
>>>>> This happens if the ring empties before the next queue_completed fires,
>>>>> right?
>>>>>
>>>> Exactly.
>>>>
>>>>> Another thing: Is it counterintuitive to set the tx_usec_coal on the
>>>>> receive device? Because this is a bql related config that is normally
>>>>> configured on the TX side?
>>>> Yes, I would say so.
>>>>
>>>>>> BTW: Yesterday, I worked on and refactored the code into its own .h
>>>>>> file as a library and it also works fine for TUN/TAP (+vhost-net)
>>>>>> for me :)
>>>>> Nice! Thank you.
>>>>> Jonas
>>>>>
>>>>>> Thanks for your work!
>>>>>> Simon
>>>>>>
>>>>>> [1] Link: https://docs.kernel.org/timers/timekeeping.html
>>>>>>
>>>>>>> [1]https://github.com/jkoeppeler/veth-backpressure-performance-testing/tree/pktgen-and-benchmark
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Jonas
>>>>>>>
>>>>>>>> Thanks!
>>>>>>>>
>>>>>>>> BTW: I think that this implementation could also work for other
>>>>>>>> software interfaces.
>>>>>>>>
>>>>>>>>> [2] Link:https://medium.com/@tom_84912/byte-queue-limits-the-unauthorized-biography-61adc5730b83
>>>>>>>>>
>>>>>>>>>> There is an important gotcha. We actually have micro-burst of queuing
>>>>>>>>>> (likely due to scheduling noise). Reading BQL stats from /sys will show
>>>>>>>>>> BQL inflight=1, but when using the option --hist it is visible that
>>>>>>>>>> @inflight has a long tail (see below signature). The "qdisc" output
>>>>>>>>>> line also shows this happening via requeues increasing (approx 17/sec in
>>>>>>>>>> a test with 567Kpps). (this was with the time-based BQL impl).
>>>>>>>>> I understand..
>>>>>>>>>