Re: [PATCH net-next v5 3/5] veth: implement Byte Queue Limits (BQL) for latency reduction

From: Simon Schippers

Date: Tue May 26 2026 - 10:53:06 EST

On 5/26/26 11:54, Jonas Köppeler wrote:
> On 5/23/26 6:09 PM, Simon Schippers wrote:
>> On 5/22/26 18:26, Jonas Köppeler wrote:
>>> On 5/22/26 10:41, Simon Schippers wrote:
>>>> On 5/22/26 09:14, Jonas Köppeler wrote:
>>>>> On 5/19/26 10:51 PM, Simon Schippers wrote:
>>>>>> On 5/12/26 23:55, Simon Schippers wrote:
>>>>>>> On 5/12/26 15:54, Jesper Dangaard Brouer wrote:
>>>>>>>>>> Nope, I'm using a bpftrace program to keep track of the inflight/limit
>>>>>>>>>> in a BPF hashmap. Reading from /sys will not be accurate.
>>>>>>>>> Ah nice.
>>>>>>>> Add the option --hist to have both NAPI and BQL histograms printed when
>>>>>>>> the script ends. This will give you an accurate pattern of how inflight and
>>>>>>>> limit evolve.
>>>>>>>>
>>>>>>>>>> I moved the selftests into a github repo [1] to allow us to collaborate
>>>>>>>>>> and evaluate the changes more easily. I explicitly kept the new BPF
>>>>>>>>>> based BQL tracking as a commit[2] for your benefit.
>>>>>>>>>>
>>>>>>>>>> [1]https://github.com/netoptimizer/veth-backpressure-performance-testing/tree/main/selftests
>>>>>>>>>>
>>>>>>>>>> [2]https://github.com/netoptimizer/veth-backpressure-performance-testing/commit/f25c5dc92977
>>>>>>>>> Thanks for sharing. After minor issues I was able to set it up
>>>>>>>>> (currently I am just using plain v5, will look at the coalescing patch
>>>>>>>>> when I find the time):
>>>>>>>>>
>>>>>>>>> Can confirm the latency reduction with the default settings, in my case
>>>>>>>>> 4.888ms to 0.241ms.
>>>>>>>>>
>>>>>>>>> With the same script I was also able to see a performance slow down:
>>>>>>>>> veth_bql_test_virtme.sh --qdisc fq_codel --nrules 0
>>>>>>>>> --> ~510 Kpps
>>>>>>>>> Same with --bql-disable
>>>>>>>>> --> ~570 Kpps
>>>>>>>>> --> 12% faster
>>>>>>>>>
>>>>>>>> Thanks for running these benchmarks.
>>>>>>>>
>>>>>>>> Notice that --nrules 0 can easily result in no-queuing (on average),
>>>>>>>> because the veth NAPI consumer is faster than the producer. You will
>>>>>>>> likely see BQL inflight=1 and sink reported avg latency very low
>>>>>>>> (remember it is okay that the sink gets a high latency penalty as long as
>>>>>>>> ping latency remains low, as that shows AQM is working).
>>>>>>> I ran the benchmarks with --hist and I see what you mean.
>>>>>>> I have very similar results.
>>>>>>>
>>>>>>> Is Jonas' way [1] of modifying pktgen maybe the best option to ensure
>>>>>>> that the producer is faster than the consumer?
>>>>>>>
>>>>>>> [1] Link:https://lore.kernel.org/netdev/e8cdba04-aa9a-45c6-9807-8274b62920df@xxxxxxxxxxxxxx/
>>>>>>>
>>>>>>>> Hi, so what I found is that pktgen does not respect
>>>>>>>> __QUEUE_STATE_STACK_XOFF. So the test data above is invalid, since it
>>>>>>>> just sent packets even if the BQL "stopped" the queue. So I patched
>>>>>>>> pktgen with the following:
>>>>>>>>
>>>>>>>> - if (unlikely(netif_xmit_frozen_or_drv_stopped(txq))) {
>>>>>>>> + if (unlikely(netif_xmit_frozen_or_stopped(txq))) {
>>>>>>> After thinking more about the implementation I see possible issues:
>>>>>>>
>>>>>>> 1. netdev_tx_completed_queue() never reports more than burst=64 packets:
>>>>>>>
>>>>>>> BQL only increments the limit if the queue was starved. That means:
>>>>>>> "The queue was over-limit in the last interval (the last time completion
>>>>>>> processing ran), and there is no more data in the queue (i.e. it’s
>>>>>>> empty)" [2]
>>>>>>> But as only 64 packets are reported at max, the queue can only grow when
>>>>>>> it is <= 64 packets. And then it can only stay at a limit >64 until the
>>>>>>> next decrease of the limit.
>>>>>>>
>>>>>>>
>>>>>>> 2. netdev_tx_completed_queue() is called in irregular intervals:
>>>>>>>
>>>>>>> If the consumer is slow it is called approx each tx_coal_usecs.
>>>>>>> But if the consumer is fast it is called far more frequently, probably
>>>>>>> in irregular intervals depending on the scheduling.
>>>>>>> However, "BQL depends on periodic completion interrupts" [2].
>>>>>>>
>>>>>>> --> How about adding something like an interrupt that triggers every
>>>>>>> 10us and calls netdev_tx_completed_queue() with n_bql collected from
>>>>>>> (multiple) veth_xdp_rcv runs? That could solve 1. and 2.
>>>>>> Hi,
>>>>>>
>>>>>> I worked on a new version (see attachment) that addresses both issues.
>>>>>>
>>>>>> The major change is that instead of tracking the timestamp and packet
>>>>>> count as local variables in veth_xdp_rcv(), they are now stored
>>>>>> persistently in veth_rq as struct veth_bql_state. This allows completions
>>>>>> to accumulate across multiple NAPI poll calls, so
>>>>>> netdev_tx_completed_queue() can report more than 64 packets at once
>>>>>> (see point 1). To get the time I am using (the fast) sched_clock() with
>>>>>> a trick to avoid issues when switching between CPUs.
>>>>>>
>>>>>> For point 2, the coalescing deadline is now checked both before the
>>>>>> receive loop (to flush completions that timed out since the previous
>>>>>> poll) and after each consumed packet, making completion intervals more
>>>>>> regular. Still the intervals can be smaller than
>>>>>> VETH_BQL_COAL_TX_USECS, but I guess this is fine.
>>>>>>
>>>>>> I also found out that the BQL limit correlates closely with
>>>>>> VETH_BQL_COAL_TX_USECS. It essentially reflects the latency we are
>>>>>> targeting. I raised the default to 100 µs to allow DQL to converge to a
>>>>>> higher limit (for reaching 255 in the testing below).
>>>>>>
>>>>>> With the patched pktgen (respecting __QUEUE_STATE_STACK_XOFF), testing
>>>>>> shows:
>>>>>> - --nrules 0: DQL limit reaches (up to) ~255
>>>>>> - --nrules 10000: DQL limit converges to ~0 (with --gro-disable)
>>>>>>
>>>>>> These results are what I would expect from a BQL algorithm, but more
>>>>>> testing is needed of course.
>>>>>>
>>>>>> What do you think?
>>>>> Hi,
>>>>>
>>>>> This is exactly what I had in mind for implementing the BQL algorithm
>>>>> in this case. I did some testing with pktgen of this patch and also
>>>>> compared it to the v5 version.
>>>>>
>>>>> You can find an extension of the benchmark script with pktgen here [1],
>>>>> as well as a wrapper script (veth_bql_bench.sh) to run the test script
>>>>> with and without --bql-disable to report the difference. I also
>>>>> configured pktgen to use the qdisc as suggested by Jesper.
>>>> Great, I will use your pktgen solution from now on.
>>>>
>>>> Didn't know about the qdisc option, is there a performance difference
>>>> with/without it? Or is it to have ping working next to pktgen?
>>> Did not see any big difference performance-wise, but as you say, ping
>>> works better with pktgen then.
>>>
>>>> Consider opening a pull request :)
>>>>
>>>>> Note: bpftrace needs to be disabled, otherwise it becomes the
>>>>> bottleneck (at least on my machine) and pktgen throughput is halved
>>>>> when enabled.
>>>> Good to know.
>>>>
>>>>> Here are the results:
>>>>>
>>>>> v5 (not time-based):
>>>>> --nrules 0 --pktgen --no-bpftrace
>>>>> ========================================
>>>>> Results (average over 10 runs):
>>>>> ========================================
>>>>> BQL on BQL off
>>>>> --- ------ -------
>>>>> Throughput (pps) 1980871 2169898
>>>>> Ping RTT avg (ms) 0.065 0.162
>>>>> Throughput diff -8.7% // BQL 8.7% lower throughput
>>>>> RTT diff -59.9% // BQL 60% lower latency
>>>>> ========================================
>>>>>
>>>>> Simon's time-based version:
>>>>>
>>>>> Test args: --nrules 0 --pktgen --no-bpftrace
>>>>> ========================================
>>>>> Results (average over 10 runs):
>>>>> ========================================
>>>>> BQL on BQL off
>>>>> --- ------ -------
>>>>> Throughput (pps) 2166335 2153398
>>>>> Ping RTT avg (ms) 0.165 0.165
>>>>> Throughput diff 0.6%
>>>>> RTT diff 0.0%
>>>>>
>>>>> --pktgen --no-bpftrace --nrules 3500
>>>>> ========================================
>>>>> Results (average over 10 runs):
>>>>> ========================================
>>>>> BQL on BQL off
>>>>> --- ------ -------
>>>>> Throughput (pps) 28569 28696
>>>>> Ping RTT avg (ms) 1.327 8.409
>>>>> Throughput diff -0.4%
>>>>> RTT diff -84.2%
>>>>>
>>>> I think we should run benchmarks against the stock net-next to
>>>> be safe.
>>> --nrules 0 --pktgen --no-bpftrace
>>> ========================================
>>> Results (average over 10 runs):
>>> ========================================
>>> net-next
>>> --- -------
>>> Throughput (pps) 2285421
>>> Ping RTT avg (ms) 0.161
>>>
>>> (slightly adjusted the output to better communicate the results)
>>>
>>> So in my case this means the BQL implementation has ~5% lower throughput
>>> compared to net-next. But please double check.
>>>
>> Yes, I will benchmark myself.
>>
>> There are probably some places where we can gain performance.
>> For example, I see ptr_ring_empty() which could be swapped for
>> __ptr_ring_empty() which would save a spinlock and unlock.
>>
>>>>> Seems to work now as expected.
>>>> Yes, but I think we have to keep these points in mind:
>>>>
>>>> 1. Limit/Inflight can be bigger than VETH_RING_SIZE, because
>>>> packets can be enqueued at the same time as they are read out,
>>>> so netdev_tx_completed_queue() can theoretically be called with
>>>> a large number of packets.
>>>> I do not think it is deal-breaking though.
>>>> I could see such high limits/inflights when looking at the /sys
>>>> BQL statistics.
>>> For me this makes sense, that inflight just means the number of
>>> packets not yet 'completed' or the number of packets that you
>>> can send between two completion calls. I think this is not specific
>> From my understanding, I do think that this behavior is
>> pretty specific.
>>
>> Typically BQL-enabled NIC drivers clear packets out of some
>> internal buffer in their completion interrupt (or something
>> similar). And after that they call netdev_tx_completed_queue().
>
> You are right, because descriptors are freed at the same time when they are accounted for.
>
>>> to this implementation. But for long intervals this might result in
>>> some problems because you can just fill the veth_ring to its capacity
>>> quickly, increasing latency if the receiver is slow.
>> Yes, but I think the latency can only be approx. as big as the
>> interval (can be higher with GRO enabled).
>
> I think it's more like ~2.x times added latency in the worst case,
> which is normal BQL behavior, but for large intervals 2x is a lot.
> Of course this value is also capped by the time it takes to process
> the 256 packets that fit in the ring buffer.
>
> For example, for a test run with fq_codel and 20K rules, and a ping
> baseline of 1.7 ms without pktgen load, the results are:
> tx-usecs p99_ping_rtt (ms)
> -------- -----------------
> 0 5.223
> 100 6.910
> 500 7.197
> 1000 6.967
> 5000 16.233
> 10000 25.033
> 20000 44.133
>
> Also, the interval is sometimes dominated by the processing time of
> a packet batch (and other effects). The default is 8 packets/batch
> (see gro_normal_batch). This means that for 7 packets you see
> processing times on the order of a few hundred nanoseconds, but for
> the 8th you see the processing time of the whole batch (see table below).
> With a ruleset > 1000 entries, this exceeds the default tx-usecs value,
> which is why the p99 RTT increase is even larger than 2x tx-usecs
> -- as you can see in the table for tx-usecs values 0-5000.
>
> tx-usecs 0 nrules 0
> Percentile per-packet processing time (µs)
> ---------- -------------
> p5 0.316
> p25 0.382
> p50 0.442
> p75 0.508 -- fast enqueue to skb_list
> p95 68.304 -- batch processing time
>
> Setting gro_normal_batch to 1 makes the interval times more
> accurate and improves p99 RTT slightly, though it's still > 2x the
> tx-usecs latency increase.
>
> However, using smaller values > 0 means that you can benefit
> from BQL, but the actual interval might be larger than the
> configured value. I do not think this is a deal breaker, but
> it is worth noting where the latencies come from.

I understand and agree with your points.
I also do not see anything deal-breaking.
Yes, there is no guarantee for the latency, but BQL
is best-effort anyway.

I will wait for your new measurements, but there is no argument
against a default tx-usecs of ~100us for now, right?
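
As a rough illustration of the bound discussed above, here is a small
userspace sketch (not kernel code). It assumes the worst-case added
latency is about twice the completion interval, capped by the time the
consumer needs to drain a full ring of 256 packets. The helper name, the
plain factor of 2 and the per-packet time parameter are assumptions made
for illustration only; the sketch ignores GRO batching, which the
measurements quoted above show can push the real value higher.

```c
#include <assert.h>
#include <stdint.h>

/* Ring capacity mentioned in the thread (256 packets fit in the ring). */
#define RING_SIZE 256u

/*
 * Hypothetical helper: rough estimate of the latency added by time-based
 * completion coalescing, in nanoseconds.
 * Assumption: the worst case is ~2x the interval, but never more than the
 * time the consumer needs to drain a completely full ring.
 */
static uint64_t est_added_latency_ns(uint64_t tx_usecs, uint64_t per_pkt_ns)
{
	uint64_t by_interval = 2 * tx_usecs * 1000;
	uint64_t by_ring = (uint64_t)RING_SIZE * per_pkt_ns;

	assert(per_pkt_ns > 0);
	return by_interval < by_ring ? by_interval : by_ring;
}
```

With tx-usecs=100 and a consumer needing 10 us per packet, the interval
term (200 us) is the smaller one; with tx-usecs=20000 the ring term
(2.56 ms) takes over.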

>
> Also worth noting: fq_codel is the better choice in this
> case because it can fast-track sparse flows. With sfq the
> ping packet can get head-of-line blocked by the current quantum
> of sfq, if my understanding is correct.

Why don't we use a prio qdisc where ping gets prioritized?
Then we do not have to worry about head of line blocking on
the qdisc layer and instead can concentrate on veth.

>
>>> To illustrate this have a look at [1]. There are some plots that
>>> show the rtt vs. tx-usec config depending on nrules.
>>>
>>> [1] https://github.com/jkoeppeler/veth-backpressure-performance-testing/tree/pktgen-and-benchmark/results/tx-usecs
>>>
>> Nice plots! Great for finding a sane default value.
>
> I did some more measurements to better understand what's
> happening. I will upload the results by tomorrow at the latest
> to the github repository mentioned above.
>
>>>> 2. sched_clock() is only valid on the same CPU. When a different
>>>> CPU starts executing its sched_clock() can be in the past compared
>>>> to the sched_clock() value saved by the previous CPU.
>>>> My trick...
>>>> min(s->time, sched_clock())
>>>> ... avoids potentially extremely long intervals between
>>>> netdev_tx_completed_queue() calls but is not perfect of course.
>>>> I think CPU hopping happens rarely enough for this not to matter.
>>>> And also we have to keep this in mind [1]:
>>>> "An architecture may or may not provide an implementation of
>>>> sched_clock() on its own. If a local implementation is not provided,
>>>> the system jiffy counter will be used as sched_clock()."
>>> So the problem with this is that with jiffies you have millisecond
>>> interval granularity, which might be too coarse to work properly.
>>> Given the receiver completes 128 packets in 1ms (queue_completed interval),
>>> BQL will set the limit at 256. Then the tx thread can quickly fill
>>> the ring, and it is then basically stopped until the 1ms interval is over.
>>>
>>
>> Linux Mint only has 100Hz jiffies :^)
>
> ok so 10ms granularity. Effectively disables BQL for the most part.
> However, you can still benefit if your receive side is very slow.
>
>>
>> I think you misunderstand something.
>>
>> + if (peer_txq && state->n_bql && ptr_ring_empty(&rq->xdp_ring)) {

// Pairs with smp_wmb() in __ptr_ring_produce()

>> + smp_rmb();
>> + if (test_bit(__QUEUE_STATE_STACK_XOFF, &peer_txq->state))
>> + veth_bql_complete(state, peer_txq);
>> + }
>>
>> This snippet completes whenever the queue is empty and the
>> netdev queue is stopped due to BQL.
>> So then the interval is smaller than 1ms.
>> I think that is the reason why the pps is fine for 1ms in
>> your benchmarks.
>
> Without that snippet the queue stalls during ramp-up and
> throughput never gets off the ground. But once the BQL limit has
> stabilized, the snippet is no longer invoked -- with --nrules 0
> the queue is never stopped, and with a slow receiver the queue is
> not empty. So the snippet is needed for correct ramp-up, but it
> doesn't affect steady-state throughput in these tests. This is also
> confirmed by a test where I measure the time between intervals.
> For 1000us across all nrules configurations the time ranges between
> 1000-1400us.

Yes, I agree. It only acts in the case of starvation (see [1])
and I think it is an advantage that we react faster compared
to other BQL implementations.

Additionally the snippet is *required* because if the ring
is empty and the txq stopped due to BQL, there are no more
veth_xdp_rcv() calls, permanently stalling everything.
That is the reason for the memory barrier pairing in the snippet.

[1] Link: https://medium.com/@tom_84912/byte-queue-limits-the-unauthorized-biography-61adc5730b83
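
To spell out the starvation case, here is a minimal userspace model of
the check in the snippet. It is only a sketch: all toy_* names are
hypothetical, there are no memory barriers (the real code needs the
smp_rmb() pairing), and the completion is assumed to always restart the
queue because everything in flight has already been consumed.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-ins for the kernel objects; all names are hypothetical. */
struct toy_txq {
	bool stack_xoff;	/* models __QUEUE_STATE_STACK_XOFF */
};

struct toy_bql_state {
	unsigned int n_bql;	/* completions accumulated, not yet reported */
};

struct toy_ring {
	unsigned int len;	/* entries currently queued in the ring */
};

/* Report the accumulated completions; in this toy that restarts the queue. */
static void toy_bql_complete(struct toy_bql_state *s, struct toy_txq *txq)
{
	assert(s->n_bql > 0);
	s->n_bql = 0;
	txq->stack_xoff = false;
}

/*
 * Flush early only when the ring is empty AND the producer is stopped by
 * BQL. Without this, nobody would enqueue again, so no further poll would
 * run and the pending completions would never be reported.
 * Returns true if a flush happened.
 */
static bool toy_flush_if_starved(const struct toy_ring *ring,
				 struct toy_bql_state *s,
				 struct toy_txq *txq)
{
	if (txq && s->n_bql && ring->len == 0 && txq->stack_xoff) {
		toy_bql_complete(s, txq);
		return true;
	}
	return false;
}
```

In steady state neither condition holds at the same time (queue never
stopped with a fast receiver, ring never empty with a slow one), which
matches the observation that the snippet only matters during ramp-up.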

>
> The variance of the time between intervals is also quite good in
> general, but not always exactly the configured value. In most cases
> it is within +-5% of the p50 value of the real interval.
> Migrations between CPUs can also shorten an interval: if the new
> CPU's sched_clock() is ahead of the previous one (say by 15 ms),
> the apparent elapsed time within the current interval is inflated,
> so only ~5 ms of real time remains before the deadline fires (with
> the configured tx-usecs at 20 ms). Without migrations the interval
> stays consistently close to the configured value. I don't think
> this is a problem in practice -- it just adds some variance to the
> p5 of the measured interval.

Yes, and then we mistakenly decrease the BQL limit once.
But we react fast to that starvation, so it's fine.
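
For reference, the clamping idea can be sketched in userspace like this.
It is not the patch code: the struct, the function name and passing the
clock value in as a parameter (standing in for sched_clock() on the
current CPU) are assumptions for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-queue coalescing state. */
struct toy_coal {
	uint64_t start_ns;	/* start of the current coalescing interval */
	uint64_t interval_ns;	/* configured tx-usecs, in nanoseconds */
};

/*
 * Returns true when the coalescing deadline has passed.
 * 'now_ns' stands in for sched_clock() read on the current CPU.
 */
static bool toy_deadline_expired(struct toy_coal *s, uint64_t now_ns)
{
	/*
	 * start = min(start, now): if the clock appears to have gone
	 * backwards (e.g. after a CPU migration), restart the interval at
	 * 'now' so we wait at most one interval instead of a very long time.
	 */
	if (now_ns < s->start_ns)
		s->start_ns = now_ns;

	return now_ns - s->start_ns >= s->interval_ns;
}
```

A clock that jumps backwards is clamped, while a clock that jumps forward
(the 15 ms example above) simply makes the deadline fire early, which is
the shortened interval Jonas describes.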

>
>>> I don't know how much this matters in practice. Not sure which
>>> architectures still hit the jiffies fallback. When in doubt we could
>>> disable BQL for those, or pin it to a "virtual queue size" via
>>> limit_min / limit_max similar to the v5 behavior and call
>>> queue_completed for every packet. Wdyt?
>> Yes, probably not relevant, I think I would just disable it for those
>> architectures, avoiding possible regressions.
>>
>>>> 3. Inflight can be stuck at a value>0 for a long time when packet
>>>> enqueueing stops. Only when packets are enqueued again,
>>>> (on the next veth_xdp_rcv() call,) netdev_tx_completed_queue() is
>>>> executed and inflight is set to 0 again.
>>>> Can also be seen when looking at the /sys BQL statistics.
>>> This happens if the ring empties before the next queue_completed fires,
>>> right?
>>>
>> Exactly.
>>
>>> Another thing: Is it counterintuitive to set the tx_usec_coal on the
>>> receive device? Because this is a bql related config that is normally
>>> configured on the TX side?
>> Yes, I would say so.
>>
>>>> BTW: Yesterday, I worked on and refactored the code into its own .h
>>>> file as a library and it also works fine for TUN/TAP (+vhost-net)
>>>> for me :)
>>> Nice! Thank you.
>>> Jonas
>>>
>>>> Thanks for your work!
>>>> Simon
>>>>
>>>> [1] Link: https://docs.kernel.org/timers/timekeeping.html
>>>>
>>>>> [1]https://github.com/jkoeppeler/veth-backpressure-performance-testing/tree/pktgen-and-benchmark
>>>>>
>>>>> Thanks,
>>>>> Jonas
>>>>>
>>>>>> Thanks!
>>>>>>
>>>>>> BTW: I think that this implementation could also work for other
>>>>>> software interfaces.
>>>>>>
>>>>>>> [2] Link:https://medium.com/@tom_84912/byte-queue-limits-the-unauthorized-biography-61adc5730b83
>>>>>>>
>>>>>>>> There is an important gotcha. We actually have micro-bursts of queuing
>>>>>>>> (likely due to scheduling noise). Reading BQL stats from /sys will show
>>>>>>>> BQL inflight=1, but when using the option --hist it is visible that
>>>>>>>> @inflight has a long tail (see below signature). The "qdisc" output
>>>>>>>> line also shows this happening via requeues increasing (approx 17/sec in
>>>>>>>> a test with 567Kpps). (this was with the time-based BQL impl).
>>>>>>> I understand..
>>>>>>>