Re: [PATCH net-next v5 3/5] veth: implement Byte Queue Limits (BQL) for latency reduction

From: Simon Schippers

Date: Sat May 23 2026 - 12:10:20 EST

On 5/22/26 18:26, Jonas Köppeler wrote:
> On 5/22/26 10:41, Simon Schippers wrote:
>> On 5/22/26 09:14, Jonas Köppeler wrote:
>>> On 5/19/26 10:51 PM, Simon Schippers wrote:
>>>> On 5/12/26 23:55, Simon Schippers wrote:
>>>>> On 5/12/26 15:54, Jesper Dangaard Brouer wrote:
>>>>>>>> Nope, I'm using a bpftrace program to keep track of the inflight/limit
>>>>>>>> in a BPF hashmap. Reading from /sys will not be accurate.
>>>>>>> Ah nice.
>>>>>> Add the option --hist to have both NAPI and BQL histograms printed when
>>>>>> script ends. This will give you an accurate pattern of how inflight and
>>>>>> limit evolves.
>>>>>>
>>>>>>>> I moved the selftests into a github repo [1] to allow us to collaborate
>>>>>>>> and evaluate the changes more easily. I explicitly kept the new BPF
>>>>>>>> based BQL tracking as a commit[2] for your benefit.
>>>>>>>>
>>>>>>>> [1]https://github.com/netoptimizer/veth-backpressure-performance-testing/tree/main/selftests
>>>>>>>>
>>>>>>>> [2]https://github.com/netoptimizer/veth-backpressure-performance-testing/commit/f25c5dc92977
>>>>>>> Thanks for sharing. After minor issues I was able to set it up
>>>>>>> (currently I am just using plain v5, will look at the coalescing patch
>>>>>>> when I find the time):
>>>>>>>
>>>>>>> Can confirm the latency reduction with the default settings, in my case
>>>>>>> 4.888ms to 0.241ms.
>>>>>>>
>>>>>>> With the same script I was also able to see a performance slow down:
>>>>>>> veth_bql_test_virtme.sh --qdisc fq_codel --nrules 0
>>>>>>> --> ~510 Kpps
>>>>>>> Same with --bql-disable
>>>>>>> --> ~570 Kpps
>>>>>>> --> 12% faster
>>>>>>>
>>>>>> Thanks for running these benchmarks.
>>>>>>
>>>>>> Notice that --nrules 0 can easily result in no-queuing (on average),
>>>>>> because the veth NAPI consumer is faster than the producer. You will
>>>>>> likely see BQL inflight=1 and sink reported avg latency very low
>>>>>> (remember, it is okay that the sink gets a high latency penalty as long
>>>>>> as ping latency remains low, as that shows AQM is working).
>>>>> I ran the benchmarks with --hist and I see what you mean.
>>>>> I have very similar results.
>>>>>
>>>>> Is Jonas' way [1] of modifying pktgen maybe the best option to ensure
>>>>> that the producer is faster than the consumer?
>>>>>
>>>>> [1] Link:https://lore.kernel.org/netdev/e8cdba04-aa9a-45c6-9807-8274b62920df@xxxxxxxxxxxxxx/
>>>>>
>>>>>> Hi, so what I found is that pktgen does not respect
>>>>>> __QUEUE_STATE_STACK_XOFF. So the test data above is invalid, since it
>>>>>> just sent packets even if the BQL "stopped" the queue. So I patched
>>>>>> pktgen with the following:
>>>>>>
>>>>>> - if (unlikely(netif_xmit_frozen_or_drv_stopped(txq))) {
>>>>>> + if (unlikely(netif_xmit_frozen_or_stopped(txq))) {
>>>>> After thinking more about the implementation I see possible issues:
>>>>>
>>>>> 1. netdev_tx_completed_queue() never reports more than burst=64 packets:
>>>>>
>>>>> BQL only increments the limit if the queue was starved. That means:
>>>>> "The queue was over-limit in the last interval (the last time completion
>>>>> processing ran), and there is no more data in the queue (i.e. it’s
>>>>> empty)" [2]
>>>>> But as only 64 packets are reported at max, the queue can only grow when
>>>>> it is <= 64 packets. And then it can only stay at a limit >64 until the
>>>>> next decrease of the limit.
>>>>>
>>>>>
>>>>> 2. netdev_tx_completed_queue() is called in irregular intervals:
>>>>>
>>>>> If the consumer is slow it is called approx each tx_coal_usecs.
>>>>> But if the consumer is fast it is called way more frequently, probably
>>>>> in irregular intervals depending on the scheduling.
>>>>> However, "BQL depends on periodic completion interrupts" [2].
>>>>>
>>>>> --> How about adding something like an interrupt that triggers every
>>>>> 10us and calls netdev_tx_completed_queue() with n_bql collected from
>>>>> (multiple) veth_xdp_rcv runs? That could solve 1. and 2.
>>>> Hi,
>>>>
>>>> I worked on a new version (see attachment) that addresses both issues.
>>>>
>>>> The major change is that instead of tracking the timestamp and packet
>>>> count as local variables in veth_xdp_rcv(), they are now stored
>>>> persistently in veth_rq as struct veth_bql_state. This allows completions
>>>> to accumulate across multiple NAPI poll calls, so
>>>> netdev_tx_completed_queue() can report more than 64 packets at once
>>>> (see point 1). To get the time I am using (the fast) sched_clock() with
>>>> a trick to avoid issues when switching between CPUs.
>>>>
>>>> For point 2, the coalescing deadline is now checked both before the
>>>> receive loop (to flush completions that timed out since the previous
>>>> poll) and after each consumed packet, making completion intervals more
>>>> regular. Still the intervals can be smaller than
>>>> VETH_BQL_COAL_TX_USECS, but I guess this is fine.
>>>>
>>>> I also found out that the BQL limit correlates closely with
>>>> VETH_BQL_COAL_TX_USECS. It essentially reflects the latency we are
>>>> targeting. I raised the default to 100 µs to allow DQL to converge to a
>>>> higher limit (for reaching 255 in the testing below).
>>>>
>>>> With the patched pktgen (respecting __QUEUE_STATE_STACK_XOFF), testing
>>>> shows:
>>>> - --nrules 0: DQL limit reaches (up to) ~255
>>>> - --nrules 10000: DQL limit converges to ~0 (with --gro-disable)
>>>>
>>>> These results are what I would expect from a BQL algorithm, but more
>>>> testing is needed of course.
>>>>
>>>> What do you think?
>>> Hi,
>>>
>>> This is exactly what I had in mind for implementing the BQL algorithm
>>> in this case. I did some testing with pktgen of this patch and also
>>> compared it to the v5 version.
>>>
>>> You can find an extension of the benchmark script with pktgen here [1],
>>> as well as a wrapper script (veth_bql_bench.sh) to run the test script
>>> with and without --bql-disable to report the difference. I also
>>> configured pktgen to use the qdisc as suggested by Jesper.
>> Great, I will use your pktgen solution from now on.
>>
>> Didn't know about the qdisc option, is there a performance difference
>> with/without it? Or is it to have ping working next to pktgen?
>
> Did not see any big difference performance-wise, but as you say, ping
> works better with pktgen then.
>
>> Consider to do a pull request :)
>>
>>> Note: bpftrace needs to be disabled, otherwise it becomes the
>>> bottleneck (at least on my machine) and pktgen throughput is halved
>>> when enabled.
>> Good to know.
>>
>>> Here are the results:
>>>
>>> v5 (not time-based):
>>> --nrules 0 --pktgen --no-bpftrace
>>> ========================================
>>> Results (average over 10 runs):
>>> ========================================
>>> BQL on BQL off
>>> --- ------ -------
>>> Throughput (pps) 1980871 2169898
>>> Ping RTT avg (ms) 0.065 0.162
>>> Throughput diff -8.7% // BQL 8.7% lower throughput
>>> RTT diff -59.9% // BQL 60% lower latency
>>> ========================================
>>>
>>> Simon's time-based version:
>>>
>>> Test args: --nrules 0 --pktgen --no-bpftrace
>>> ========================================
>>> Results (average over 10 runs):
>>> ========================================
>>> BQL on BQL off
>>> --- ------ -------
>>> Throughput (pps) 2166335 2153398
>>> Ping RTT avg (ms) 0.165 0.165
>>> Throughput diff 0.6%
>>> RTT diff 0.0%
>>>
>>> --pktgen --no-bpftrace --nrules 3500
>>> ========================================
>>> Results (average over 10 runs):
>>> ========================================
>>> BQL on BQL off
>>> --- ------ -------
>>> Throughput (pps) 28569 28696
>>> Ping RTT avg (ms) 1.327 8.409
>>> Throughput diff -0.4%
>>> RTT diff -84.2%
>>>
>> I think we should run benchmarks against the stock net-next to
>> be safe.
>
> --nrules 0 --pktgen --no-bpftrace
> ========================================
> Results (average over 10 runs):
> ========================================
> net-next
> --- -------
> Throughput (pps) 2285421
> Ping RTT avg (ms) 0.161
>
> (slightly adjusted the output to better communicate the results)
>
> So in my case this means BQL implementation has ~5% lower throughput
> compared to net-next. But please double check.
>

Yes, I will run the benchmarks myself.

There are probably some places where we can gain performance.
For example, the ptr_ring_empty() call could be swapped for
__ptr_ring_empty(), which would save taking and releasing a spinlock.

>>> Seems to work now as expected.
>> Yes, but I think we have to keep these points in mind:
>>
>> 1. Limit/Inflight can be bigger than VETH_RING_SIZE, because
>> packets can be enqueued at the same time as they are read out,
>> so netdev_tx_completed_queue() can theoretically be called with
>> a large number of packets.
>> I do not think it is deal-breaking though.
>> I could see such high limits/inflights when looking at the /sys
>> BQL statistics.
>
> For me this makes sense, that inflight just means the number of
> packets not yet 'completed' or the number of packets that you
> can send between two completion calls. I think this is not specific

From my understanding, I do think that this behavior is
pretty specific.

Typically BQL-enabled NIC drivers clear packets out of some
internal buffer in their completion interrupt (or something
similar). And after that they call netdev_tx_completed_queue().
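
To illustrate why I call this specific: below is a minimal userspace
sketch (not the patch code) of how inflight can exceed the ring size when
completions are deferred. All names (veth_model, model_xmit, ...) are
hypothetical, and the model deliberately ignores that BQL would stop the
queue once inflight exceeds the limit.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model: names and fields are illustrative only. */
struct veth_model {
	unsigned int ring_size; /* capacity of the ring */
	unsigned int ring_len;  /* packets currently sitting in the ring */
	unsigned int inflight;  /* sent, but not yet reported as completed */
	unsigned int pending;   /* consumed, but completion not yet reported */
};

/* Producer side: enqueue one packet and account it as sent. */
static bool model_xmit(struct veth_model *m)
{
	if (m->ring_len == m->ring_size)
		return false; /* ring full */
	m->ring_len++;
	m->inflight++;
	return true;
}

/* Consumer side: dequeue one packet, but defer the completion report. */
static bool model_consume(struct veth_model *m)
{
	if (!m->ring_len)
		return false; /* ring empty */
	m->ring_len--;
	m->pending++;
	return true;
}

/* Deferred completion: report everything consumed since the last call. */
static void model_complete(struct veth_model *m)
{
	assert(m->pending <= m->inflight);
	m->inflight -= m->pending;
	m->pending = 0;
}
```

With a ring of 4 entries, filling it, draining it and filling it again
before model_complete() runs leaves inflight at 8 while only 4 packets
are actually queued.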

> to this implementation. But for long intervals this might result in
> some problems because you can just fill the veth_ring to its capacity
> quickly, and increasing latency if the receiver is slow.

Yes, but I think the latency can only be approx. as big as the
interval (can be higher with GRO enabled).

> To illustrate this have a look at [1]. There are some plots that
> show the rtt vs. tx-usec config depending on nrules.
>
> [1] https://github.com/jkoeppeler/veth-backpressure-performance-testing/tree/pktgen-and-benchmark/results/tx-usecs
>

Nice plots! Great for finding a sane default value.

>> 2. sched_clock() is only valid on the same CPU. When a different
>> CPU starts executing its sched_clock() can be in the past compared
>> to the sched_clock() value saved by the previous CPU.
>> My trick...
>> min(s->time, sched_clock())
>> ... avoids potentially extremely long intervals between
>> netdev_tx_completed_queue() calls but is not perfect of course.
>> I think CPU hopping happens rarely enough for this not to matter.
>> And also we have to keep this in mind [1]:
>> "An architecture may or may not provide an implementation of
>> sched_clock() on its own. If a local implementation is not provided,
>> the system jiffy counter will be used as sched_clock()."
>
> So the problem with this is that with jiffies you have like millisecond
> interval granularity, which might be too long in order to work properly.
> Given the receiver completes 128 packets in 1ms (queue_completed interval),
> the bql will set the limit at 256. Then the tx thread can quickly fill
> the ring, and it is then basically stopped until the 1ms interval is over.
>

Linux Mint only has 100Hz jiffies :^)

I think you misunderstand something.

+	if (peer_txq && state->n_bql && ptr_ring_empty(&rq->xdp_ring)) {
+		smp_rmb();
+		if (test_bit(__QUEUE_STATE_STACK_XOFF, &peer_txq->state))
+			veth_bql_complete(state, peer_txq);
+	}

This snippet completes whenever the ring is empty and the
netdev queue is stopped due to BQL.
So in that case the completion interval is shorter than 1ms.
I think that is the reason why the pps is fine for 1ms in
your benchmarks.
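
A rough userspace sketch of that decision, in case it helps. This is not
the patch code: struct bql_model and bql_model_poll() are hypothetical
names, and I am assuming the only two triggers are the coalescing
deadline and the "ring empty + queue stopped" case from the snippet.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of the completion state; names are illustrative. */
struct bql_model {
	uint64_t last_ns;   /* time of the last reported completion */
	uint64_t coal_ns;   /* coalescing interval */
	unsigned int n_bql; /* consumed packets not yet reported */
};

/*
 * Returns how many packets would be reported to BQL at time now_ns
 * (0 means no completion is reported).
 */
static unsigned int bql_model_poll(struct bql_model *s, uint64_t now_ns,
				   bool ring_empty, bool stack_xoff)
{
	bool timed_out, starved;
	unsigned int reported;

	assert(now_ns >= s->last_ns); /* model assumes a monotonic clock */

	timed_out = now_ns - s->last_ns >= s->coal_ns;
	/* Ring drained while BQL has stopped the producer: complete early. */
	starved = ring_empty && stack_xoff;

	if (!s->n_bql || (!timed_out && !starved))
		return 0;

	reported = s->n_bql;
	s->n_bql = 0;
	s->last_ns = now_ns;
	return reported;
}
```

With coal_ns set to 1ms and 128 pending packets, a poll at 10us with an
empty ring and a stopped queue reports all 128 right away instead of
waiting for the deadline.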

> I don't know how much this matters in practice. Not sure which
> architectures still hit the jiffies fallback. When in doubt we could
> disable BQL for those, or pin it to a "virtual queue size" via
> limit_min / limit_max similar to the v5 behavior and call
> queue_completed for every packet. Wdyt?

Yes, probably not relevant, I think I would just disable it for those
architectures, avoiding possible regressions.
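
For reference, the min(s->time, sched_clock()) clamp from point 2 can be
modelled in userspace roughly as below. This is only a sketch under my
reading of the trick: elapsed_clamped() is a hypothetical helper, and the
timestamps are plain integers instead of real sched_clock() values.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: returns the time elapsed since *saved_ns.
 * If the current clock reads earlier than the saved timestamp (e.g. the
 * poll moved to a CPU whose clock is behind), the saved timestamp is
 * pulled back to now_ns, i.e. saved = min(saved, now).
 */
static uint64_t elapsed_clamped(uint64_t *saved_ns, uint64_t now_ns)
{
	assert(saved_ns);

	if (*saved_ns > now_ns)
		*saved_ns = now_ns;

	return now_ns - *saved_ns;
}
```

The effect is that after a backwards jump the next deadline is at most
one coalescing interval away on the new clock, rather than however far
the old timestamp was ahead.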

>
>> 3. Inflight can be stuck at a value>0 for a long time when packet
>> enqueueing stops. Only when packets are enqueued again,
>> (on the next veth_xdp_rcv() call,) netdev_tx_completed_queue() is
>> executed and inflight is set to 0 again.
>> Can also be seen when looking at the /sys BQL statistics.
>
> This happens if the ring empties before the next queue_completed fires,
> right?
>

Exactly.

> Another thing: Is it counterintuitive to set the tx_usec_coal on the
> receive device? Because this is a bql related config that is normally
> configured on the TX side?

Yes, I would say so.

>
>> BTW: Yesterday, I worked on and refactored the code into its own .h
>> file as a library and it also works fine for TUN/TAP (+vhost-net)
>> for me :)
>
> Nice! Thank you.
> Jonas
>
>>
>> Thanks for your work!
>> Simon
>>
>> [1] Link: https://docs.kernel.org/timers/timekeeping.html
>>
>>> [1]https://github.com/jkoeppeler/veth-backpressure-performance-testing/tree/pktgen-and-benchmark
>>>
>>> Thanks,
>>> Jonas
>>>
>>>> Thanks!
>>>>
>>>> BTW: I think that this implementation could also work for other
>>>> software interfaces.
>>>>
>>>>> [2] Link:https://medium.com/@tom_84912/byte-queue-limits-the-unauthorized-biography-61adc5730b83
>>>>>
>>>>>> There is an important gotcha. We actually have micro-bursts of queuing
>>>>>> (likely due to scheduling noise). Reading BQL stats from /sys will show
>>>>>> BQL inflight=1, but when using the option --hist it is visible that
>>>>>> @inflight has a long tail (see below signature). The "qdisc" output
>>>>>> line also shows this happening via requeues increasing (approx 17/sec in
>>>>>> a test with 567Kpps). (This was with the time-based BQL impl.)
>>>>> I understand..
>>>>>