Re: [PATCH net-next v5 3/5] veth: implement Byte Queue Limits (BQL) for latency reduction

From: Simon Schippers

Date: Fri May 08 2026 - 04:03:54 EST

On 5/7/26 22:45, Jesper Dangaard Brouer wrote:
>
>
> On 07/05/2026 22.12, Simon Schippers wrote:
>> On 5/7/26 21:09, Jesper Dangaard Brouer wrote:
>>>
>>>
>>> On 07/05/2026 16.46, Simon Schippers wrote:
>>>>
>>>>
>>>> On 5/7/26 16:34, Paolo Abeni wrote:
>>>>> On 5/7/26 8:54 AM, Simon Schippers wrote:
>>>>>> On 5/5/26 15:21, hawk@xxxxxxxxxx wrote:
>>>>>>> @@ -928,9 +968,13 @@ static int veth_xdp_rcv(struct veth_rq *rq, int budget,
>>>>>>> }
>>>>>>> } else {
>>>>>>> /* ndo_start_xmit */
>>>>>>> - struct sk_buff *skb = ptr;
>>>>>>> + bool bql_charged = veth_ptr_is_bql(ptr);
>>>>>>> + struct sk_buff *skb = veth_ptr_to_skb(ptr);
>>>>>>> stats->xdp_bytes += skb->len;
>>>>>>> + if (peer_txq && bql_charged)
>>>>>>> + netdev_tx_completed_queue(peer_txq, 1, VETH_BQL_UNIT);
>>>>>>
>>>>>> In the discussion with Jonas [1], I left a comment explaining why I think
>>>>>> this doesn’t work.
>>>>>>
>>>
>>> I've experimented with doing the "completion" at NAPI-end in
>>> veth_poll(), but that resulted in BQL limit being 128 packets, which
>>> leads to bad latency results (not acceptable).
>>> (See detailed report later)
>>>
>>>
>>>>>> I still think first that adding an option to modify the hard-coded
>>>>>> VETH_RING_SIZE is the way to go.
>>>>>>
>>>
>>> Not against being able to modify VETH_RING_SIZE, but I don't think it is
>>> the solution here.
>>>
>>> The simple solution is to configure the BQL limit_min:
>>> `/sys/class/net/<dev>/queues/tx-N/byte_queue_limits/limit_min`
>>>
>>> My experiments (below) find that limit_min=8 gives good performance.
>>> We can simply set default to 8 as this still allows userspace to change
>>> this later if lower latency is preferred.
>>>
>>>>>> Thanks!
>>>>>>
>>>>>> [1] Link: https://lore.kernel.org/netdev/e8cdba04-aa9a-45c6-9807-8274b62920df@xxxxxxxxxxxxxx/
>>>>>
>>>>> In the above discussion a 20% regression is reported, which IMHO can't
>>>>> be ignored. Still the tput figures in the data are extremely low,
>>>>> something is possibly off?!? I would expect a few Mpps with pktgen on
>>>>> top of veth, while the reported data is ~20-30Kpps.
>>>>>
>>>>> /P
>>>>>
>>>>
>>>> The ~20-30Kpps occur when thousands of iptables rules are applied and
>>>> an UDP userspace application is sending.
>>>>
>>>> And there is a 20% pktgen regression (no iptables rules applied).
>>>>
>>>
>>> The pktgen test is a little dubious/weird and Jonas had to modify pktgen
>>> to test this. John Fastabend added a config to pktgen that allows us
>>> to benchmark the egress qdisc path, so it might be better to use that.
>>> samples/pktgen/pktgen_bench_xmit_mode_queue_xmit.sh is a demo of its usage.
>>>
>>> If redoing the tests, can you adjust limit_min to see the effect?
>>> /sys/class/net/<dev>/queues/tx-N/byte_queue_limits/limit_min
>>>
>>> A 20% throughput regression is of course too much, but I will
>>> remind us that adding a qdisc will "cost" some overhead, and that is a
>>> configuration choice. Our purpose here is to reduce bufferbloat and
>>> latency, not optimize for throughput.
>>>
>>>
>>>> I am pretty sure the reason is because the BQL limit is stuck at 2
>>>> packets (because the completed queue is always called with 1 packet
>>>> and not in an interrupt/timer with multiple packets...).
>>>>
>>>
>>> I've run a lot of experiments, which I had an AI summarize in a report, see attachment. The TL;DR is that the best performance vs latency tradeoff is defaulting the BQL/DQL limit_min to 8 packets.
>>>
>>> I fear this patchset will stall forever, if we keep searching for a perfect solution without any overhead. The qdisc layer will be a baseline overhead. The limit=2 packets is actually the optimal darkbuffer queue size, but I acknowledge that this causes too many qdisc requeue events (leading to overhead). I suggest that I add another patch in V6, that defaults limit_min to 8 (separate patch to make it easier to revert/adjust later).
>>>
>>> I've talked with Jonas, and we want to experiment with different solutions to make BQL/DQL work better with virtual devices.
>>>
>>> This patchset helps our (production) use-case reduce mice-flow latency
>>> under load from approx 22ms to 1.3ms. Because the consumer
>>> namespace is the bottleneck, the requeue overhead is negligible in
>>> comparison.
>>>
>>> -Jesper
>>
>> First of all, thanks for your work. I really see the advantages of
>> avoiding bufferbloat :)
>>
>> But the key part of the BQL algorithm, which is the *dynamic* adaptation
>> of the limit, is not working. Always calling netdev_completed_queue() with
>> 1 packet results in a static limit of 2 packets (as seen in Jonas'
>> measurements), which you force up to 8 packets.
>>
>> So in the end this patchset has the same effect as just setting
>> VETH_RING_SIZE to 8 (and giving an option to change this value).
>>
>
> I've coded up a time-based BQL implementation, see attachment.
> WDYT?
>
> --Jesper
>

A step in the right direction, but I dislike that you call
netdev_sent_queue() with at least 1 packet (never 0 packets).
I am not sure if it works, and I am not sure about the parameter.

I would propose doing it like other BQL implementations do
(for example usbnet for which I adapted BQL [1] :) ):

Call netdev_sent_queue() with n_bql in a periodic work. n_bql would
still be counted in veth_xdp_rcv() like you currently do (synchronized
with the work via ring.consumer_lock?).

The only weird thing that remains is that BQL's inflight != number of
packets in the ring and BQL's limit != "current ring size". Instead,
the BQL limit describes the maximum number of packets allowed between
calls of netdev_sent_queue(), which occur periodically at a roughly
fixed time interval.
I guess that could be fine, but it surely needs testing.
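
To make the batching idea concrete, here is a rough userspace sketch of the
counting scheme only. It is not kernel code and not taken from any patch:
struct bql_batch and the bql_batch_*() helpers are made-up names, the
periodic work is modelled as a plain function call, and an atomic exchange
stands in for whatever synchronization (ring.consumer_lock or otherwise)
the real implementation would use. The point it illustrates is that the
consumer only counts, and the periodic flush reports the whole batch, which
may be 0.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical model; none of these names exist in veth.c. */
struct bql_batch {
	atomic_uint n_bql;      /* BQL-charged packets consumed since last flush */
	unsigned int reported;  /* total packets handed to BQL so far */
	unsigned int flushes;   /* number of periodic reports, empty ones included */
};

static void bql_batch_init(struct bql_batch *b)
{
	atomic_init(&b->n_bql, 0);
	b->reported = 0;
	b->flushes = 0;
}

/* Consumer side (veth_xdp_rcv() in this thread): count only, never report. */
static void bql_batch_consume(struct bql_batch *b, unsigned int pkts)
{
	atomic_fetch_add(&b->n_bql, pkts);
}

/*
 * Periodic work: take the whole batch and reset the counter in one step.
 * The batch is reported even when it is 0, so BQL also sees idle intervals.
 * Returns the number of packets reported for this interval.
 */
static unsigned int bql_batch_flush(struct bql_batch *b)
{
	unsigned int n = atomic_exchange(&b->n_bql, 0);

	/* In the kernel, the BQL call with n packets would go here. */
	assert(b->reported + n >= b->reported); /* no wraparound in this toy model */
	b->reported += n;
	b->flushes++;
	return n;
}
```

With this model, consuming 3 and then 2 packets followed by one flush
reports a batch of 5, and a second flush with no traffic in between reports
a batch of 0.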

[1] Link: https://lore.kernel.org/netdev/20251106175615.26948-1-simon.schippers@xxxxxxxxxxxxxx/