Re: [PATCH net-next v5 3/5] veth: implement Byte Queue Limits (BQL) for latency reduction

From: Simon Schippers

Date: Wed May 27 2026 - 04:47:53 EST

On 5/27/26 09:38, Jesper Dangaard Brouer wrote:
>
>
> On 26/05/2026 17.07, Simon Schippers wrote:
>> On 5/26/26 16:55, Jonas Köppeler wrote:
>>> On 5/26/26 4:35 PM, Simon Schippers wrote:
>>>> On 5/26/26 11:54, Jonas Köppeler wrote:
>>>>> On 5/23/26 6:09 PM, Simon Schippers wrote:
>>>>>> On 5/22/26 18:26, Jonas Köppeler wrote:
>>>>>>> On 5/22/26 10:41, Simon Schippers wrote:
>>>>>>>> On 5/22/26 09:14, Jonas Köppeler wrote:
>>>>>>>>> On 5/19/26 10:51 PM, Simon Schippers wrote:
>>>>>>>>>> On 5/12/26 23:55, Simon Schippers wrote:
>>>>>>>>>>> On 5/12/26 15:54, Jesper Dangaard Brouer wrote:
>>>>>>>>>>>>>> Nope, I'm using a bpftrace program to keep track of the inflight/limit
>>>>>>>>>>>>>> in a BPF hashmap. Reading from /sys will not be accurate.
>>>>>>>>>>>>> Ah nice.
>>>>>>>>>>>> Add the option --hist to have both NAPI and BQL histograms printed when
>>>>>>>>>>>> script ends. This will give you an accurate pattern of how inflight and
>>>>>>>>>>>> limit evolves.
>>>>>>>>>>>>
>>>>>>>>>>>>>> I moved the selftests into a github repo [1] to allow us to collaborate
>>>>>>>>>>>>>> and evaluate the changes more easily. I explicitly kept the new BPF
>>>>>>>>>>>>>> based BQL tracking as a commit[2] for your benefit.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> [1] https://github.com/netoptimizer/veth-backpressure-performance-testing/tree/main/selftests
>
> [... cut ...]
>
>>>>
>>>> I will wait for your new measurements, but there is no argument
>>>> against a default tx-usecs of ~100us for now, right?
>>> Yes, I think 100us is perfectly fine. I guess most of it was
>>> just my curiosity why the latency values are as they are 🙂
>> Which is great, because I was wondering the same 🙂
>>
>
> Thank you Jonas and Simon for testing this via[1] on your systems.
>
> One performance concern from my side is if/when BQL limit goes below 8
> packets. This will cause cache-line bouncing and many qdisc requeues
> between the two CPUs. Notice that 8 packets for the ptr_ring is one

I think the qdisc requeues stem from commit dc82a33297fc ("veth: apply
qdisc backpressure on full ptr_ring to reduce TX drops").

This could be solved by stopping the netdev queue right after inserting
the last element, as I did in my tun/tap implementation [1].
I know you also did this in an earlier implementation, but the required
re-check used __ptr_ring_empty(), which the producer is not allowed to
call.
That is why I introduced the new __ptr_ring_check_produce() in [1],
which is safe for the producer to call.

[1] Link: https://lore.kernel.org/netdev/20260510151529.43895-5-simon.schippers@xxxxxxxxxxxxxx/
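
To make the "stop after inserting the last element" idea concrete, here
is a rough single-threaded userspace model. It is only a sketch: all
toy_* names are made up for illustration, it is not the veth or tun/tap
code, and it leaves out the memory barriers and locking a real
producer/consumer pair needs. The point is that the producer stops the
queue as soon as the ring becomes full, so the qdisc never hands over a
packet that has to be requeued.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_RING_SIZE 4

/* Toy stand-in for a ptr_ring: a NULL slot means "free". */
struct toy_ring {
	void *slot[TOY_RING_SIZE];
	size_t producer;
	size_t consumer;
};

/* Toy stand-in for a netdev TX queue (hypothetical, illustration only). */
struct toy_txq {
	bool stopped;
	unsigned long requeues;
};

/* Producer-side check: only looks at the slot the producer writes next. */
static bool toy_ring_full(const struct toy_ring *r)
{
	return r->slot[r->producer] != NULL;
}

static bool toy_ring_produce(struct toy_ring *r, void *pkt)
{
	if (toy_ring_full(r))
		return false;
	r->slot[r->producer] = pkt;
	r->producer = (r->producer + 1) % TOY_RING_SIZE;
	return true;
}

static void *toy_ring_consume(struct toy_ring *r)
{
	void *pkt = r->slot[r->consumer];

	if (pkt) {
		r->slot[r->consumer] = NULL;
		r->consumer = (r->consumer + 1) % TOY_RING_SIZE;
	}
	return pkt;
}

/* Transmit path: stop the queue right after filling the last free slot. */
static bool toy_xmit(struct toy_ring *r, struct toy_txq *q, void *pkt)
{
	if (!toy_ring_produce(r, pkt)) {
		/* Ring already full: this is the requeue case to avoid. */
		q->stopped = true;
		q->requeues++;
		return false;
	}
	if (toy_ring_full(r)) {
		q->stopped = true;
		/*
		 * Re-check with a producer-safe test: in a concurrent
		 * setting the consumer may have freed a slot in between.
		 */
		if (!toy_ring_full(r))
			q->stopped = false;
	}
	return true;
}

/* Consumer path: take one packet and wake the queue if it was stopped. */
static void *toy_poll_one(struct toy_ring *r, struct toy_txq *q)
{
	void *pkt = toy_ring_consume(r);

	if (pkt && q->stopped)
		q->stopped = false;
	return pkt;
}

/* Fill the ring, drain one packet, send one more; return requeue count. */
static unsigned long toy_demo(void)
{
	static int pkts[TOY_RING_SIZE + 1];
	struct toy_ring r = { 0 };
	struct toy_txq q = { 0 };
	size_t i;

	for (i = 0; i < TOY_RING_SIZE; i++)
		toy_xmit(&r, &q, &pkts[i]);
	assert(q.stopped);	/* stopped by the last successful insert */

	toy_poll_one(&r, &q);
	assert(!q.stopped);	/* consumer woke the queue */

	toy_xmit(&r, &q, &pkts[TOY_RING_SIZE]);
	return q.requeues;
}
```

Calling toy_demo() from a main() exercises the fill, drain and refill
sequence; the failed-insert branch in toy_xmit() is never taken because
the queue is always stopped before the ring can reject a packet.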

> cache-line. This is why I suggested defaulting BQL min_limit to be 8.
> This would work in combination with the tx-usecs coalesce tuning as a
> lower bound.

Yes, that would work, but I think it is fine as-is.
BQL only chooses such a small limit when the consumer is really slow
anyway.
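
For reference, the arithmetic behind the 8-packet figure and the
suggested lower bound can be sketched as below. This assumes 64-byte
cache lines and treats the limit as a packet count, as in the discussion
above; the toy_* helpers are hypothetical and are not the dql code.

```c
#include <assert.h>
#include <stddef.h>

/* Assumption: 64-byte cache lines, as on most current x86-64 systems. */
#define TOY_CACHE_LINE_BYTES 64

/* Number of ring slots (pointers) that share one cache line. */
static size_t toy_ptrs_per_cache_line(void)
{
	return TOY_CACHE_LINE_BYTES / sizeof(void *);
}

/* Lower bound on a computed limit: never go below min_limit. */
static unsigned int toy_clamp_limit(unsigned int limit,
				    unsigned int min_limit)
{
	return limit < min_limit ? min_limit : limit;
}

/* Small self-check of the two helpers. */
static void toy_selfcheck(void)
{
	unsigned int min_limit = (unsigned int)toy_ptrs_per_cache_line();

	/* A limit below the bound is raised to it. */
	assert(toy_clamp_limit(1, min_limit) == min_limit);
	/* A limit above the bound is left alone. */
	assert(toy_clamp_limit(min_limit + 5, min_limit) == min_limit + 5);
}
```

With 8-byte pointers, toy_ptrs_per_cache_line() gives 8, which is where
the "8 packets is one cache-line" figure comes from.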