Re: [PATCH net-next 0/6] net/mlx5e: Speedup channel configuration operations

From: Toke Høiland-Jørgensen

Date: Thu Nov 13 2025 - 08:16:42 EST


Tariq Toukan <ttoukan.linux@xxxxxxxxx> writes:

> On 12/11/2025 18:33, Toke Høiland-Jørgensen wrote:
>> Tariq Toukan <ttoukan.linux@xxxxxxxxx> writes:
>>
>>> On 12/11/2025 12:54, Toke Høiland-Jørgensen wrote:
>>>> Tariq Toukan <tariqt@xxxxxxxxxx> writes:
>>>>
>>>>> Hi,
>>>>>
>>>>> This series significantly improves the latency of channel configuration
>>>>> operations, like interface up (create channels), interface down (destroy
>>>>> channels), and channels reconfiguration (create new set, destroy old
>>>>> one).
>>>>
>>>> On the topic of improving ifup/ifdown times, I noticed at some point
>>>> that mlx5 will call synchronize_net() once for every queue when they are
>>>> deactivated (in mlx5e_deactivate_txqsq()). Have you considered changing
>>>> that to amortise the sync latency over the full interface bringdown? :)
>>>>
>>>> -Toke
>>>>
>>>>
>>>
>>> Correct!
>>> This can be improved and I actually have WIP patches for this, as I'm
>>> revisiting this code area recently.
>>
>> Excellent! We ran into some issues with this a while back, so would be
>> great to see this improved.
>>
>> -Toke
>>
>
> Can you elaborate on the test case and issues encountered?
> To make sure I'm addressing them.

Sure, thanks for taking a look!

The high-level issue we've been seeing involves long delays creating and
tearing down OpenShift (Kubernetes) pods that have SR-IOV devices
assigned to them. The worst example involved a test that basically
reboots an application (tearing down its pods and immediately recreating
them), which takes up to ~10 minutes for ~100 pods.

Because a lot of the wait happens with the RTNL held, we also get
cascading errors to other parts of the system. This is how I ended up
digging into what the mlx5 driver was doing while holding the RTNL,
which is where I noticed the "synchronize_net() in a loop" behaviour.
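
To illustrate the pattern, here is a rough userspace sketch. It is not
the driver code: fake_synchronize_net() is a stand-in that only counts
calls, and struct fake_txq and both teardown helpers are made-up names.
It just contrasts paying one sync per queue with paying one sync for
the whole set, and assumes nothing needs to happen between disabling a
queue and the sync that would rule out batching:

```c
#include <stddef.h>

/* Stand-in for the kernel's synchronize_net(); it only counts calls. */
static unsigned int sync_calls;

static void fake_synchronize_net(void)
{
	sync_calls++;
}

/* Hypothetical queue state, not the real mlx5e structures. */
struct fake_txq {
	int enabled;
};

/* The "sync in a loop" shape: one grace-period wait per queue. */
static unsigned int teardown_sync_per_queue(struct fake_txq *q, size_t n)
{
	sync_calls = 0;
	for (size_t i = 0; i < n; i++) {
		q[i].enabled = 0;
		fake_synchronize_net();
	}
	return sync_calls;
}

/* Amortised shape: disable every queue first, then wait once. */
static unsigned int teardown_sync_once(struct fake_txq *q, size_t n)
{
	sync_calls = 0;
	for (size_t i = 0; i < n; i++)
		q[i].enabled = 0;
	if (n > 0)
		fake_synchronize_net();
	return sync_calls;
}
```

With n queues the first helper waits n times and the second waits once,
so the gap grows with the channel count.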

We're working on reducing the blast radius of the RTNL in general, but
the setup/teardown time seems to be driver-specific, so any improvements
here would be welcome, I guess :)

-Toke