Re: [PATCH net-next] net/mlx5: Allow asynchronous probe

From: Tariq Toukan

Date: Thu Mar 05 2026 - 02:59:11 EST




On 03/03/2026 12:33, Gerd Bayer wrote:
Announce that mlx5_core supports asynchronous probing.


Hi Gerd,
Interesting patch.

Tests on s390 - where VFs can show up isolated from their PF in OS
instances - showed symptoms of "mlx5_core: probe of 00e7:00:00.0 failed
with error -12" when booting a system with a large number (> 250) of
Mellanox Technologies ConnectX Family mlx5Gen Virtual Function
(15b3:101e) PCI functions.

Turns out that this is due to systemd-udev's time-out supervision of
"modprobe" killing the sequential initialization of additional functions
if probing exceeds a default of 180 seconds.

According to [1] device drivers could (slow ones should!) opt in to have
their probe step executed asynchronously - and interleaved. With
the mlx5_core device driver announcing PROBE_PREFER_ASYNCHRONOUS as
proposed by this patch, we've measured 275 VFs being probed successfully
in about 60 seconds.


Nice.

[1] https://www.kernel.org/doc/html/latest/driver-api/infrastructure.html

Signed-off-by: Gerd Bayer <gbayer@xxxxxxxxxxxxx>
---
Hi all,

this patch helps to speed up boot times when there is a large number
of Mellanox/NVidia VFs in a configuration. Although we've seen real
issues, I'm hesitating to declare this a fix of commit 9603b61de1ee
("mlx5: Move pci device handling from mlx5_ib to mlx5_core") primarily
because the concept of asynchronous probing with commit 765230b5f084
("driver-core: add asynchronous probing support for drivers") was
introduced only later.

Thanks,
Gerd Bayer
---

This is an interesting problem, and the proposed solution looks reasonable. That said, this is a very sensitive area and there may still be hidden assumptions or corner cases we haven't considered. This needs thorough testing across a wide range of real-world scenarios and different system topologies before we can be confident in it.

We'll take this for testing and report back once we have results.

BTW, as you probably know, a possible workaround is to increase the systemd-udev timeout.
What timeout is required for it to succeed without this change?
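
For reference, a minimal sketch of that workaround, assuming a stock systemd-udev setup. The value of 300 seconds is purely illustrative and not a measured requirement:

```
# /etc/udev/udev.conf
# Raise the per-event timeout above the 180 second default.
# 300 is an illustrative value, not a measured requirement.
event_timeout=300
```

The same setting can also be passed on the kernel command line as udev.event_timeout=300.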

drivers/net/ethernet/mellanox/mlx5/core/main.c | 3 +++
1 file changed, 3 insertions(+)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/main.c b/drivers/net/ethernet/mellanox/mlx5/core/main.c
index fdc3ba20912e4fbc53c65825c62e868996eff56d..b53fc3f2566acf5a07cb8df649124c4a87f3e043 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/main.c
@@ -2306,6 +2306,9 @@ static struct pci_driver mlx5_core_driver = {
 	.sriov_configure = mlx5_core_sriov_configure,
 	.sriov_get_vf_total_msix = mlx5_sriov_get_vf_total_msix,
 	.sriov_set_msix_vec_count = mlx5_core_sriov_set_msix_vec_count,
+	.driver = {
+		.probe_type = PROBE_PREFER_ASYNCHRONOUS,
+	}
 };
 
 /**

---
base-commit: c69855ada28656fdd7e197b6e24cd40a04fe14d3
change-id: 20260303-parprobe_mlx5-d10d2a746d3a

Best regards,