Re: [PATCH net-next V2 2/2] net/mlx5: implement max_sfs parameter

From: Nikolay Aleksandrov

Date: Mon May 25 2026 - 02:47:01 EST


On 22/05/2026 21:19, Jakub Kicinski wrote:
This is an AI-generated review of your patch. The human sending this
email has considered the AI review valid, or at least plausible.
---
net/mlx5: implement max_sfs parameter

This patch adds a generic devlink permanent parameter named max_sfs to
the mlx5 driver so users can cap the number of light-weight NIC
subfunctions via devlink instead of vendor tooling. The setter rewrites
non-volatile firmware fields in nv_global_pci_conf and nv_pf_pci_conf,
and a reboot is required for the change to take effect.

A few questions on the implementation below.

diff --git a/Documentation/networking/devlink/mlx5.rst b/Documentation/networking/devlink/mlx5.rst
index 4bba4d780a4a..f5e2dccafa5a 100644
--- a/Documentation/networking/devlink/mlx5.rst
+++ b/Documentation/networking/devlink/mlx5.rst
@@ -45,8 +45,13 @@ Parameters
- The range is between 1 and a device-specific max.
- Applies to each physical function (PF) independently, if the device
supports it. Otherwise, it applies symmetrically to all PFs.
+ * - ``max_sfs``
+ - permanent
+ - The range is between 0 and a device-specific max.
+ - Applies to each physical function (PF) independently.

[Medium, Medium]
The documentation states the range is "between 0 and a device-specific
max", and the entry says max_sfs "Applies to each physical function (PF)
independently" with no "if the device supports it" qualifier like
total_vfs has.

Looking at the implementation in mlx5_devlink_max_sfs_set(), there is no
read of any nv_global_pci_cap field to obtain a device-reported maximum,
and the registration uses NULL for the validate callback:

DEVLINK_PARAM_GENERIC(MAX_SFS, BIT(DEVLINK_PARAM_CMODE_PERMANENT),
mlx5_devlink_max_sfs_get,
mlx5_devlink_max_sfs_set, NULL),

Should there be a validate callback analogous to
mlx5_devlink_total_vfs_validate() that reads the cap and rejects values
above the device-reported maximum, so the documented "device-specific
max" is actually enforced?


I don't know of such cap, if there was I'd have added it.

Should the doc also include the "if the device supports it" qualifier
to match how the setter behaves on devices without the relevant
capability?

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/lib/nv_param.c b/drivers/net/ethernet/mellanox/mlx5/core/lib/nv_param.c
index 19bb620b7436..eff3a67e4ca0 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/lib/nv_param.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/lib/nv_param.c
@@ -68,7 +68,9 @@ struct mlx5_ifc_mnvda_reg_bits {
struct mlx5_ifc_nv_global_pci_conf_bits {
u8 sriov_valid[0x1];
- u8 reserved_at_1[0x10];
+ u8 reserved_at_1[0xa];
+ u8 per_pf_num_sf[0x1];
+ u8 reserved_at_c[0x5];
u8 per_pf_total_vf[0x1];
u8 reserved_at_12[0xe];

[ ... ]

@@ -93,9 +95,11 @@ struct mlx5_ifc_nv_global_pci_cap_bits {
};
struct mlx5_ifc_nv_pf_pci_conf_bits {
- u8 reserved_at_0[0x9];
+ u8 log_sf_bar_size[0x8];
+ u8 pf_total_sf_en[0x1];
u8 pf_total_vf_en[0x1];
- u8 reserved_at_a[0x16];
+ u8 reserved_at_a[0x6];
+ u8 total_sf[0x10];
u8 reserved_at_20[0x20];

[Medium]
The on-wire field total_sf is 16 bits wide, but the devlink parameter
is registered as DEVLINK_PARAM_TYPE_U32 (max_sfs uses
DEVLINK_PARAM_GENERIC_MAX_SFS_TYPE, which is u32) and the setter writes
ctx->val.vu32 directly via MLX5_SET(nv_pf_pci_conf, data, total_sf,
ctx->val.vu32).

MLX5_SET masks the value with __mlx5_mask(typ, fld) before storing, so
any user-supplied value above 0xffff is silently truncated. For
example, "devlink dev param set ... name max_sfs value 70000" would
persistently store 4464.

Without a validate callback, is there anything that prevents this
silent truncation, and should the setter or a validate hook reject
values that do not fit in 16 bits?


This is benign but I'll cap it at 16 bits for the next version.

@@ -755,6 +759,76 @@ static int mlx5_devlink_total_vfs_validate(struct devlink *devlink, u32 id,
return 0;
}
+static int mlx5_devlink_max_sfs_get(struct devlink *devlink, u32 id,
+ struct devlink_param_gset_ctx *ctx,
+ struct netlink_ext_ack *extack)
+{
+ struct mlx5_core_dev *dev = devlink_priv(devlink);
+ u32 mnvda[MLX5_ST_SZ_DW(mnvda_reg)] = {};
+ void *data;
+ int err;
+
+ err = mlx5_nv_param_read_per_host_pf_conf(dev, mnvda, sizeof(mnvda));
+ if (err) {
+ NL_SET_ERR_MSG_MOD(extack, "Failed to read PF configuration");
+ return err;
+ }
+
+ data = MLX5_ADDR_OF(mnvda_reg, mnvda, configuration_item_data);
+ ctx->val.vu32 = MLX5_GET(nv_pf_pci_conf, data, total_sf);
+
+ return 0;
+}

[Medium]
Should max_sfs_get() mirror mlx5_devlink_total_vfs_get() and consult
nv_global_pci_conf.per_pf_num_sf before reporting nv_pf_pci_conf.total_sf?

The setter explicitly conditions the meaning of per-PF total_sf on
per_pf_num_sf:

MLX5_SET(nv_global_pci_conf, data, per_pf_num_sf, !!ctx->val.vu32);
...
MLX5_SET(nv_pf_pci_conf, data, total_sf, ctx->val.vu32);

But the getter reads total_sf unconditionally, so on a device whose
firmware was previously configured by another tool with per_pf_num_sf=0
and a stale per-host total_sf, "devlink dev param show" would return a
non-zero value that does not represent the active SF policy. Is that
intentional?


No.

+static int mlx5_devlink_max_sfs_set(struct devlink *devlink, u32 id,
+ struct devlink_param_gset_ctx *ctx,
+ struct netlink_ext_ack *extack)
+{
+ struct mlx5_core_dev *dev = devlink_priv(devlink);
+ u32 mnvda[MLX5_ST_SZ_DW(mnvda_reg)] = {};
+ void *data;
+ int err;
+
+ err = mlx5_nv_param_read_global_pci_conf(dev, mnvda, sizeof(mnvda));
+ if (err) {
+ NL_SET_ERR_MSG_MOD(extack,
+ "Failed to read global PCI configuration");
+ return err;
+ }
+
+ data = MLX5_ADDR_OF(mnvda_reg, mnvda, configuration_item_data);
+ MLX5_SET(nv_global_pci_conf, data, per_pf_num_sf, !!ctx->val.vu32);
+
+ err = mlx5_nv_param_write(dev, mnvda, sizeof(mnvda));
+ if (err) {
+ NL_SET_ERR_MSG_MOD(extack,
+ "Failed to change per_pf_num_sf global PCI configuration");
+ return err;
+ }

[Medium]
Should this path first read nv_global_pci_cap and reject the operation
when the device does not advertise SF NV configuration, the way
mlx5_devlink_total_vfs_set() does for SR-IOV?

mlx5_devlink_total_vfs_set() rejects with -EOPNOTSUPP and an extack
"SRIOV is not per PF on this device" when sriov_support or
per_pf_total_vf_supported is clear, but no equivalent
per_pf_num_sf_supported (or any SF-related) capability bit is added to
nv_global_pci_cap_bits or queried here. On hardware that lacks the
feature, the user only sees the generic firmware error "Failed to
change ... global PCI configuration".


I don't know of such bit, if there was such bit I'd have added it.

[Medium]
At this point, a successful write of per_pf_num_sf has already been
committed to non-volatile firmware storage. If the subsequent
mlx5_nv_param_read_per_host_pf_conf() or the second mlx5_nv_param_write()
fails (for example, transient firmware/PCIe issue), is there a path that
rolls back per_pf_num_sf?

Because these are permanent parameters that "require a reboot to take
effect", a half-applied state (e.g., per_pf_num_sf=1 but
pf_total_sf_en=0/total_sf=0) appears to persist across reboots until
the user issues another successful set.


sure, that is expected

+
+ memset(mnvda, 0, sizeof(mnvda));
+ err = mlx5_nv_param_read_per_host_pf_conf(dev, mnvda, sizeof(mnvda));
+ if (err) {
+ NL_SET_ERR_MSG_MOD(extack, "Failed to read PF configuration");
+ return err;
+ }
+
+ data = MLX5_ADDR_OF(mnvda_reg, mnvda, configuration_item_data);
+ MLX5_SET(nv_pf_pci_conf, data, log_sf_bar_size, ctx->val.vu32 ? 12 : 0);

[Low]
What is the meaning of the literal 12 here, and why is it correct on
all supported devices?


It is a sane default, some day we can make it configurable as well.

Elsewhere in the driver, the SF BAR length is computed from a
hardware-reported capability rather than a fixed value:

drivers/net/ethernet/mellanox/mlx5/core/sf/dev/dev.c:
sf_bar_length = 1 << (MLX5_CAP_GEN(dev, log_min_sf_size) + 12);

Should log_sf_bar_size be derived from MLX5_CAP_GEN(dev,
log_min_sf_size) (or another cap) rather than hardcoded?


Interesting, I'll check this out.

The commit message says max_sfs is to "control the total light-weight
NIC subfunctions"; the BAR-size side-effect is not mentioned, and any
previously configured log_sf_bar_size is overwritten on every
max_sfs set. Should that behavior be documented or split out from the
count knob?


It should be documented when it is split out, at present time - no.

+ MLX5_SET(nv_pf_pci_conf, data, pf_total_sf_en, !!ctx->val.vu32);
+ MLX5_SET(nv_pf_pci_conf, data, total_sf, ctx->val.vu32);
+
+ err = mlx5_nv_param_write(dev, mnvda, sizeof(mnvda));
+ if (err) {
+ NL_SET_ERR_MSG_MOD(extack,
+ "Failed to change PF PCI configuration");
+ return err;
+ }
+ NL_SET_ERR_MSG(extack, "Modifying max_sfs requires a reboot");
+
+ return 0;
+}