Re: [PATCH net-next v2 3/6] pds_core: add PLDM firmware update support via devlink flash

From: Jakub Kicinski

Date: Wed May 20 2026 - 19:54:21 EST

This is an AI-generated review of your patch. The human sending this
email has considered the AI review valid, or at least plausible.
---
pds_core: add PLDM firmware update support via devlink flash

This adds PLDM firmware update support to pds_core via the upstream
pldmfw API, allowing flashing of entire packages or individual
components through devlink dev flash.

> diff --git a/drivers/net/ethernet/amd/pds_core/fw.c b/drivers/net/ethernet/amd/pds_core/fw.c
> index fa626719e68d..f091a753bce9 100644
> --- a/drivers/net/ethernet/amd/pds_core/fw.c
> +++ b/drivers/net/ethernet/amd/pds_core/fw.c

[ ... ]

> +int pdsc_get_component_info(struct pdsc *pdsc)
> +{
> + union pds_core_dev_cmd cmd = {
> + .get_component_info.opcode = PDS_CORE_CMD_GET_COMPONENT_INFO,
> + .get_component_info.ver = 1,
> + };
> + struct pds_core_component_list_info *list_info;
> + union pds_core_dev_comp comp = {};
> + dma_addr_t dma_addr;
> + u8 num_components;
> + int err, i;
> +
> + list_info = kzalloc(PDS_PAGE_SIZE, GFP_KERNEL);
> + if (!list_info)
> + return -ENOMEM;
> +
> + dma_addr = dma_map_single(pdsc->dev, list_info, PDS_PAGE_SIZE,
> + DMA_FROM_DEVICE);
> + if (dma_mapping_error(pdsc->dev, dma_addr)) {
> + dev_err(pdsc->dev,
> + "Failed to dma_map component_list_info length %d\n",
> + PDS_PAGE_SIZE);
> + err = -ENOMEM;
> + goto out;
> + }
> +
> + cmd.get_component_info.data_len = cpu_to_le16(PDS_PAGE_SIZE);
> + cmd.get_component_info.data_pa = cpu_to_le64(dma_addr);
> +
> + err = pdsc_devcmd(pdsc, &cmd, &comp, pdsc->devcmd_timeout * 2);
> + dma_unmap_single(pdsc->dev, dma_addr, PDS_PAGE_SIZE, DMA_FROM_DEVICE);
> + if (err)
> + goto out;
> +
> + if (comp.get_component_info.ver == 0) {
> + /* Don't support backward compatibility as version 0 has
> + * alignment issues, so give a hint to users to update
> + * their firmware
> + */
> + dev_warn(pdsc->dev,
> + "Incompatible get_component_info version %u reported by firmware\n",
> + comp.get_component_info.ver);
> + err = 0;
> + goto out;
> + }

[Medium]
The comment states version 0 is not supported because of alignment
issues, but the code sets err = 0 and falls through with
pdsc->fw_components.num_components left at zero.

The caller pdsc_pldm_firmware_update() treats err == 0 as success and
calls pldmfw_flash_image() with no cached components, which leads to
a few user-visible side effects:

- When a user passes a specific component, pdsc_get_slot_id() iterates
an empty fw_components array and returns PDS_CORE_FW_SLOT_MAX, which
bubbles up as -ENOENT from pdsc_send_component_table() rather than a
"firmware too old" indication.

- For a full-image flash, pdsc_set_component_name() finds no matches,
so devlink_flash_update_status_notify() ends up reporting empty
component strings throughout the flash.

- pdsc_pldm_firmware_update() resets num_components = 0 at the end of
every flash, so the dev_warn and the same downstream paths recur on
every subsequent attempt.

Should this path return an error such as -EOPNOTSUPP so the documented
intent matches what callers see?

[ ... ]

> +static int pdsc_devcmd_send_component(struct pdsc *pdsc,
> + struct pds_core_flash_component *info,
> + u16 info_sz, dma_addr_t addr, u32 length,
> + u32 offset, u16 slot_id,
> + union pds_core_dev_comp *comp)
> +{
> + union pds_core_dev_cmd cmd = {
> + .send_component.opcode = PDS_CORE_CMD_SEND_COMPONENT,
> + .send_component.ver = 1,
> + .send_component.operation = PDS_CORE_SEND_COMPONENT_START,
> + .send_component.data_pa = cpu_to_le64(addr),
> + .send_component.data_len = cpu_to_le32(length),
> + .send_component.offset = cpu_to_le32(offset),
> + .send_component.slot_id = slot_id,
> + };
> + unsigned long timeout = 300 * HZ;
> + unsigned long start_time;
> + unsigned long end_time;
> + int err;
> +
> + start_time = jiffies;
> + end_time = start_time + timeout;
> + do {
> + /* prevent noisy/benign devcmd failures */
> + err = pdsc_devcmd_with_data_nomsg(pdsc, &cmd, info, info_sz,
> + comp, 60);
> + if (err != -EAGAIN)
> + break;
> +
> + /* if required, subsequent commands check status of
> + * PDS_CORE_CMD_SEND_COMPONENT command, which returns
> + * EAGAIN while the command is still running,
> + * else we get the final command status.
> + */
> + cmd.send_component.operation = PDS_CORE_SEND_COMPONENT_STATUS;
> + msleep(20);
> + } while (time_before(jiffies, end_time));
> +
> + if (err == -EAGAIN || err == -ETIMEDOUT)
> + dev_err(pdsc->dev, "PDS_CORE_CMD_SEND_COMPONENT timed out\n");
> +
> + return err;
> +}

[ ... ]

> +static int pdsc_flash_component(struct pldmfw *context,
> + struct pldmfw_component *component)
> +{

[ ... ]

> + err = pdsc_devcmd_send_component(pdsc, component_info, info_sz,
> + dma_addr, copy_sz, offset,
> + slot_id, &comp);
> + dma_unmap_single(dev, dma_addr, copy_sz, DMA_TO_DEVICE);
> + kfree(component_data);

[High]
On the -EAGAIN or -ETIMEDOUT exit from pdsc_devcmd_send_component(),
can the device still be DMA-reading from this buffer when the unmap
and kfree happen here?

The polling loop in pdsc_devcmd_send_component() runs up to 300 seconds
and the only exits when the device is still busy are -EAGAIN (the
status query explicitly says START has not completed) or -ETIMEDOUT.
No abort or cancel opcode is issued, and the wire format documented
for struct pds_core_send_component_cmd hands data_pa to the device on
the START call.

Looking at __pdsc_devcmd_locked() in pds_core/dev.c:

if ((err == -ENXIO || err == -ETIMEDOUT) && pdsc->wq)
queue_work(pdsc->wq, &pdsc->health_work);

-EAGAIN does not even trigger health_work, so the device is not
treated as hung in that case.

Once control returns to pdsc_flash_component():

err = pdsc_devcmd_send_component(...);
dma_unmap_single(dev, dma_addr, copy_sz, DMA_TO_DEVICE);
kfree(component_data);

If the firmware finishes the flash a moment later and reads from
data_pa, on IOMMU systems the unmap can race the DMA and produce
IOMMU faults; on non-IOMMU systems the device can DMA-read from
freed slab memory belonging to a later allocation.

Should this path leak the buffer, reset the device, or send an abort
opcode to the firmware before unmapping and freeing on the busy
timeout exits?

[ ... ]