RE: [PATCH 09/20] vfio/cxl: Implement CXL device detection and HDM register probing

From: Manish Honap

Date: Wed Mar 18 2026 - 13:43:42 EST




> -----Original Message-----
> From: Jonathan Cameron <jonathan.cameron@xxxxxxxxxx>
> Sent: 13 March 2026 18:14
> To: Dave Jiang <dave.jiang@xxxxxxxxx>
> Cc: Manish Honap <mhonap@xxxxxxxxxx>; Aniket Agashe <aniketa@xxxxxxxxxx>;
> Ankit Agrawal <ankita@xxxxxxxxxx>; Alex Williamson
> <alwilliamson@xxxxxxxxxx>; Vikram Sethi <vsethi@xxxxxxxxxx>; Jason
> Gunthorpe <jgg@xxxxxxxxxx>; Matt Ochs <mochs@xxxxxxxxxx>; Shameer Kolothum
> Thodi <skolothumtho@xxxxxxxxxx>; alejandro.lucero-palau@xxxxxxx;
> dave@xxxxxxxxxxxx; alison.schofield@xxxxxxxxx; vishal.l.verma@xxxxxxxxx;
> ira.weiny@xxxxxxxxx; dan.j.williams@xxxxxxxxx; jgg@xxxxxxxx; Yishai Hadas
> <yishaih@xxxxxxxxxx>; kevin.tian@xxxxxxxxx; Neo Jia <cjia@xxxxxxxxxx>;
> Tarun Gupta (SW-GPU) <targupta@xxxxxxxxxx>; Zhi Wang <zhiw@xxxxxxxxxx>;
> Krishnakant Jaju <kjaju@xxxxxxxxxx>; linux-kernel@xxxxxxxxxxxxxxx; linux-
> cxl@xxxxxxxxxxxxxxx; kvm@xxxxxxxxxxxxxxx
> Subject: Re: [PATCH 09/20] vfio/cxl: Implement CXL device detection and
> HDM register probing
>
> External email: Use caution opening links or attachments
>
>
> On Thu, 12 Mar 2026 15:31:03 -0700
> Dave Jiang <dave.jiang@xxxxxxxxx> wrote:
>
> > On 3/11/26 1:34 PM, mhonap@xxxxxxxxxx wrote:
> > > From: Manish Honap <mhonap@xxxxxxxxxx>
> > >
> > > Implement the core CXL Type-2 device detection and component
> > > register probing logic in vfio_pci_cxl_detect_and_init().
> > >
> > > Three private helpers are introduced:
> > >
> > > vfio_cxl_create_device_state() allocates the per-device
> > > vfio_pci_cxl_state structure using devm_cxl_dev_state_create() so
> > > that lifetime is tied to the PCI device binding.
> > >
> > > vfio_cxl_find_bar() locates the PCI BAR that contains a given HPA
> > > range, returning the BAR index and offset within it.
> > >
> > > vfio_cxl_setup_regs() uses the CXL core helpers cxl_find_regblock()
> > > and cxl_probe_component_regs() to enumerate the HDM decoder register
> > > block, then records its BAR index, offset and size in the CXL state.
> > >
> > > vfio_pci_cxl_detect_and_init() orchestrates detection:
> > > 1. Check for CXL DVSEC via pcie_is_cxl() +
> pci_find_dvsec_capability().
> > > 2. Allocate CXL device state.
> > > 3. Temporarily call pci_enable_device_mem() for ioremap, then
> disable.
> > > 4. Probe component registers to find the HDM decoder block.
> > >
> > > On any failure vdev->cxl is devm_kfree'd so that device falls back
> > > to plain PCI mode transparently.
> > >
> > > Signed-off-by: Manish Honap <mhonap@xxxxxxxxxx>
> Given Dave didn't crop anything I'll just reply on top and avoid
> duplication.
>
> Mostly lifetime handling comments. I get nervous when devm occurs in the
> middle of non devm code. It needs to be done very carefully. I don't think
> you have a bug here, but I'm not keen on the resulting difference in order
> of setup and tear down. I'd like cleanup to tidy it up even though it
> would be safe to do later.
>

I have addressed these comments by creating a local variable before assigning
cxl to vdev->cxl.

>
> > > ---
> > > drivers/vfio/pci/cxl/vfio_cxl_core.c | 151
> +++++++++++++++++++++++++++
> > > drivers/vfio/pci/cxl/vfio_cxl_priv.h | 8 ++
> > > 2 files changed, 159 insertions(+)
> > >
> > > diff --git a/drivers/vfio/pci/cxl/vfio_cxl_core.c
> > > b/drivers/vfio/pci/cxl/vfio_cxl_core.c
> > > index 7698d94e16be..2da6da1c0605 100644
> > > --- a/drivers/vfio/pci/cxl/vfio_cxl_core.c
> > > +++ b/drivers/vfio/pci/cxl/vfio_cxl_core.c
> > > @@ -18,6 +18,114 @@
> > >
> > > MODULE_IMPORT_NS("CXL");
> > >
> > > +static int vfio_cxl_create_device_state(struct vfio_pci_core_device
> *vdev,
> > > + u16 dvsec) {
> > > + struct pci_dev *pdev = vdev->pdev;
> > > + struct device *dev = &pdev->dev;
> > > + struct vfio_pci_cxl_state *cxl;
> > > + bool cxl_mem_capable, is_cxl_type3;
> > > + u16 cap_word;
> > > +
> > > + /*
> > > + * The devm allocation for the CXL state remains for the entire
> time
> > > + * the PCI device is bound to vfio-pci. From successful CXL init
> > > + * in probe until the device is released on unbind.
> > > + * No extra explicit free is needed; devm handles it when
> > > + * pdev->dev is released.
> > > + */
> > > + vdev->cxl = devm_cxl_dev_state_create(dev,
>
> Rather than assigning this in here, I'd use a local variable for the
> return of this, operate on that and return it from the function.
> That both creates a clean separation and possibly make error handling
> simpler later...
>
> Also similar to some of the feedback Alejandro had on his type 2 series,
> be very careful with mixing devm calls and non devm, generally that's a
> path to hard to read and reason about code.
> I know you've stated this is fine because it's tied to the PCI device
> lifetime and you are probably right on that, but I'm not keen.
>
> This might be a case where manual unwinding of devres is needed (maybe
> using devres groups if we end up with a bunch of stuff to undo).
>
>
>
> > > + CXL_DEVTYPE_DEVMEM,
> > > + pdev->dev.id, dvsec,
> > > + struct vfio_pci_cxl_state,
> > > + cxlds, false);
> > > + if (!vdev->cxl)
> > > + return -ENOMEM;
> > > +
> > > + cxl = vdev->cxl;
> > > + cxl->dvsec = dvsec;
>
> That's a bit odd given we pass dvsec into devm_cxl_dev_state_create() why
> doesn't it assign it?

Yes, dvsec was redundant in cxl->dvsec and cxl->cxlds.cxl_dvsec
Removed it from cxl.

>
> > > +
> > > + pci_read_config_word(pdev, dvsec + CXL_DVSEC_CAPABILITY_OFFSET,
> > > + &cap_word);
> > > +
> > > + cxl_mem_capable = !!(cap_word & CXL_DVSEC_MEM_CAPABLE);
> > > + is_cxl_type3 = ((pdev->class >> 8) == PCI_CLASS_MEMORY_CXL);
> >
> > Both of these can use FIELD_GET().

I have used FIELD_GET for MEM_CAPABLE flag. For pdev->class, I see there are
other occurrences (drivers/pci/pcie/aer_cxl_rch.c) which use same check for detecting
memory class. So, I have kept the is_cxl_type3 check as it is.

> >
> > > +
> > > + /*
> > > + * Type 2 = CXL memory capable but NOT Type 3 (e.g.
> accelerator/GPU)
> > > + * Unsupported for non cxl type-2 class of devices.
>
> As in other places, type 3 doesn't mean that. What you mean is class code
> compliant type 3.

Updated the comments and checks.

>
> > > + */
> > > + if (!(cxl_mem_capable && !is_cxl_type3)) {
> > > + devm_kfree(&pdev->dev, vdev->cxl);
>
> As below. That needs a name that makes it clear it is the right devm call.
>
> > > + vdev->cxl = NULL;
> > > + return -ENODEV;
> > > + }
> > > +
> > > + return 0;
> > > +}
> > > +
> > > +static int vfio_cxl_setup_regs(struct vfio_pci_core_device *vdev) {
> > > + struct vfio_pci_cxl_state *cxl = vdev->cxl;
> > > + struct cxl_register_map *map = &cxl->cxlds.reg_map;
> > > + resource_size_t offset, bar_offset, size;
> > > + struct pci_dev *pdev = vdev->pdev;
> > > + void __iomem *base;
> > > + u32 count;
> > > + int ret;
> > > + u8 bar;
> > > +
> > > + if (WARN_ON_ONCE(!pci_is_enabled(pdev)))
> > > + return -EINVAL;
> > > +
> > > + /* Find component register block via Register Locator DVSEC */
> > > + ret = cxl_find_regblock(pdev, CXL_REGLOC_RBI_COMPONENT, map);
> > > + if (ret)
> > > + return ret;
> > > +
> > > + /* Temporarily map the register block */
> > > + base = ioremap(map->resource, map->max_size);
> >
> > Request the mem region before mapping it?

Yes, added the call to request region first before mapping.

> >
> > DJ
> >
> > > + if (!base)
> > > + return -ENOMEM;
> > > +
> > > + /* Probe component register capabilities */
> > > + cxl_probe_component_regs(&pdev->dev, base, &map->component_map);
> > > +
> > > + /* Unmap immediately */
> > > + iounmap(base);
> > > +
> > > + /* Check if HDM decoder was found */
> > > + if (!map->component_map.hdm_decoder.valid)
> > > + return -ENODEV;
> > > +
> > > + pci_dbg(pdev,
> > > + "vfio_cxl: HDM decoder at offset=0x%lx, size=0x%lx\n",
> > > + map->component_map.hdm_decoder.offset,
> > > + map->component_map.hdm_decoder.size);
> > > +
> > > + /* Get HDM register info */
> > > + ret = cxl_get_hdm_reg_info(&cxl->cxlds, &count, &offset, &size);
> > > + if (ret)
> > > + return ret;
> > > +
> > > + if (!count || !size)
> > > + return -ENODEV;
> > > +
> > > + cxl->hdm_count = count;
> > > + cxl->hdm_reg_offset = offset;
> > > + cxl->hdm_reg_size = size;
> > > +
> > > + ret = cxl_regblock_get_bar_info(map, &bar, &bar_offset);
> > > + if (ret)
> > > + return ret;
> > > +
> > > + cxl->comp_reg_bar = bar;
> > > + cxl->comp_reg_offset = bar_offset;
> > > + cxl->comp_reg_size = CXL_COMPONENT_REG_BLOCK_SIZE;
> > > +
> > > + return 0;
> > > +}
> > > +
> > > /**
> > > * vfio_pci_cxl_detect_and_init - Detect and initialize CXL Type-2
> device
> > > * @vdev: VFIO PCI device
> > > @@ -28,8 +136,51 @@ MODULE_IMPORT_NS("CXL");
> > > */
> > > void vfio_pci_cxl_detect_and_init(struct vfio_pci_core_device
> > > *vdev) {
> > > + struct pci_dev *pdev = vdev->pdev;
> > > + struct vfio_pci_cxl_state *cxl;
> > > + u16 dvsec;
> > > + int ret;
> > > +
> > > + if (!pcie_is_cxl(pdev))
> > > + return;
> > > +
> > > + dvsec = pci_find_dvsec_capability(pdev,
> > > + PCI_VENDOR_ID_CXL,
> > > + PCI_DVSEC_CXL_DEVICE);
> > > + if (!dvsec)
> > > + return;
> > > +
> > > + ret = vfio_cxl_create_device_state(vdev, dvsec);
> Suggestion above would lead to.
> cxl = vfio_cxl_create_device_state(vdev, dvsec);
> if (IS_ERR(cxl))
> return PTR_ERR(cxl); //assuming failing at this point is
> fatal.
>
> then only set vdev->cxl once you are sure this function succeeded.
> Thus removing the need to set it to NULL on failure.
> You need to pass it into a few more calls though.

Yes, I have added a local variable for this purpose.

>
>
> > > + if (ret)
> > > + return;
> > > +
> > > + cxl = vdev->cxl;
> > > +
> > > + /*
> > > + * Required for ioremap of the component register block and
> > > + * calls to cxl_probe_component_regs().
> > > + */
> > > + ret = pci_enable_device_mem(pdev);
> > > + if (ret)
> > > + goto failed;
> > > +
> > > + ret = vfio_cxl_setup_regs(vdev);
> > > + if (ret) {
> > > + pci_disable_device(pdev);
> > > + goto failed;
> > > + }
> > > +
> > > + pci_disable_device(pdev);
> > > +
> AS above, I think this would be cleaner if only here do we have
>
> vdev->cxl = cxl;
>
> > > + return;
> > > +
> > > +failed:
> > > + devm_kfree(&pdev->dev, vdev->cxl);
>
> If you get here you found a CXL device but couldn't handle it. Is it
> useful
> to continue? I'd suggest probably not. If returning an error then devm
> cleanup happens. Also, we should have a wrapper around devm_kfree to make
> it clear that it is valid to use it for undoing the
> devm_cxl_dev_state_create() (I haven't even checked it is valid)
>

Added a wrapper around devm_kfree.

> > > + vdev->cxl = NULL;
> > > }
> if /* __LINUX_VFIO_CXL_PRIV_H */
> >
> >