Re: [PATCH 0/5] PCI/CXL: Save and restore CXL DVSEC and HDM state across resets

From: Alex Williamson

Date: Thu Apr 02 2026 - 17:02:47 EST


Hey Dan,

On Wed, 1 Apr 2026 18:12:19 -0700
Dan Williams <dan.j.williams@xxxxxxxxx> wrote:

> Alex Williamson wrote:
>
> Hey Alex, sorry for the lag in responding here...
>
> > On Tue, 17 Mar 2026 10:03:28 -0700
> > Dan Williams <dan.j.williams@xxxxxxxxx> wrote:
> >
> > > Manish Honap wrote:
> > > [..]
> > > > > The CXL accelerator series is currently contending with being able to
> > > > > restore device configuration after reset. I expect vfio-cxl to build on
> > > > > that, not push CXL flows into the PCI core.
> > > >
> > > > Hello Dan,
> > > >
> > > > My VFIO CXL Type-2 passthrough series [1] takes a position on this that I
> > > > would like to explain because I expect you will have similar concerns about
> > > > it and I'd rather have this conversation now.
> > > >
> > > > The Type-2 passthrough series takes the opposite structural approach to
> > > > the one you are suggesting here: CXL Type-2 support is an optional
> > > > extension compiled into vfio-pci-core (CONFIG_VFIO_CXL_CORE), not a
> > > > separate driver.
> > > >
> > > > Here is the reasoning:
> > > >
> > > > 1. Device enumeration
> > > > =====================
> > > >
> > > > CXL Type-2 devices (GPU + accelerator class) are enumerated as struct pci_dev
> > > > objects. The kernel discovers them through PCI config space scan, not through
> > > > the CXL bus. The CXL capability is advertised via the DVSEC (PCI_EXT_CAP_ID
> > > > 0x23, Vendor ID 0x1E98), which is PCI config space. There is no CXL bus
> > > > device to bind to.
> > > >
> > > > A standalone vfio-cxl driver would therefore need to match on the PCI device
> > > > just like vfio-pci does, and then call into vfio-pci-core for every PCI
> > > > concern: config space emulation, BAR region handling, MSI/MSI-X, INTx, DMA
> > > > mapping, FLR, and migration callbacks. That is the variant driver pattern
> > > > we rejected in favour of generic CXL passthrough. We have seen this exact
> > >
> > > Lore link for this "rejection" discussion?
> > >
> > > > outcome with the prior iterations of this series before we moved to the
> > > > enlightened vfio-pci model.
> > >
> > > I still do not understand the argument. CXL functionality is a library
> > > that PCI drivers can use.
> >
> [..]
> > If we were to make "vfio-cxl" as a vfio-pci variant driver, we'd need
> > to expand the ID table for specific devices, which becomes a
> > maintenance issue. Otherwise userspace would need to detect the CXL
> > capabilities and override the automatic driver aliases. We can't match
> > drivers based on DVSEC capabilities and we don't have any protocol to
> > define a "2nd best" match for a device alias if probe fails.
>
> I can see the argument, and why it makes sense to attempt this way
> first. Point conceded.
>
> Now a follow on concern is the plan to manage a case of "PCI operation
> is available, but CXL operation is not. Does the driver proceed?" Put
> another way, I immediately see how to convey the policy of "continue
> without CXL" when there is an explicit driver distinction, but it is
> ambiguous with an enlightened vfio-pci driver.

As an enlightenment to vfio-pci, CXL support must in all cases degrade
to PCI support. Manish's series proposes a new flag bit in the
DEVICE_INFO ioctl for CXL (type2 specifically) that would be used in
combination with the existing PCI flag. If both are set, it's a PCI
device with CXL.{mem,cache} capability, otherwise only PCI would be set.

> > > If vfio-pci functionality is also a library
> > > then vfio-cxl is a driver that uses services from both libraries. Where
> > > the module and driver name boundaries are drawn is more an organizational
> > > decision than a functional one.
> >
> > But as above, it is functional. Someone needs to define when to use
> > which driver, which leads to libvirt needing to specify whether a
> > device is being exposed as PCI or CXL, and the same understanding in
> > each VMM. OTOH, using vfio-pci as the basis and layering CXL feature
> > detection, ie. enlightenment, gives us a more compatible, incremental
> > approach.
>
> Ok, to make sure I understand the proposal: userspace still needs to
> end up with knowledge of CXL operation, but that need not be resolved by
> module policy.

It's a single module as far as userspace is concerned, and the decision
lies with userspace whether to take advantage of the CXL features
indicated by the device flag.

> Userspace also just needs to be ok with the unsightliness of the CXL
> modules autoloading on systems without CXL.

I'm open to suggestions here. The current proposal will pull in the CXL
modules regardless of whether a CXL device is present.

We could build vfio_cxl_core as a module with an automatic
MODULE_SOFTDEP in vfio_pci_core. We could then do a symbol_get around
CXL code so that we never CXL enlighten a device if the module isn't
loaded, allowing userspace policy control via modprobe.d blacklists.
We could also use a registration mechanism from vfio-cxl-core to
vfio-pci-core to avoid symbol_gets.

> > > The argument for vfio-cxl organizational independence is more about
> > > being able to tell at a diffstat level the relative PCI vs CXL
> > > maintenance impact / regression risk.
> >
> > But we still have that. CXL enlightenment for vfio-pci(-core) can
> > still be configured out and compartmentalized into separate helper
> > library code.
>
> Yes, modulo some of the proposal here to enlighten the PCI core with CXL
> specifics that I want to give more scrutiny.
>
> > > > 2. CXL-CORE involvement
> > > > =======================
> > > >
> > > > CXL type-2 passthrough series does not bypass CXL core. At vfio_pci_probe()
> > > > time the CXL enlightenment layer:
> > > >
> > > > - calls cxl_get_hdm_info() to probe the HDM Decoder Capability block,
> > > > - calls cxl_get_committed_decoder() to locate pre-committed firmware regions,
> > > > - calls cxl_create_region() / cxl_request_dpa() for dynamic allocation,
> > > > - creates a struct cxl_memdev via the CXL core (via cxl_probe_component_regs,
> > > > the same path Alejandro's v23 series uses).
> > > >
> > > > The CXL core is fully involved. The difference is that the binding to
> > > > userspace is still through vfio-pci, which already manages the pci_dev
> > > > lifecycle, reset sequencing, and VFIO region/irq API.
> > >
> > > Sure, every CXL driver in the system will do the same.
> > >
> > > > 3. Standalone vfio-cxl
> > > > ======================
> > > >
> > > > To match the model you are suggesting, vfio-cxl would need to:
> > > >
> > > > (a) Register a new driver on the CXL bus (struct cxl_driver), probing
> > > > struct cxl_memdev or a new struct cxl_endpoint,
> > >
> > > What, why? Just like this patch series was proposing extending the
> > > PCI core with additional common functionality, the proposal is to extend
> > > the CXL core object drivers with the same.
> >
> > I don't follow, what is the proposal?
>
> Implement features like CXL Reset as operations against CXL objects like
> memdevs and regions. For example, PCI reset does not consider management
> of cache coherent memory, and certainly not interleaved cache coherent
> memory. Other CXL drivers also benefit if these capabilities are
> centralized.

I think "CXL Reset as operations against CXL objects" is largely already
proposed in [1]. However, that series is specifically for type2 devices,
so we can ignore some of the complications of the type3 use case, such
as interleaved cache coherence.

[1] https://lore.kernel.org/all/20260306092322.148765-1-smadhavan@xxxxxxxxxx/

> > > > (b) Re-implement or delegate everything vfio-pci-core provides — config
> > > > space, BAR regions, IRQs, DMA, FLR, and VFIO container management —
> > >
> > > What is the argument against a library?
> >
> > vfio-pci-core is already a library, the extensions to support CXL as an
> > enlightenment of vfio-pci is also a library. The issue is that a
> > vfio-cxl PCI driver module presents more issues than simply code
> > organization.
>
> Understood. As I conceded above my concerns are complications that a
> vfio-cxl module does not solve cleanly.
>
> > > > (c) present to userspace through a new device model distinct from
> > > > vfio-pci.
> > >
> > > CXL is a distinct operational model. What breaks if userspace is
> > > required to explicitly account for CXL passthrough?
> >
> > The entire virtualization stack needs to gain an understanding of the
> > intended use case of the device rather than simply push a PCI device
> > with CXL capabilities out to the guest.
>
> Agree.
>
> > > > This is a significant new surface. QEMU's CXL passthrough support already
> > > > builds on vfio-pci: it receives the PCI device via VFIO, reads the
> > > > VFIO_DEVICE_INFO_CAP_CXL capability chain, and exposes the CXL topology.
> > > > A vfio-cxl object model would require non-trivial QEMU changes for something
> > > > that already works in the enlightened vfio-pci model.
> > >
> > > What specifically about a kernel code organization choice affects the
> > > QEMU implementation? A uAPI is kernel code organization agnostic.
> > >
> > > The concern is designing ourselves into a PCI corner when longterm QEMU
> > > benefits from understanding CXL objects. For example, CXL error handling
> > > / recovery is already well on its way to being performed in terms of CXL
> > > port objects.
> >
> > Are you suggesting that rather than using the PCI device as the basis
> > for assignment to a userspace driver or VM that we make each port
> > objects assignable and somehow collect them into configuration on top of
> > a PCI device? I don't think these port objects are isolated for such a
> > use case. I'd like to better understand how you envision this to work.
>
> No, simply that CXL operations relative to that assigned PCI device are
> serviced by the CXL core. The object to manage over reset is subject to
> CPU speculative reads and potentially interleave, I think it breaks the
> PCI expectations of local device scope operations.
>
> If CXL Reset in particular stays out of the PCI core it at least
> requires something CXL enlightened to be loaded, and at a minimum I do
> not think that "something CXL enlightened" should be the PCI core.
>
> There is a reason the CXL specification decided to block secondary bus
> reset by default.
>
> > The organization of the code in the kernel seems 90%+ the same whether
> > we enlighten vfio-pci to detect and expose CXL features or we create a
> > separate vfio-cxl PCI driver only for CXL devices, but the userspace
> > consequences are increased significantly.
>
> Agree.
>
> > > > 4. Module dependency
> > > > ====================
> > > >
> > > > Current solution: CONFIG_VFIO_CXL_CORE depends on CONFIG_CXL_BUS. We do not
> > > > add CXL knowledge to the PCI core;
> > >
> > > drivers/pci/cxl.c
> >
> > This is largely a consequence of CXL_BUS being a loadable module.
>
> Yes, the question is why does that matter for CXL enlightened operation?
> Simply do not burden the PCI core to learn all the CXL concerns.

How do we then proceed relative to save/restore of CXL state based on a
PCI reset? Should CXL core register a save/restore handler with PCI
core or does PCI core reach out for a symbol from CXL core to support
save/restore?

If CXL core is not loaded, are we ok with silently losing CXL state
across a PCI reset, ie. assume that state is unused currently and accept
the risk of losing preconfigured decoders?

Does PCI core need to be involved in suppressing SBR?

Thanks,
Alex