Re: [PATCH v2 00/20] vfio/pci: Add CXL Type-2 device passthrough support
From: Dan Williams
Date: Tue Apr 14 2026 - 00:09:41 EST
Forgive me if any of the commentary below was already hashed out in the
v1 discussion. Your excellent changelog notes make catching up much
easier, thanks!
mhonap@ wrote:
> From: Manish Honap <mhonap@xxxxxxxxxx>
>
> CXL Type-2 accelerators (e.g. CXL.mem-capable GPUs) cannot be passed
> through to virtual machines with stock vfio-pci because the driver has
> no concept of HDM decoder management, DPA region exposure, or component
> register emulation. This series wires all of that into vfio-pci-core
> behind a new CONFIG_VFIO_CXL_CORE optional module, without requiring a
> variant driver.
>
> When a CXL Device DVSEC (Vendor ID 0x1E98, ID 0x0000) is detected at
> device open time, the driver:
>
> - Probes the HDM Decoder Capability block in the component registers
> and allocates a DPA region through the CXL subsystem. On devices
> where firmware has already committed a decoder, the kernel skips
> allocation and re-uses the committed range.
>
> - Builds a kernel-owned shadow of the HDM register block. The VMM
> reads and writes this shadow through a dedicated COMP_REGS VFIO
> region rather than touching the hardware directly. The kernel
> enforces CXL 3.1 bit-field rules: reserved bits, read-only bits,
> the COMMIT/COMMITTED latch, and the LOCK→0 reprogram path for
> firmware-committed decoders.
>
> - Exposes the DPA range as a second VFIO region (VFIO_REGION_SUBTYPE_CXL)
> backed by the kernel-assigned HPA. PTEs are inserted lazily on first
> page fault and torn down atomically under memory_lock during FLR.
I assume, or hope, this means exposing a CXL region as
VFIO_REGION_SUBTYPE_CXL, as DPA is a device-internal address space that
VFIO probably does not need to worry about. VFIO likely only needs to
care about system-visible resources.
If / when interleaving arrives for CXL accelerators, the 1:1 vfio-pci
to DPA to CXL region HPA association breaks. Ok to assume 1:1 for now.
> - Intercepts writes to the CXL DVSEC configuration-space registers
> (Control, Status, Control2, Status2, Lock, Range Base) and replays
Range Base is ignored when global HDM Decoder Control is enabled. I
would hope that this enabling ditches CXL 1.x legacy wherever possible.
> them through a per-device vconfig shadow, enforcing RWL/RW1CS/RWO
> access semantics and the CONFIG_LOCK one-shot latch.
Linux should have no need to ever trigger CXL register bit locks. That
is only for firmware to make changes immutable if the firmware has
requirements that nothing moves for its own purposes.
Now, it makes sense to configure the vCXL device to be locked at setup,
but I do not currently see the use case for the vBIOS to mutate and lock
the configuration.
[..]
> - Includes selftests
Yay!
> covering device detection, capability parsing,
> region enumeration, HDM register emulation, DPA mmap with page-fault
> insertion, FLR invalidation, and DVSEC register emulation.
>
> The series is applied on top of the cxl/next branch using the base
> specified at the end of this cover letter plus Alejandro's v23 Type-2
> device support patches [1].
One of the sticking points of the accelerator series has been how many
details of the CXL core's internal object lifetimes leak out.
My hope / thought experiment is that the initial version of this
enabling only needs to facilitate getting a VMM-established CXL region
into a guest. With that, all VFIO needs is the CXL region HPA and the
MMIO layout, so that CXL registers can be trapped and non-CXL registers
can be direct-mapped.
> Series structure
> ================
>
> Patches 1-5 extend the CXL subsystem with the APIs vfio-pci needs.
>
> Patches 6-8 add the vfio-pci-core plumbing (UAPI, device state,
> Kconfig/build).
>
> Patches 9-15 implement the core device lifecycle: detection, HDM
> emulation, media readiness, region management, DPA region, and DVSEC
> emulation.
>
> Patches 16-18 wire everything together at open/close time and
> populate the VFIO ioctl paths.
>
> Patches 19-20 add documentation and selftests.
>
> Changes since v1
> ================
[..]
> HDM API simplification (patch 1)
>
> v1 exported cxl_get_hdm_reg_info() which returned a raw struct with
> offset and size fields. v2 replaces it with cxl_get_hdm_info() which
> uses the cached count already populated by cxl_probe_component_regs()
> and returns a single struct with all HDM metadata, removing the need
> for callers to re-read the hardware.
What is the accelerator use case to support multiple CXL regions per
device?
In other words, it feels ambitious to support that while simultaneously
kicking the "interleave" question down the road. If we are going for
initial simplicity, that also means a single region to start.
> cxl_await_range_active() split (patch 4)
>
> cxl_await_media_ready() requires a CXLMDEV mailbox register, which
> Type-2 accelerators may not have. v2 splits out cxl_await_range_active()
> so the HDM range-active poll can be used independently of the media
> ready path.
This feels like a detail vfio-pci does not need to worry about. The core
knows that the device does not have a mailbox, and the core knows it
needs to await range-active when probing the HDM decoders. Something is
broken if vfio-pci needs to duplicate this part of the setup.
> LOCK→0 transition in HDM ctrl write emulation (patch 11)
>
> v1 did not handle the case where a guest tries to clear the LOCK bit
> to reprogram a firmware-committed decoder. v2 allows this transition
> and re-programs the hardware accordingly.
? A guest has no ability to manipulate host HPA mappings. A protocol for
a guest to work with a host to remap HPA does not sound like a v1
requirement. This would be equivalent to a guest asking to move a host
PCI BAR.