[PATCH 18/20] docs: vfio-pci: Document CXL Type-2 device passthrough

From: mhonap

Date: Wed Mar 11 2026 - 16:41:12 EST


From: Manish Honap <mhonap@xxxxxxxxxx>

Add a driver-api document describing the architecture, interfaces, and
operational constraints of CXL Type-2 device passthrough via vfio-pci-core.

CXL Type-2 devices (cache-coherent accelerators such as GPUs with attached
device memory) present unique passthrough requirements not covered by the
existing vfio-pci documentation:

- The host kernel retains ownership of the HDM decoder hardware through
  the CXL subsystem, so the guest cannot program decoders directly.
- Two additional VFIO device regions expose the emulated HDM register
  state (COMP_REGS) and the DPA memory window (DPA region) to userspace.
- DVSEC configuration space writes are intercepted and virtualized so
  that the guest cannot alter host-owned CXL.io / CXL.mem enable bits.
- Device reset (FLR) is coordinated through vfio_pci_ioctl_reset(): all
  DPA PTEs are zapped before the reset and restored afterward.

Signed-off-by: Manish Honap <mhonap@xxxxxxxxxx>
---
Documentation/driver-api/index.rst | 1 +
Documentation/driver-api/vfio-pci-cxl.rst | 216 ++++++++++++++++++++++
2 files changed, 217 insertions(+)
create mode 100644 Documentation/driver-api/vfio-pci-cxl.rst

diff --git a/Documentation/driver-api/index.rst b/Documentation/driver-api/index.rst
index 1833e6a0687e..7ec661846f6b 100644
--- a/Documentation/driver-api/index.rst
+++ b/Documentation/driver-api/index.rst
@@ -47,6 +47,7 @@ of interest to most developers working on device drivers.
vfio-mediated-device
vfio
vfio-pci-device-specific-driver-acceptance
+ vfio-pci-cxl

Bus-level documentation
=======================
diff --git a/Documentation/driver-api/vfio-pci-cxl.rst b/Documentation/driver-api/vfio-pci-cxl.rst
new file mode 100644
index 000000000000..f2cbe2fdb036
--- /dev/null
+++ b/Documentation/driver-api/vfio-pci-cxl.rst
@@ -0,0 +1,216 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+====================================================
+VFIO PCI CXL Type-2 Device Passthrough
+====================================================
+
+Overview
+--------
+
+CXL (Compute Express Link) Type-2 devices are cache-coherent PCIe accelerators
+and GPUs that attach their own volatile memory (Device Physical Address space,
+or DPA) to the host memory fabric via the CXL protocol. Examples include
+GPU/accelerator cards that expose coherent device memory to the host.
+
+When such a device is passed through to a virtual machine using ``vfio-pci``,
+the kernel CXL subsystem must remain in control of the Host-managed Device
+Memory (HDM) decoders that map the device's DPA into the host physical address
+(HPA) space. A VMM such as QEMU cannot program HDM decoders directly; instead
+it uses a set of VFIO-specific regions and UAPI extensions described here.
+
+This support is compiled in when ``CONFIG_VFIO_CXL_CORE=y``. It can be
+disabled at module load time for all devices bound to ``vfio-pci`` with::
+
+ modprobe vfio-pci disable_cxl=1
+
+Variant drivers can disable CXL extensions for individual devices by setting
+``vdev->disable_cxl = true`` in their probe function before registration.
+
+Device Detection
+----------------
+
+CXL Type-2 detection happens automatically when ``vfio-pci`` registers a
+device that has:
+
+1. A CXL Device DVSEC capability (DVSEC Vendor ID 0x1E98, DVSEC ID 0x0000).
+2. Bit 2 (Mem_Capable) set in the CXL Capability register within that DVSEC.
+3. A PCI class code that is **not** ``0x050210`` (CXL Type-3 memory device).
+4. An HDM Decoder block discoverable via the Register Locator DVSEC.
+5. A pre-committed HDM decoder (BIOS/firmware programmed) with non-zero size.
+
+On successful detection ``VFIO_DEVICE_FLAGS_CXL`` is set in
+``vfio_device_info.flags`` alongside ``VFIO_DEVICE_FLAGS_PCI``.
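
The five criteria above can be read as a single predicate. The following is a minimal sketch, not the code from this series: ``struct cxl_probe_snapshot`` and ``is_cxl_type2_candidate()`` are hypothetical names, and the sketch assumes the caller has already read the relevant registers and located the HDM block.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define CXL_DVSEC_VENDOR_ID    0x1E98u    /* criterion 1: DVSEC Vendor ID */
#define CXL_DVSEC_ID_DEVICE    0x0000u    /* criterion 1: DVSEC ID */
#define CXL_CAP_MEM_CAPABLE    (1u << 2)  /* criterion 2: Mem_Capable */
#define CXL_CLASS_TYPE3_MEMDEV 0x050210u  /* criterion 3: Type-3 class code */

/* Hypothetical snapshot of the values the five checks need. */
struct cxl_probe_snapshot {
    uint16_t dvsec_vendor;     /* vendor ID from the DVSEC header */
    uint16_t dvsec_id;         /* DVSEC ID from the DVSEC header */
    uint16_t cxl_capability;   /* CXL Capability register */
    uint32_t class_code;       /* 24-bit PCI class code */
    bool     hdm_block_found;  /* criterion 4: HDM block located */
    uint64_t committed_size;   /* criterion 5: pre-committed size, 0 if none */
};

/* Returns true only when all five documented criteria hold. */
static bool is_cxl_type2_candidate(const struct cxl_probe_snapshot *s)
{
    if (s->dvsec_vendor != CXL_DVSEC_VENDOR_ID ||
        s->dvsec_id != CXL_DVSEC_ID_DEVICE)
        return false;
    if (!(s->cxl_capability & CXL_CAP_MEM_CAPABLE))
        return false;
    if ((s->class_code & 0xFFFFFFu) == CXL_CLASS_TYPE3_MEMDEV)
        return false;
    if (!s->hdm_block_found)
        return false;
    return s->committed_size != 0;
}
```

A device that is Mem_Capable but reports class code ``0x050210`` is rejected here, matching criterion 3.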
+
+UAPI Extensions
+---------------
+
+VFIO_DEVICE_GET_INFO Capability: VFIO_DEVICE_INFO_CAP_CXL
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+When ``VFIO_DEVICE_FLAGS_CXL`` is set the device info capability chain
+contains a ``vfio_device_info_cap_cxl`` structure (cap ID 6)::
+
+    struct vfio_device_info_cap_cxl {
+        struct vfio_info_cap_header header; /* id=6, version=1 */
+        __u8  hdm_count;              /* number of HDM decoders */
+        __u8  hdm_regs_bar_index;     /* PCI BAR containing component registers */
+        __u16 pad;
+        __u32 flags;                  /* VFIO_CXL_CAP_* flags */
+        __u64 hdm_regs_size;          /* size in bytes of the HDM decoder block */
+        __u64 hdm_regs_offset;        /* byte offset within the BAR to HDM block */
+        __u64 dpa_size;               /* total DPA size in bytes */
+        __u32 dpa_region_index;       /* index of the DPA device region */
+        __u32 comp_regs_region_index; /* index of the COMP_REGS device region */
+    };
+
+Flags:
+
+``VFIO_CXL_CAP_COMMITTED`` (bit 0)
+ The HDM decoder was committed by the kernel CXL subsystem.
+
+``VFIO_CXL_CAP_PRECOMMITTED`` (bit 1)
+ The HDM decoder was pre-committed by host firmware/BIOS. The VMM does
+ not need to allocate CXL HPA space; the mapping is already live.
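
Locating the capability means walking the ``VFIO_DEVICE_GET_INFO`` capability chain, where each header's ``next`` field is an offset from the start of the info buffer. The sketch below is illustrative only: ``struct cap_header`` mirrors the layout of ``struct vfio_info_cap_header`` so the example compiles without kernel headers, and ``find_cap()`` is a hypothetical helper, not part of any VFIO library.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Same layout as struct vfio_info_cap_header in <linux/vfio.h>. */
struct cap_header {
    uint16_t id;
    uint16_t version;
    uint32_t next;   /* offset of the next capability, 0 terminates */
};

#define CAP_ID_CXL 6 /* cap ID documented above */

/*
 * Walk the chain starting at offset 'first' (vfio_device_info.cap_offset).
 * Returns the offset of the capability with the given id, or 0 if absent.
 * The hop limit guards against a malformed, looping chain.
 */
static uint32_t find_cap(const uint8_t *info, size_t info_len,
                         uint32_t first, uint16_t id)
{
    size_t max_hops = info_len / sizeof(struct cap_header);
    uint32_t off = first;
    size_t hops = 0;

    while (off != 0 && hops++ < max_hops) {
        struct cap_header h;

        if (off > info_len || info_len - off < sizeof(h))
            return 0;               /* header would run past the buffer */
        memcpy(&h, info + off, sizeof(h));
        if (h.id == id)
            return off;
        off = h.next;
    }
    return 0;
}
```

A caller would pass the buffer returned by ``VFIO_DEVICE_GET_INFO`` and, on a non-zero result, interpret the bytes at that offset as the CXL capability structure.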
+
+VFIO Regions
+~~~~~~~~~~~~~
+
+A CXL Type-2 device exposes two additional device regions beyond the standard
+PCI BAR regions. Their indices are reported in ``dpa_region_index`` and
+``comp_regs_region_index`` in the capability structure.
+
+**DPA Region** (subtype ``VFIO_REGION_SUBTYPE_CXL``)
+ Flags: ``VFIO_REGION_INFO_FLAG_READ | VFIO_REGION_INFO_FLAG_WRITE |
+ VFIO_REGION_INFO_FLAG_MMAP``
+
+ Represents the device's DPA memory mapped at the kernel-assigned HPA.
+ The VMM should map this region with ``mmap()`` to expose device memory to
+ the guest. Page faults are handled lazily: the kernel inserts PFNs on first
+ access rather than at ``mmap()`` time. During FLR/reset all PTEs are
+ invalidated and the region is inaccessible until the reset completes.
+
+ Read and write access via the region file descriptor is also supported and
+ routes through a kernel-managed virtual address established with
+ ``ioremap_cache()``.
+
+**COMP_REGS Region** (subtype ``VFIO_REGION_SUBTYPE_CXL_COMP_REGS``)
+ Flags: ``VFIO_REGION_INFO_FLAG_READ | VFIO_REGION_INFO_FLAG_WRITE``
+ (no mmap).
+
+ An emulated region, accessible only through ``read()`` and ``write()``,
+ that exposes the HDM decoder registers. The kernel shadows the hardware
+ HDM register state and enforces all bit-field rules (reserved bits,
+ read-only bits, commit semantics) on every write. Only 32-bit aligned,
+ 32-bit wide accesses are permitted, matching the hardware requirement.
+
+ The VMM uses this region to read and write HDM decoder BASE, SIZE, and
+ CTRL registers. Setting the COMMIT bit (bit 9) in a CTRL register causes
+ the kernel to immediately set the COMMITTED bit (bit 10) in the emulated
+ shadow state, allowing the VMM to detect the transition via a
+ ``notify_change`` callback.
+
+ The component register BAR itself (``hdm_regs_bar_index``) is hidden:
+ ``VFIO_DEVICE_GET_REGION_INFO`` for that BAR index returns ``size = 0``.
+ All HDM access must go through the COMP_REGS region.
+
+Region Type Identifiers::
+
+ /* type = PCI_VENDOR_ID_CXL | VFIO_REGION_TYPE_PCI_VENDOR_TYPE (0x80001e98) */
+ #define VFIO_REGION_SUBTYPE_CXL 1 /* DPA memory region */
+ #define VFIO_REGION_SUBTYPE_CXL_COMP_REGS 2 /* HDM register region */
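
The COMP_REGS access rule and the COMMIT to COMMITTED behaviour described above can be modelled in a few lines. This is a simplified sketch under stated assumptions, not the kernel's emulation: the helper names are hypothetical, and having COMMITTED simply follow COMMIT on every write (including dropping it when COMMIT is cleared) is an assumption of the sketch. The region type macro reproduces the arithmetic from the comment above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define REGION_TYPE_PCI_VENDOR_TYPE (1u << 31) /* VFIO_REGION_TYPE_PCI_VENDOR_TYPE */
#define CXL_VENDOR_ID               0x1e98u
#define CXL_REGION_TYPE (REGION_TYPE_PCI_VENDOR_TYPE | CXL_VENDOR_ID)

#define HDM_CTRL_COMMIT    (1u << 9)   /* written by the guest/VMM */
#define HDM_CTRL_COMMITTED (1u << 10)  /* set by the emulation only */

/* Accept only 32-bit wide, 32-bit aligned accesses inside the region. */
static bool comp_regs_access_ok(uint64_t offset, size_t len,
                                uint64_t region_size)
{
    return len == 4 && (offset & 3) == 0 &&
           region_size >= 4 && offset <= region_size - 4;
}

/*
 * Compute the new shadow value of a CTRL register after a write.
 * Software cannot set COMMITTED directly; it mirrors the COMMIT bit.
 */
static uint32_t hdm_ctrl_after_write(uint32_t val)
{
    uint32_t next = val & ~HDM_CTRL_COMMITTED;

    if (val & HDM_CTRL_COMMIT)
        next |= HDM_CTRL_COMMITTED;
    return next;
}
```

A VMM that writes COMMIT and then reads the same CTRL register back would, in this model, observe bit 10 set on the next read.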
+
+DVSEC Configuration Space Emulation
+-------------------------------------
+
+When ``CONFIG_VFIO_CXL_CORE=y`` the kernel installs a CXL-aware write handler
+for the ``PCI_EXT_CAP_ID_DVSEC`` (0x23) extended capability entry in the vfio-pci
+configuration space permission table. This handler runs for every device
+opened under ``vfio-pci``; for non-CXL devices it falls through to the
+hardware write path unchanged.
+
+For CXL devices, writes to the following DVSEC registers are intercepted and
+emulated in ``vdev->vconfig`` (the per-device shadow configuration space):
+
++--------------------+--------+-------------------------------------------+
+| Register | Offset | Emulation |
++====================+========+===========================================+
+| CXL Control | 0x0c | RWL semantics; IO_Enable forced to 1; |
+| | | locked after Lock register bit 0 is set. |
++--------------------+--------+-------------------------------------------+
+| CXL Status | 0x0e | Bit 14 (Viral_Status) is RW1CS. |
++--------------------+--------+-------------------------------------------+
+| CXL Control2 | 0x10 | Bits 0, 3 forwarded to hardware; bits |
+| | | 1 and 2 trigger subsystem actions. |
++--------------------+--------+-------------------------------------------+
+| CXL Status2 | 0x12 | Bit 3 (RW1CS) forwarded to hardware when |
+| | | Capability3 bit 3 is set. |
++--------------------+--------+-------------------------------------------+
+| CXL Lock | 0x14 | RWO; once set, Control becomes read-only |
+| | | until conventional reset. |
++--------------------+--------+-------------------------------------------+
+| Range Base High/Lo | varies | Stored in vconfig; Base Low [27:0] |
+| | | reserved bits cleared. |
++--------------------+--------+-------------------------------------------+
+
+Reads of these registers return the emulated vconfig values. Read-only
+registers (Capability, Size registers, range Size High/Low) are also served
+from vconfig, which was seeded from hardware at device open time.
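
To make the table's access types concrete, here is a small model of three of the rows: the Lock register as a one-shot latch, Control as lockable with IO_Enable forced on, and Viral_Status as write-1-to-clear. It is a sketch only. ``struct dvsec_shadow`` and the ``shadow_write_*()`` helpers are hypothetical, the IO_Enable bit position (bit 1) is taken from the CXL specification as an assumption, and treating every Control bit as lockable is a simplification.

```c
#include <assert.h>
#include <stdint.h>

#define CXL_CTRL_IO_ENABLE   (1u << 1)   /* assumed position of IO_Enable */
#define CXL_STATUS_VIRAL     (1u << 14)  /* Viral_Status, RW1CS */
#define CXL_LOCK_CONFIG_LOCK (1u << 0)   /* Lock register bit 0 */

/* Hypothetical stand-in for the relevant slice of vdev->vconfig. */
struct dvsec_shadow {
    uint16_t control;
    uint16_t status;
    uint16_t lock;
};

/* RWL: writes are dropped once locked; IO_Enable always reads as 1. */
static void shadow_write_control(struct dvsec_shadow *s, uint16_t val)
{
    if (s->lock & CXL_LOCK_CONFIG_LOCK)
        return;
    s->control = (uint16_t)(val | CXL_CTRL_IO_ENABLE);
}

/* RW1CS: writing 1 to Viral_Status clears it, writing 0 leaves it alone. */
static void shadow_write_status(struct dvsec_shadow *s, uint16_t val)
{
    s->status = (uint16_t)(s->status & ~(val & CXL_STATUS_VIRAL));
}

/* RWO: the lock bit can be set but never cleared by a later write. */
static void shadow_write_lock(struct dvsec_shadow *s, uint16_t val)
{
    s->lock = (uint16_t)(s->lock | (val & CXL_LOCK_CONFIG_LOCK));
}
```

In this model, a guest that sets the lock bit and then tries to clear Mem_Enable in Control sees its write ignored, which is the behaviour the Control and Lock rows describe.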
+
+FLR and Reset Behaviour
+-----------------------
+
+During Function Level Reset (FLR):
+
+1. ``vfio_cxl_zap_region_locked()`` is called under the write side of
+ ``memory_lock``. It sets ``region_active = false`` and calls
+ ``unmap_mapping_range()`` to invalidate all DPA region PTEs.
+
+2. Any concurrent page fault or ``read()``/``write()`` on the DPA region
+ sees ``region_active = false`` and returns ``VM_FAULT_SIGBUS`` or ``-EIO``
+ respectively.
+
+3. After the reset completes, ``vfio_cxl_reactivate_region()`` re-reads the
+   HDM decoder state from hardware into ``comp_reg_virt[]`` (typically
+   all-zeros after FLR). For pre-committed decoders it sets
+   ``region_active = true`` only if the COMMITTED bit is set in that fresh
+   snapshot. The VMM may then re-fault into the DPA region without issuing
+   a new ``mmap()`` call. Each newly faulted page is scrubbed via
+   ``memset_io()`` before the PFN is inserted.
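
The three steps above amount to a small gate around ``region_active``. The following sketch models only that gate, with no locking, PTE handling, or scrubbing. ``struct dpa_region`` and the ``region_*()`` helpers are hypothetical names chosen for illustration, and the hardware CTRL value is passed in by the caller rather than read from a device.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

#define HDM_CTRL_COMMITTED (1u << 10)

/* Hypothetical model of the per-device DPA region state. */
struct dpa_region {
    bool     region_active;
    uint32_t hdm_ctrl_shadow;  /* last snapshot of the decoder CTRL register */
};

/* Step 1: before the reset, block all further access. */
static void region_zap(struct dpa_region *r)
{
    r->region_active = false;
}

/* Step 2: read()/write() fail with -EIO while the region is inactive. */
static int region_rw_check(const struct dpa_region *r)
{
    return r->region_active ? 0 : -EIO;
}

/* Step 3: re-snapshot and re-enable only if the decoder is committed. */
static void region_reactivate(struct dpa_region *r, uint32_t hw_ctrl)
{
    r->hdm_ctrl_shadow = hw_ctrl;
    r->region_active = (hw_ctrl & HDM_CTRL_COMMITTED) != 0;
}
```

If the decoder reads back as all-zeros after FLR, the region stays inactive in this model until a later snapshot shows COMMITTED set.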
+
+VMM Integration Notes
+---------------------
+
+A VMM integrating CXL Type-2 passthrough should:
+
+1. Issue ``VFIO_DEVICE_GET_INFO`` and check ``VFIO_DEVICE_FLAGS_CXL``.
+2. Walk the capability chain to find ``VFIO_DEVICE_INFO_CAP_CXL`` (id = 6).
+3. Record ``dpa_region_index``, ``comp_regs_region_index``, ``dpa_size``,
+ ``hdm_count``, ``hdm_regs_offset``, and ``hdm_regs_size``.
+4. Map the DPA region (``dpa_region_index``) with ``mmap()`` and expose it at
+   a guest physical address. The region supports ``PROT_READ | PROT_WRITE``.
+5. Open the COMP_REGS region (``comp_regs_region_index``) and attach a
+ ``notify_change`` callback to detect COMMIT transitions. When bit 10
+ (COMMITTED) transitions from 0 to 1 in a CTRL register read, the VMM
+ should expose the corresponding DPA range to the guest and map the
+ relevant slice of the DPA mmap.
+6. For pre-committed devices (``VFIO_CXL_CAP_PRECOMMITTED`` set) the entire
+ DPA is already mapped and the VMM need not wait for a guest COMMIT.
+7. Program the guest CXL DVSEC registers (via VFIO config space write) to
+ reflect the guest's view. The kernel emulates all register semantics
+ including the CONFIG_LOCK one-shot latch.
+
+Kernel Configuration
+--------------------
+
+``CONFIG_VFIO_CXL_CORE`` (bool)
+ Enable CXL Type-2 passthrough support in ``vfio-pci-core``.
+ Depends on ``CONFIG_VFIO_PCI_CORE``, ``CONFIG_CXL_BUS``, and
+ ``CONFIG_CXL_MEM``.
+
+References
+----------
+
+* CXL Specification 3.1, §8.1.3 — DVSEC for CXL Devices
+* CXL Specification 3.1, §8.2.4.20 — CXL HDM Decoder Capability Structure
+* ``include/uapi/linux/vfio.h`` — ``VFIO_DEVICE_INFO_CAP_CXL``,
+ ``VFIO_REGION_SUBTYPE_CXL``, ``VFIO_REGION_SUBTYPE_CXL_COMP_REGS``
--
2.25.1