RE: [PATCH v2 00/20] vfio/pci: Add CXL Type-2 device passthrough support

From: Manish Honap

Date: Fri Apr 17 2026 - 11:48:58 EST

> -----Original Message-----
> From: Dan Williams <djbw@xxxxxxxxxx>
> Sent: 14 April 2026 09:39
> To: Manish Honap <mhonap@xxxxxxxxxx>; Alex Williamson
> <alwilliamson@xxxxxxxxxx>; jonathan.cameron@xxxxxxxxxx;
> dave.jiang@xxxxxxxxx; alejandro.lucero-palau@xxxxxxx; dave@xxxxxxxxxxxx;
> alison.schofield@xxxxxxxxx; vishal.l.verma@xxxxxxxxx;
> ira.weiny@xxxxxxxxx; dmatlack@xxxxxxxxxx; shuah@xxxxxxxxxx;
> jgg@xxxxxxxx; Yishai Hadas <yishaih@xxxxxxxxxx>; Shameer Kolothum Thodi
> <skolothumtho@xxxxxxxxxx>; kevin.tian@xxxxxxxxx; Ankit Agrawal
> <ankita@xxxxxxxxxx>
> Cc: Vikram Sethi <vsethi@xxxxxxxxxx>; Neo Jia <cjia@xxxxxxxxxx>; Tarun
> Gupta (SW-GPU) <targupta@xxxxxxxxxx>; Zhi Wang <zhiw@xxxxxxxxxx>;
> Krishnakant Jaju <kjaju@xxxxxxxxxx>; linux-kselftest@xxxxxxxxxxxxxxx;
> linux-kernel@xxxxxxxxxxxxxxx; linux-cxl@xxxxxxxxxxxxxxx;
> kvm@xxxxxxxxxxxxxxx; Manish Honap <mhonap@xxxxxxxxxx>; Alex Williamson
> <alex@xxxxxxxxxxx>; Jonathan Cameron <Jonathan.Cameron@xxxxxxxxxx>
> Subject: Re: [PATCH v2 00/20] vfio/pci: Add CXL Type-2 device
> passthrough support
>
> Forgive me if any of the commentary below was already hashed out in the
> v1 discussion. Your excellent changelog notes make catching up much
> easier, thanks!
>
> mhonap@ wrote:
> > From: Manish Honap <mhonap@xxxxxxxxxx>
> >
> > CXL Type-2 accelerators (e.g. CXL.mem-capable GPUs) cannot be passed
> > through to virtual machines with stock vfio-pci because the driver has
> > no concept of HDM decoder management, DPA region exposure, or
> > component register emulation. This series wires all of that into
> > vfio-pci-core behind a new CONFIG_VFIO_CXL_CORE optional module,
> > without requiring a variant driver.
> >
> > When a CXL Device DVSEC (Vendor ID 0x1E98, ID 0x0000) is detected at
> > device open time, the driver:
> >
> > - Probes the HDM Decoder Capability block in the component registers
> > and allocates a DPA region through the CXL subsystem. On devices
> > where firmware has already committed a decoder, the kernel skips
> > allocation and re-uses the committed range.
> >
> > - Builds a kernel-owned shadow of the HDM register block. The VMM
> > reads and writes this shadow through a dedicated COMP_REGS VFIO
> > region rather than touching the hardware directly. The kernel
> > enforces CXL 3.1 bit-field rules: reserved bits, read-only bits,
> > the COMMIT/COMMITTED latch, and the LOCK→0 reprogram path for
> > firmware-committed decoders.
> >
> > - Exposes the DPA range as a second VFIO region
> > (VFIO_REGION_SUBTYPE_CXL) backed by the kernel-assigned HPA. PTEs are
> > inserted lazily on first page fault and torn down atomically under
> > memory_lock during FLR.
>
> I assume, or hope this means expose a CXL region as
> VFIO_REGION_SUBTYPE_CXL, as DPA is a device-internal address space that
> VFIO probably does not need to worry about. VFIO likely only needs to
> care about system-visible resources.

Good catch - that was incorrect wording. DPA is only what we hand to the
CXL subsystem during allocation; guests never see it. I will fix the
cover letter and any comments that repeat the mistake, so the docs do
not imply VFIO is exporting DPA directly.

>
> If / when interleaving arrives for CXL accelerators the 1:1 vfio-pci to
> DPA to CXL region HPA association breaks. Okay to assume 1:1 for now.

Okay, v2 targets the single-region, non-interleaved case. I will revisit
the association model in a later revision and update the changelog to
state this assumption explicitly.

>
> > - Intercepts writes to the CXL DVSEC configuration-space registers
> > (Control, Status, Control2, Status2, Lock, Range Base) and replays
>
> Range Base is ignored when global HDM Decoder Control is enabled. I
> would hope that this enabling ditches CXL 1.x legacy wherever possible.

Noted on Range Base being ignored when global HDM Decoder Control is
enabled. I will audit the emulation path and the documentation for any
wording that implies we depend on legacy range-base behavior when global
HDM is enabled, and I will drop Range Base handling from the DVSEC
emulation path in the next revision.
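To make the intent concrete, what I have in mind is a plain deny-list in
the DVSEC write handler, roughly like the userspace sketch below. The
offsets and the helper name are illustrative placeholders, not the final
kernel code:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative placeholder offsets, not authoritative spec values. */
#define CXL_DVSEC_RANGE1_BASE_HIGH 0x20
#define CXL_DVSEC_RANGE1_BASE_LOW  0x24

/*
 * When global HDM Decoder Control is enabled, guest writes to the
 * legacy Range Base registers are silently dropped instead of being
 * replayed into the vconfig shadow.
 */
static bool dvsec_write_allowed(bool global_hdm_enabled, uint32_t offset)
{
	if (!global_hdm_enabled)
		return true;

	switch (offset) {
	case CXL_DVSEC_RANGE1_BASE_HIGH:
	case CXL_DVSEC_RANGE1_BASE_LOW:
		return false;	/* ignored under global HDM control */
	default:
		return true;
	}
}
```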

>
> > them through a per-device vconfig shadow, enforcing RWL/RW1CS/RWO
> > access semantics and the CONFIG_LOCK one-shot latch.
>
> Linux should have no need to ever trigger CXL register bit locks. That
> is only for firmware to make changes immutable if the firmware has
> requirements that nothing moves for its own purposes.
>
> Now, it makes sense to configure the vCXL device to be locked at setup,
> but I do not currently see the use case for the vBIOS to mutate and lock
> the configuration.
>
> [..]
> > - Includes selftests
>
> Yay!

Thank you! I will keep extending them as the UAPI surface stabilizes.

>
> > covering device detection, capability parsing,
> > region enumeration, HDM register emulation, DPA mmap with page-fault
> > insertion, FLR invalidation, and DVSEC register emulation.
> >
> > The series is applied on top of the cxl/next branch using the base
> > specified at the end of this cover letter plus Alejandro's v23 Type-2
> > device support patches [1].
>
> One of the sticking points of the accelerator series has been how many
> details of the CXL core internal object lifetime leak out.
>
> My hope / thought experiment is that the initial version of this
> enabling only needs to facilitate getting a VMM established CXL region
> into a guest. With that, VFIO only needs the CXL region HPA and MMIO
> layout so that CXL registers can be trapped and non-CXL registers can be
> direct mapped.

Okay, I will investigate restricting the interface to the CXL region HPA
and MMIO layout, as you suggest.

>
> > Series structure
> > ================
> >
> > Patches 1-5 extend the CXL subsystem with the APIs vfio-pci needs.
> >
> > Patches 6-8 add the vfio-pci-core plumbing (UAPI, device state,
> > Kconfig/build).
> >
> > Patches 9-15 implement the core device lifecycle: detection, HDM
> > emulation, media readiness, region management, DPA region, and DVSEC
> > emulation.
> >
> > Patches 16-18 wire everything together at open/close time and
> > populate the VFIO ioctl paths.
> >
> > Patches 19-20 add documentation and selftests.
> >
> > Changes since v1
> > ================
> [..]
> > HDM API simplification (patch 1)
> >
> > v1 exported cxl_get_hdm_reg_info() which returned a raw struct with
> > offset and size fields. v2 replaces it with cxl_get_hdm_info() which
> > uses the cached count already populated by cxl_probe_component_regs()
> > and returns a single struct with all HDM metadata, removing the need
> > for callers to re-read the hardware.
>
> What is the accelerator use case to support multiple CXL regions per
> device?

For this version there isn't one: one committed decoder, one contiguous
region, restricted to decoder 0. I will make that restriction explicit
in the changelog and think about multi-region support separately.
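Concretely, the v2 scope reduces to a single probe-time check along
these lines (the function name is made up for illustration, not the
actual patch code):

```c
#include <assert.h>
#include <errno.h>

/*
 * v2 scope: one committed decoder backing one contiguous region,
 * restricted to decoder 0. Anything else is rejected at probe time.
 */
static int vfio_cxl_check_decoder(unsigned int index)
{
	if (index != 0)
		return -EOPNOTSUPP;	/* only decoder 0 in this version */
	return 0;
}
```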

>
> In other words, it feels ambitious to support that while simultaneously
> kicking the "interleave" question down the road. If we are going for
> initial simplicity that also means single region to start.
>
> > cxl_await_range_active() split (patch 4)
> >
> > cxl_await_media_ready() requires a CXLMDEV mailbox register, which
> > Type-2 accelerators may not have. v2 splits out cxl_await_range_active()
> > so the HDM range-active poll can be used independently of the media
> > ready path.
>
> This feels like a detail vfio-pci does not need to worry about. The core
> knows that the device does not have a mailbox and the core knows it
> needs to await range ready when probing HDM. Something is broken if
> vfio-pci needs to duplicate this part of the setup.

Okay, I'll send an RFC to linux-cxl for this and refactor the patches in
the current series accordingly.

>
> > LOCK→0 transition in HDM ctrl write emulation (patch 11)
> >
> > v1 did not handle the case where a guest tries to clear the LOCK bit
> > to reprogram a firmware-committed decoder. v2 allows this transition
> > and re-programs the hardware accordingly.
>
> ? Guest has no ability to manipulate Host HPA mappings. A protocol for a
> guest to work with a host to remap HPA does not sound like a v1
> requirement. This would be equivalent to a guest asking to move a host
> PCI BAR.

Agreed. For the initial version I will drop guest LOCK-clearing support
in the next revision. If HDM remapping is ever required, we can design a
separate mechanism for it rather than overloading config-space writes.
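With that dropped, the simplified write emulation reduces to masking,
roughly as in the sketch below. Bit positions are illustrative of the
HDM Decoder n Control layout, not copied from the spec, and the helper
name is hypothetical:

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative bit positions, not the authoritative CXL 3.1 layout. */
#define HDM_CTRL_LOCK_ON_COMMIT	(1u << 8)
#define HDM_CTRL_COMMIT		(1u << 9)
#define HDM_CTRL_COMMITTED	(1u << 10)	/* read-only status */

/*
 * Guest write to the shadowed decoder control register: a latched LOCK
 * can never be cleared, and COMMITTED can never be forged by the guest.
 */
static uint32_t hdm_ctrl_write(uint32_t shadow, uint32_t val)
{
	uint32_t ro = HDM_CTRL_COMMITTED;

	if (shadow & HDM_CTRL_LOCK_ON_COMMIT)
		ro |= HDM_CTRL_LOCK_ON_COMMIT;	/* LOCK->0 is dropped */

	return (shadow & ro) | (val & ~ro);
}
```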

Manish