Re: [PATCH v4 00/28] DCD: Add support for Dynamic Capacity Devices (DCD)

From: Fan Ni
Date: Tue Oct 08 2024 - 18:57:48 EST


On Mon, Oct 07, 2024 at 06:16:06PM -0500, Ira Weiny wrote:
> A git tree of this series can be found here:
>
> https://github.com/weiny2/linux-kernel/tree/dcd-v4-2024-10-04
>
> Series info
> ===========
>

Hi Ira,

Based on current DC extent release logic, when the extent to release is
in use (for example, created a dax device), no response (4803h) will be sent.
Should we send a response with empty extent list instead?

Fan


> This series has 5 parts:
>
> Patch 1-3: Add %pra printk format for struct range
> Patch 4: Add core range_overlaps() function
> Patch 5-6: CXL clean up/prelim patches
> Patch 7-26: Core DCD support
> Patch 27-28: cxl_test support
>
> Background
> ==========
>
> A Dynamic Capacity Device (DCD) (CXL 3.1 sec 9.13.3) is a CXL memory
> device that allows memory capacity within a region to change
> dynamically without the need for resetting the device, reconfiguring
> HDM decoders, or reconfiguring software DAX regions.
>
> One of the biggest use cases for Dynamic Capacity is to allow hosts to
> share memory dynamically within a data center without increasing the
> per-host attached memory.
>
> The general flow for the addition or removal of memory is to have an
> orchestrator coordinate the use of the memory. Generally there are 5
> actors in such a system, the Orchestrator, Fabric Manager, the Logical
> device, the Host Kernel, and a Host User.
>
> Typical work flows are shown below.
>
> Orchestrator FM Device Host Kernel Host User
>
> | | | | |
> |-------------- Create region ----------------------->|
> | | | | |
> | | | |<-- Create ---|
> | | | | Region |
> |<------------- Signal done --------------------------|
> | | | | |
> |-- Add ----->|-- Add --->|--- Add --->| |
> | Capacity | Extent | Extent | |
> | | | | |
> | |<- Accept -|<- Accept -| |
> | | Extent | Extent | |
> | | | |<- Create --->|
> | | | | DAX dev |-- Use memory
> | | | | | |
> | | | | | |
> | | | |<- Release ---| <-+
> | | | | DAX dev |
> | | | | |
> |<------------- Signal done --------------------------|
> | | | | |
> |-- Remove -->|- Release->|- Release ->| |
> | Capacity | Extent | Extent | |
> | | | | |
> | |<- Release-|<- Release -| |
> | | Extent | Extent | |
> | | | | |
> |-- Add ----->|-- Add --->|--- Add --->| |
> | Capacity | Extent | Extent | |
> | | | | |
> | |<- Accept -|<- Accept -| |
> | | Extent | Extent | |
> | | | |<- Create ----|
> | | | | DAX dev |-- Use memory
> | | | | | |
> | | | |<- Release ---| <-+
> | | | | DAX dev |
> |<------------- Signal done --------------------------|
> | | | | |
> |-- Remove -->|- Release->|- Release ->| |
> | Capacity | Extent | Extent | |
> | | | | |
> | |<- Release-|<- Release -| |
> | | Extent | Extent | |
> | | | | |
> |-- Add ----->|-- Add --->|--- Add --->| |
> | Capacity | Extent | Extent | |
> | | | |<- Create ----|
> | | | | DAX dev |-- Use memory
> | | | | | |
> |-- Remove -->|- Release->|- Release ->| | |
> | Capacity | Extent | Extent | | |
> | | | | | |
> | | | (Release Ignored) | |
> | | | | | |
> | | | |<- Release ---| <-+
> | | | | DAX dev |
> |<------------- Signal done --------------------------|
> | | | | |
> | |- Release->|- Release ->| |
> | | Extent | Extent | |
> | | | | |
> | |<- Release-|<- Release -| |
> | | Extent | Extent | |
> | | | |<- Destroy ---|
> | | | | Region |
> | | | | |
>
> Implementation
> ==============
>
> The series still requires the creation of regions and DAX devices to be
> closely synchronized with the Orchestrator and Fabric Manager. The host
> kernel will reject extents if a region is not yet created. It also
> ignores extent release if memory is in use (DAX device created). These
> synchronizations are not anticipated to be an issue with real
> applications.
>
> In order to allow for capacity to be added and removed a new concept of
> a sparse DAX region is introduced. A sparse DAX region may have 0 or
> more bytes of available space. The total space depends on the number
> and size of the extents which have been added.
>
> Initially it is anticipated that users of the memory will carefully
> coordinate the surfacing of additional capacity with the creation of DAX
> devices which use that capacity. Therefore, the allocation of the
> memory to DAX devices does not allow for specific associations between
> DAX device and extent. This keeps allocations very similar to existing
> DAX region behavior.
>
> To keep the DAX memory allocation aligned with the existing DAX devices
> which do not have tags extents are not allowed to have tags. Future
> support for tags is planned.
>
> Great care was taken to keep the extent tracking simple. Some xarray's
> needed to be added but extra software objects were kept to a minimum.
>
> Region extents continue to be tracked as sub-devices of the DAX region.
> This ensures that region destruction cleans up all extent allocations
> properly.
>
> Some review tags were kept if a patch did not change.
>
> The major functionality of this series includes:
>
> - Getting the dynamic capacity (DC) configuration information from cxl
> devices
>
> - Configuring the DC partitions reported by hardware
>
> - Enhancing the CXL and DAX regions for dynamic capacity support
> a. Maintain a logical separation between hardware extents and
> software managed region extents. This provides an
> abstraction between the layers and should allow for
> interleaving in the future
>
> - Get hardware extent lists for endpoint decoders upon
> region creation.
>
> - Adjust extent/region memory available on the following events.
> a. Add capacity Events
> b. Release capacity events
>
> - Host response for add capacity
> a. do not accept the extent if:
> If the region does not exist
> or an error occurs realizing the extent
> b. If the region does exist
> realize a DAX region extent with 1:1 mapping (no
> interleave yet)
> c. Support the event more bit by processing a list of extents
> marked with the more bit together before setting up a
> response.
>
> - Host response for remove capacity
> a. If no DAX device references the extent; release the extent
> b. If a reference does exist, ignore the request.
> (Require FM to issue release again.)
>
> - Modify DAX device creation/resize to account for extents within a
> sparse DAX region
>
> - Trace Dynamic Capacity events for debugging
>
> - Add cxl-test infrastructure to allow for faster unit testing
> (See new ndctl branch for cxl-dcd.sh test[1])
>
> - Only support 0 value extent tags
>
> Fan Ni's upstream of Qemu DCD was used for testing.
>
> Remaining work:
>
> 1) Allow mapping to specific extents (perhaps based on
> label/tag)
> 1a) devise region size reporting based on tags
> 2) Interleave support
>
> Possible additional work depending on requirements:
>
> 1) Accept a new extent which extends (but overlaps) an existing
> extent(s)
> 2) Release extents when DAX devices are released if a release
> was previously seen from the device
> 3) Rework DAX device interfaces, memfd has been explored a bit
>
> [1] https://github.com/weiny2/ndctl/tree/dcd-region2-2024-10-01
>
> ---
> Major changes in v4:
> - iweiny: rebase to 6.12-rc
> - iweiny: Add qos data to regions
> - Jonathan: Fix up shared region detection
> - Jonathan/jgroves/djbw/iweiny: Ignore 0 value tags
> - iweiny: Change DCD partition sysfs entries to allow for qos class and
> additional parameters per partition
> - Petr/Andy: s/%par/%pra/
> - Andy: Share logic between printing struct resource and struct range
> - Link to v3: https://patch.msgid.link/20240816-dcd-type2-upstream-v3-0-7c9b96cba6d7@xxxxxxxxx
>
> ---
> Ira Weiny (14):
> test printk: Add very basic struct resource tests
> printk: Add print format (%pra) for struct range
> cxl/cdat: Use %pra for dpa range outputs
> range: Add range_overlaps()
> dax: Document dax dev range tuple
> cxl/pci: Delay event buffer allocation
> cxl/cdat: Gather DSMAS data for DCD regions
> cxl/region: Refactor common create region code
> cxl/events: Split event msgnum configuration from irq setup
> cxl/pci: Factor out interrupt policy check
> cxl/core: Return endpoint decoder information from region search
> dax/bus: Factor out dev dax resize logic
> tools/testing/cxl: Make event logs dynamic
> tools/testing/cxl: Add DC Regions to mock mem data
>
> Navneet Singh (14):
> cxl/mbox: Flag support for Dynamic Capacity Devices (DCD)
> cxl/mem: Read dynamic capacity configuration from the device
> cxl/core: Separate region mode from decoder mode
> cxl/region: Add dynamic capacity decoder and region modes
> cxl/hdm: Add dynamic capacity size support to endpoint decoders
> cxl/mem: Expose DCD partition capabilities in sysfs
> cxl/port: Add endpoint decoder DC mode support to sysfs
> cxl/region: Add sparse DAX region support
> cxl/mem: Configure dynamic capacity interrupts
> cxl/extent: Process DCD events and realize region extents
> cxl/region/extent: Expose region extent information in sysfs
> dax/region: Create resources on sparse DAX regions
> cxl/region: Read existing extents on region creation
> cxl/mem: Trace Dynamic capacity Event Record
>
> Documentation/ABI/testing/sysfs-bus-cxl | 120 +++-
> Documentation/core-api/printk-formats.rst | 13 +
> drivers/cxl/core/Makefile | 2 +-
> drivers/cxl/core/cdat.c | 52 +-
> drivers/cxl/core/core.h | 33 +-
> drivers/cxl/core/extent.c | 486 +++++++++++++++
> drivers/cxl/core/hdm.c | 213 ++++++-
> drivers/cxl/core/mbox.c | 605 ++++++++++++++++++-
> drivers/cxl/core/memdev.c | 130 +++-
> drivers/cxl/core/port.c | 13 +-
> drivers/cxl/core/region.c | 170 ++++--
> drivers/cxl/core/trace.h | 65 ++
> drivers/cxl/cxl.h | 122 +++-
> drivers/cxl/cxlmem.h | 131 +++-
> drivers/cxl/pci.c | 123 +++-
> drivers/dax/bus.c | 352 +++++++++--
> drivers/dax/bus.h | 4 +-
> drivers/dax/cxl.c | 72 ++-
> drivers/dax/dax-private.h | 47 +-
> drivers/dax/hmem/hmem.c | 2 +-
> drivers/dax/pmem.c | 2 +-
> fs/btrfs/ordered-data.c | 10 +-
> include/acpi/actbl1.h | 2 +
> include/cxl/event.h | 32 +
> include/linux/range.h | 7 +
> lib/test_printf.c | 70 +++
> lib/vsprintf.c | 55 +-
> tools/testing/cxl/Kbuild | 3 +-
> tools/testing/cxl/test/mem.c | 960 ++++++++++++++++++++++++++----
> 29 files changed, 3576 insertions(+), 320 deletions(-)
> ---
> base-commit: 9852d85ec9d492ebef56dc5f229416c925758edc
> change-id: 20230604-dcd-type2-upstream-0cd15f6216fd
>
> Best regards,
> --
> Ira Weiny <ira.weiny@xxxxxxxxx>
>

--
Fan Ni