Re: [PATCH v4 00/28] DCD: Add support for Dynamic Capacity Devices (DCD)
From: Fan Ni
Date: Tue Oct 08 2024 - 19:07:38 EST
On Tue, Oct 08, 2024 at 03:57:13PM -0700, Fan Ni wrote:
> On Mon, Oct 07, 2024 at 06:16:06PM -0500, Ira Weiny wrote:
> > A git tree of this series can be found here:
> >
> > https://github.com/weiny2/linux-kernel/tree/dcd-v4-2024-10-04
> >
> > Series info
> > ===========
> >
>
> Hi Ira,
>
> Based on the current DC extent release logic, when the extent to be
> released is in use (for example, a DAX device has been created on it),
> no response (4803h) will be sent.
> Should we send a response with an empty extent list instead?
>
> Fan
Oh, my bad. 4803h does not allow an empty extent list.
Fan
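
For reference, the release behavior discussed above can be sketched roughly
as follows. This is only an illustrative userspace model, not code from the
series: struct extent_model, collect_releasable() and
should_send_release_response() are hypothetical names. It assumes an extent
is busy when at least one DAX device references it, and that no 4803h
response is sent when nothing can be released, since an empty extent list
is not allowed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of a region extent; not the kernel's actual struct. */
struct extent_model {
	uint64_t dpa_start;
	uint64_t length;
	unsigned int dax_refs;	/* DAX devices currently referencing the extent */
};

/*
 * Copy every extent from the release request that is not in use into @out
 * and return how many were copied. Busy extents are skipped (ignored).
 */
size_t collect_releasable(const struct extent_model *req, size_t n,
			  struct extent_model *out)
{
	size_t count = 0;

	for (size_t i = 0; i < n; i++)
		if (req[i].dax_refs == 0)
			out[count++] = req[i];
	return count;
}

/* An empty extent list is not a valid 4803h payload, so send nothing. */
bool should_send_release_response(size_t count)
{
	return count > 0;
}

/* Small self-check showing the intended behavior. */
void release_model_selftest(void)
{
	struct extent_model req[2] = {
		{ .dpa_start = 0x0000, .length = 0x1000, .dax_refs = 1 },
		{ .dpa_start = 0x1000, .length = 0x1000, .dax_refs = 0 },
	};
	struct extent_model out[2];
	size_t count = collect_releasable(req, 2, out);

	assert(count == 1);
	assert(out[0].dpa_start == 0x1000);
	assert(should_send_release_response(count));
	assert(!should_send_release_response(0));
}
```

In this model the FM has to issue the release again for any extent that was
skipped, which matches the "(Require FM to issue release again.)" note in
the cover letter.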
>
>
> > This series has 5 parts:
> >
> > Patch 1-3: Add %pra printk format for struct range
> > Patch 4: Add core range_overlaps() function
> > Patch 5-6: CXL clean up/prelim patches
> > Patch 7-26: Core DCD support
> > Patch 27-28: cxl_test support
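
As a rough illustration of what a range_overlaps() helper for struct range
might look like: the patch body is not quoted in this mail, so the
implementation below is an assumption, written as a userspace stand-in with
its own struct range definition (inclusive start and end).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Userspace stand-in for struct range; start and end are both inclusive. */
struct range {
	uint64_t start;
	uint64_t end;
};

/* Two inclusive ranges overlap when each one starts before the other ends. */
bool range_overlaps(const struct range *r1, const struct range *r2)
{
	return r1->start <= r2->end && r1->end >= r2->start;
}

/* Small self-check: adjacent ranges do not overlap, straddling ones do. */
void range_overlaps_selftest(void)
{
	struct range a = { .start = 0x0000, .end = 0x0fff };
	struct range b = { .start = 0x1000, .end = 0x1fff };
	struct range c = { .start = 0x0800, .end = 0x17ff };

	assert(!range_overlaps(&a, &b));
	assert(range_overlaps(&a, &c));
	assert(range_overlaps(&c, &b));
}
```

A check like this is what lets extent handling detect a new extent that
collides with one already tracked.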
> >
> > Background
> > ==========
> >
> > A Dynamic Capacity Device (DCD) (CXL 3.1 sec 9.13.3) is a CXL memory
> > device that allows memory capacity within a region to change
> > dynamically without the need for resetting the device, reconfiguring
> > HDM decoders, or reconfiguring software DAX regions.
> >
> > One of the biggest use cases for Dynamic Capacity is to allow hosts to
> > share memory dynamically within a data center without increasing the
> > per-host attached memory.
> >
> > The general flow for the addition or removal of memory is to have an
> > orchestrator coordinate the use of the memory. Generally there are 5
> > actors in such a system: the Orchestrator, the Fabric Manager (FM), the
> > Logical Device, the Host Kernel, and a Host User.
> >
> > Typical workflows are shown below.
> >
> > Orchestrator     FM        Device     Host Kernel     Host User
> >
> >     |             |           |            |              |
> >     |-------------- Create region ----------------------->|
> >     |             |           |            |              |
> >     |             |           |            |<-- Create ---|
> >     |             |           |            |    Region    |
> >     |<------------- Signal done --------------------------|
> >     |             |           |            |              |
> >     |-- Add ----->|-- Add --->|--- Add --->|              |
> >     |  Capacity   |  Extent   |   Extent   |              |
> >     |             |           |            |              |
> >     |             |<- Accept -|<- Accept  -|              |
> >     |             |   Extent  |   Extent   |              |
> >     |             |           |            |<- Create --->|
> >     |             |           |            |   DAX dev    |-- Use memory
> >     |             |           |            |              |   |
> >     |             |           |            |              |   |
> >     |             |           |            |<- Release ---| <-+
> >     |             |           |            |   DAX dev    |
> >     |             |           |            |              |
> >     |<------------- Signal done --------------------------|
> >     |             |           |            |              |
> >     |-- Remove -->|- Release->|- Release ->|              |
> >     |  Capacity   |  Extent   |   Extent   |              |
> >     |             |           |            |              |
> >     |             |<- Release-|<- Release -|              |
> >     |             |   Extent  |   Extent   |              |
> >     |             |           |            |              |
> >     |-- Add ----->|-- Add --->|--- Add --->|              |
> >     |  Capacity   |  Extent   |   Extent   |              |
> >     |             |           |            |              |
> >     |             |<- Accept -|<- Accept  -|              |
> >     |             |   Extent  |   Extent   |              |
> >     |             |           |            |<- Create ----|
> >     |             |           |            |   DAX dev    |-- Use memory
> >     |             |           |            |              |   |
> >     |             |           |            |<- Release ---| <-+
> >     |             |           |            |   DAX dev    |
> >     |<------------- Signal done --------------------------|
> >     |             |           |            |              |
> >     |-- Remove -->|- Release->|- Release ->|              |
> >     |  Capacity   |  Extent   |   Extent   |              |
> >     |             |           |            |              |
> >     |             |<- Release-|<- Release -|              |
> >     |             |   Extent  |   Extent   |              |
> >     |             |           |            |              |
> >     |-- Add ----->|-- Add --->|--- Add --->|              |
> >     |  Capacity   |  Extent   |   Extent   |              |
> >     |             |           |            |<- Create ----|
> >     |             |           |            |   DAX dev    |-- Use memory
> >     |             |           |            |              |   |
> >     |-- Remove -->|- Release->|- Release ->|              |   |
> >     |  Capacity   |  Extent   |   Extent   |              |   |
> >     |             |           |            |              |   |
> >     |             |           |     (Release Ignored)     |   |
> >     |             |           |            |              |   |
> >     |             |           |            |<- Release ---| <-+
> >     |             |           |            |   DAX dev    |
> >     |<------------- Signal done --------------------------|
> >     |             |           |            |              |
> >     |             |- Release->|- Release ->|              |
> >     |             |  Extent   |   Extent   |              |
> >     |             |           |            |              |
> >     |             |<- Release-|<- Release -|              |
> >     |             |   Extent  |   Extent   |              |
> >     |             |           |            |<- Destroy ---|
> >     |             |           |            |    Region    |
> >     |             |           |            |              |
> >
> > Implementation
> > ==============
> >
> > The series still requires the creation of regions and DAX devices to be
> > closely synchronized with the Orchestrator and Fabric Manager. The host
> > kernel will reject extents if a region is not yet created. It also
> > ignores extent release if memory is in use (DAX device created). These
> > synchronizations are not anticipated to be an issue with real
> > applications.
> >
> > To allow capacity to be added and removed, a new concept, the sparse
> > DAX region, is introduced. A sparse DAX region may have 0 or more
> > bytes of available space. The total space depends on the number and
> > size of the extents which have been added.
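
To make the sizing rule concrete, here is a minimal userspace sketch. It is
not taken from the series: struct region_extent and both helpers are
hypothetical, and it assumes the total space of a sparse region is simply
the sum of the lengths of the extents added so far, with available space
being whatever is not yet allocated to DAX devices.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical region extent: an offset into the region plus a length. */
struct region_extent {
	uint64_t offset;
	uint64_t len;
};

/* Total space of a sparse region: the sum of all accepted extent lengths. */
uint64_t sparse_region_size(const struct region_extent *exts, size_t n)
{
	uint64_t total = 0;

	for (size_t i = 0; i < n; i++)
		total += exts[i].len;
	return total;
}

/* Space still available after @allocated bytes went to DAX devices. */
uint64_t sparse_region_available(uint64_t total, uint64_t allocated)
{
	return allocated >= total ? 0 : total - allocated;
}

/* Small self-check: a region with no extents has no space at all. */
void sparse_region_selftest(void)
{
	const uint64_t MiB = UINT64_C(1) << 20;
	struct region_extent exts[2] = {
		{ .offset = 0,         .len = 128 * MiB },
		{ .offset = 512 * MiB, .len = 256 * MiB },
	};
	uint64_t total = sparse_region_size(exts, 2);

	assert(sparse_region_size(exts, 0) == 0);
	assert(total == 384 * MiB);
	assert(sparse_region_available(total, 128 * MiB) == 256 * MiB);
}
```

The extents in the self-check are deliberately non-contiguous: the total
depends only on how much has been added, not on where it sits in the
region.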
> >
> > Initially it is anticipated that users of the memory will carefully
> > coordinate the surfacing of additional capacity with the creation of DAX
> > devices which use that capacity. Therefore, the allocation of the
> > memory to DAX devices does not allow for specific associations between
> > DAX device and extent. This keeps allocations very similar to existing
> > DAX region behavior.
> >
> > To keep the DAX memory allocation aligned with the existing DAX
> > devices, which do not have tags, extents are not allowed to have tags.
> > Future support for tags is planned.
> >
> > Great care was taken to keep the extent tracking simple. Some xarrays
> > needed to be added but extra software objects were kept to a minimum.
> >
> > Region extents continue to be tracked as sub-devices of the DAX region.
> > This ensures that region destruction cleans up all extent allocations
> > properly.
> >
> > Some review tags were kept if a patch did not change.
> >
> > The major functionality of this series includes:
> >
> > - Getting the dynamic capacity (DC) configuration information from cxl
> >   devices
> >
> > - Configuring the DC partitions reported by hardware
> >
> > - Enhancing the CXL and DAX regions for dynamic capacity support
> >     a. Maintain a logical separation between hardware extents and
> >        software managed region extents. This provides an
> >        abstraction between the layers and should allow for
> >        interleaving in the future
> >
> > - Get hardware extent lists for endpoint decoders upon
> >   region creation.
> >
> > - Adjust extent/region memory available on the following events.
> >     a. Add capacity events
> >     b. Release capacity events
> >
> > - Host response for add capacity
> >     a. Do not accept the extent if the region does not exist
> >        or an error occurs realizing the extent
> >     b. If the region does exist, realize a DAX region extent with
> >        1:1 mapping (no interleave yet)
> >     c. Support the event "more" bit by processing a list of extents
> >        marked with the more bit together before setting up a
> >        response.
> >
> > - Host response for remove capacity
> >     a. If no DAX device references the extent, release the extent
> >     b. If a reference does exist, ignore the request.
> >        (Require FM to issue release again.)
> >
> > - Modify DAX device creation/resize to account for extents within a
> >   sparse DAX region
> >
> > - Trace Dynamic Capacity events for debugging
> >
> > - Add cxl-test infrastructure to allow for faster unit testing
> >   (See new ndctl branch for cxl-dcd.sh test[1])
> >
> > - Only support 0 value extent tags
> >
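
The "more" bit handling in the add capacity item above can be pictured with
a small sketch. This is a hypothetical userspace model, not the series'
code: struct dc_event and batch_events() are made-up names. It assumes
events arrive in order and that one response is prepared for each run of
events that ends with an event whose more bit is clear.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical add capacity event carrying one extent and the more bit. */
struct dc_event {
	uint64_t dpa;
	uint64_t len;
	bool more;	/* further extents belong to the same group */
};

/*
 * Walk the events in order and group them into response batches. Each
 * batch ends at the first event with the more bit clear. The size of
 * every completed batch is stored in @batch_sizes (at most @max_batches
 * entries); the number of completed batches is returned. A trailing run
 * whose last event still has the more bit set is left pending.
 */
size_t batch_events(const struct dc_event *ev, size_t n,
		    size_t *batch_sizes, size_t max_batches)
{
	size_t batches = 0;
	size_t pending = 0;

	for (size_t i = 0; i < n; i++) {
		pending++;
		if (ev[i].more)
			continue;	/* keep collecting this group */
		if (batches == max_batches)
			break;		/* no room to record another batch */
		batch_sizes[batches++] = pending;
		pending = 0;
	}
	return batches;
}

/* Small self-check: three extents grouped together, then a single one. */
void batch_events_selftest(void)
{
	struct dc_event ev[4] = {
		{ .dpa = 0x0000, .len = 0x1000, .more = true  },
		{ .dpa = 0x1000, .len = 0x1000, .more = true  },
		{ .dpa = 0x2000, .len = 0x1000, .more = false },
		{ .dpa = 0x8000, .len = 0x1000, .more = false },
	};
	size_t sizes[4];
	size_t batches = batch_events(ev, 4, sizes, 4);

	assert(batches == 2);
	assert(sizes[0] == 3);
	assert(sizes[1] == 1);
}
```

Grouping before responding means the accept or reject decision can be made
for the whole list at once instead of one extent at a time.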
> > Fan Ni's upstream QEMU DCD support was used for testing.
> >
> > Remaining work:
> >
> > 1) Allow mapping to specific extents (perhaps based on
> > label/tag)
> > 1a) devise region size reporting based on tags
> > 2) Interleave support
> >
> > Possible additional work depending on requirements:
> >
> > 1) Accept a new extent which extends (but overlaps) one or more
> >    existing extents
> > 2) Release extents when DAX devices are released if a release
> >    was previously seen from the device
> > 3) Rework DAX device interfaces (memfd has been explored a bit)
> >
> > [1] https://github.com/weiny2/ndctl/tree/dcd-region2-2024-10-01
> >
> > ---
> > Major changes in v4:
> > - iweiny: rebase to 6.12-rc
> > - iweiny: Add qos data to regions
> > - Jonathan: Fix up shared region detection
> > - Jonathan/jgroves/djbw/iweiny: Ignore 0 value tags
> > - iweiny: Change DCD partition sysfs entries to allow for qos class and
> > additional parameters per partition
> > - Petr/Andy: s/%par/%pra/
> > - Andy: Share logic between printing struct resource and struct range
> > - Link to v3: https://patch.msgid.link/20240816-dcd-type2-upstream-v3-0-7c9b96cba6d7@xxxxxxxxx
> >
> > ---
> > Ira Weiny (14):
> > test printk: Add very basic struct resource tests
> > printk: Add print format (%pra) for struct range
> > cxl/cdat: Use %pra for dpa range outputs
> > range: Add range_overlaps()
> > dax: Document dax dev range tuple
> > cxl/pci: Delay event buffer allocation
> > cxl/cdat: Gather DSMAS data for DCD regions
> > cxl/region: Refactor common create region code
> > cxl/events: Split event msgnum configuration from irq setup
> > cxl/pci: Factor out interrupt policy check
> > cxl/core: Return endpoint decoder information from region search
> > dax/bus: Factor out dev dax resize logic
> > tools/testing/cxl: Make event logs dynamic
> > tools/testing/cxl: Add DC Regions to mock mem data
> >
> > Navneet Singh (14):
> > cxl/mbox: Flag support for Dynamic Capacity Devices (DCD)
> > cxl/mem: Read dynamic capacity configuration from the device
> > cxl/core: Separate region mode from decoder mode
> > cxl/region: Add dynamic capacity decoder and region modes
> > cxl/hdm: Add dynamic capacity size support to endpoint decoders
> > cxl/mem: Expose DCD partition capabilities in sysfs
> > cxl/port: Add endpoint decoder DC mode support to sysfs
> > cxl/region: Add sparse DAX region support
> > cxl/mem: Configure dynamic capacity interrupts
> > cxl/extent: Process DCD events and realize region extents
> > cxl/region/extent: Expose region extent information in sysfs
> > dax/region: Create resources on sparse DAX regions
> > cxl/region: Read existing extents on region creation
> > cxl/mem: Trace Dynamic capacity Event Record
> >
> > Documentation/ABI/testing/sysfs-bus-cxl | 120 +++-
> > Documentation/core-api/printk-formats.rst | 13 +
> > drivers/cxl/core/Makefile | 2 +-
> > drivers/cxl/core/cdat.c | 52 +-
> > drivers/cxl/core/core.h | 33 +-
> > drivers/cxl/core/extent.c | 486 +++++++++++++++
> > drivers/cxl/core/hdm.c | 213 ++++++-
> > drivers/cxl/core/mbox.c | 605 ++++++++++++++++++-
> > drivers/cxl/core/memdev.c | 130 +++-
> > drivers/cxl/core/port.c | 13 +-
> > drivers/cxl/core/region.c | 170 ++++--
> > drivers/cxl/core/trace.h | 65 ++
> > drivers/cxl/cxl.h | 122 +++-
> > drivers/cxl/cxlmem.h | 131 +++-
> > drivers/cxl/pci.c | 123 +++-
> > drivers/dax/bus.c | 352 +++++++++--
> > drivers/dax/bus.h | 4 +-
> > drivers/dax/cxl.c | 72 ++-
> > drivers/dax/dax-private.h | 47 +-
> > drivers/dax/hmem/hmem.c | 2 +-
> > drivers/dax/pmem.c | 2 +-
> > fs/btrfs/ordered-data.c | 10 +-
> > include/acpi/actbl1.h | 2 +
> > include/cxl/event.h | 32 +
> > include/linux/range.h | 7 +
> > lib/test_printf.c | 70 +++
> > lib/vsprintf.c | 55 +-
> > tools/testing/cxl/Kbuild | 3 +-
> > tools/testing/cxl/test/mem.c | 960 ++++++++++++++++++++++++++----
> > 29 files changed, 3576 insertions(+), 320 deletions(-)
> > ---
> > base-commit: 9852d85ec9d492ebef56dc5f229416c925758edc
> > change-id: 20230604-dcd-type2-upstream-0cd15f6216fd
> >
> > Best regards,
> > --
> > Ira Weiny <ira.weiny@xxxxxxxxx>
> >
>
> --
> Fan Ni
--
Fan Ni