Re: CXL Boot to Bash - Section 2: The Drivers

From: Dan Williams
Date: Wed Feb 05 2025 - 19:48:05 EST


Gregory Price wrote:
> (background reading as we build up complexity)

Thanks for this taxonomy!

>
> Driver Management - Decoders, HPA/SPA, DAX, and RAS.
>
> The Drivers
> ===========
> ----------------------
> The Story Up 'til Now.
> ----------------------
>
> When we left the Platform arena, assuming we've configured with special
> purpose memory, we are left with an entry in the memory map like so:
>
> BIOS-e820: [mem 0x000000c050000000-0x000000fcefffffff] soft reserved
> /proc/iomem: c050000000-fcefffffff : Soft Reserved
>
> This resource (see mm/resource.c) is left unused until a driver comes
> along to actually surface it to allocators (or some other interface).
>
> In our case, the drivers involved (or at least the ones we'll reference) are:
>
> drivers/base/     : device probing, memory (block) hotplug
> drivers/acpi/     : device hotplug
> drivers/acpi/numa : NUMA ACPI table info (SRAT, CEDT, HMAT, ...)
> drivers/pci/      : PCI device probing
> drivers/cxl/      : CXL device probing
> drivers/dax/      : cxl device to memory resource association
>
> We don't necessarily care about the specifics of each driver; we'll
> focus on just the aspects that ultimately affect memory management.
>
> -------------------------------
> Step 4: Basic build complexity.
> -------------------------------
> To make a long story short:
>
> CXL Build Configurations:
> CONFIG_CXL_ACPI
> CONFIG_CXL_BUS
> CONFIG_CXL_MEM
> CONFIG_CXL_PCI
> CONFIG_CXL_PORT
> CONFIG_CXL_REGION
>
> DAX Build Configurations:
> CONFIG_DEV_DAX
> CONFIG_DEV_DAX_CXL
> CONFIG_DEV_DAX_KMEM
>
> Without all of these enabled, your journey will end up cut short because
> some piece of the probe process will stop progressing.
>
> The most common misconfiguration I run into is CONFIG_DEV_DAX_CXL not
> being enabled. You end up with memory regions without dax devices.
>
> [/sys/bus/cxl/devices]# ls
> dax_region0 decoder0.0 decoder1.0 decoder2.0 .....
> dax_region1 decoder0.1 decoder1.1 decoder3.0 .....
>
> ^^^ These dax regions require `CONFIG_DEV_DAX_CXL` enabled to fully
> surface as dax devices, which can then be converted to system ram.

At least for this problem the plan is to fall back to
CONFIG_DEV_DAX_HMEM [1] which skips all of the RAS and device
enumeration benefits and just shunts EFI_MEMORY_SP over to device_dax.

There is also the panic button of efi=nosoftreserve which is the flag of
surrender if the kernel fails to parse the CXL configuration.

I am otherwise open to suggestions about a better model for how to
handle a type of memory capacity that elicits diverging opinions on
whether it should be treated as System RAM, dedicated application
memory, or some kind of cold-memory swap target.

[1]: http://lore.kernel.org/cover.1737046620.git.nathan.fontenot@xxxxxxx
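
The Step 4 option list lends itself to a quick check against a kernel
config file. A minimal sketch, assuming the usual "CONFIG_FOO=y" /
"# CONFIG_FOO is not set" config format; the function name
missing_options() is made up for illustration, and the option list is
copied from the quoted text above:

```python
import re

# Option names copied from the Step 4 list in this thread.
REQUIRED_OPTIONS = [
    "CONFIG_CXL_ACPI", "CONFIG_CXL_BUS", "CONFIG_CXL_MEM",
    "CONFIG_CXL_PCI", "CONFIG_CXL_PORT", "CONFIG_CXL_REGION",
    "CONFIG_DEV_DAX", "CONFIG_DEV_DAX_CXL", "CONFIG_DEV_DAX_KMEM",
]

def missing_options(config_text, required=REQUIRED_OPTIONS):
    """Return the required options that are not set to y or m."""
    enabled = set()
    for line in config_text.splitlines():
        # Lines like "# CONFIG_FOO is not set" do not match and are skipped.
        m = re.match(r"^(CONFIG_\w+)=([ym])$", line.strip())
        if m:
            enabled.add(m.group(1))
    return [opt for opt in required if opt not in enabled]
```

Feeding it the contents of the running kernel's config (for example
/boot/config-$(uname -r), where the distribution ships one) lists
whichever of the nine options would stall the probe chain.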

> ---------------------------------------------------------------
> Step 5: The CXL driver associating devices and iomem resources.
> ---------------------------------------------------------------
>
> The CXL driver wires up the following devices:
> root : CXL root
> portN : An intermediate or endpoint destination for accesses
> memN : memory devices
>
>
> Each device in the hierarchy may have one or more decoders
> decoderN.M : Address routing and translation devices
>
>
> The driver will also create additional objects and associations
> regionN : device-to-iomem resource mapping
> dax_regionN : region-to-dax device mapping
>
>
> Most associations built by the driver are done by validating decoders
> against each other at each point in the hierarchy.
>
> Root decoders describe memory regions and route DMA to ports.
> Intermediate decoders route DMA through CXL fabric.
> Endpoint decoders translate addresses (Host to device).
>
>
> A Root port has 1 decoder per associated CFMW in the CEDT
> decoder0.0 -> `c050000000-fcefffffff : Soft Reserved`
>
>
> A region (iomem resource mapping) can be created for these decoders
> [/sys/bus/cxl/devices/region0]# cat resource size target0
> 0xc050000000 0x3ca0000000 decoder5.0
>
>
> A dax_region surfaces these regions as a dax device
> [/sys/bus/cxl/devices/dax_region0/dax0.0]# cat resource
> 0xc050000000
>
>
> So in a simple environment with 1 device, we end up with a mapping
> that looks something like this.
>
>       root --- decoder0.0 --- region0 -- dax_region0 -- dax0
>        |            |            |
>      port1 --- decoder1.0        |
>        |            |            |
>  endpoint0 --- decoder3.0 -------/
>
>
> Much of the complexity in region creation stems from validating decoder
> programming and associating regions with targets (endpoint decoders).
>
> The take-away from this section is the existence of "decoders", of which
> there may be an arbitrary number between the root and endpoint.
>
> This will be relevant when we talk about RAS (Poison) and Interleave.

Good summary. I often look at this pile of objects and wonder "why so
complex", but then I look at the heroics of drivers/edac/. Compared to
that wide range of implementation specific quirks of various memory
controllers, the CXL object hierarchy does not look that bad.
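
The region0 values quoted above can be tied back to the Soft Reserved
iomem entry with a little arithmetic. A minimal sketch using only the
numbers from the quoted text; the helper names region_span() and
iomem_span() are hypothetical:

```python
def region_span(resource_hex, size_hex):
    """Return (start, end_inclusive) from a region's resource and size."""
    start = int(resource_hex, 16)
    size = int(size_hex, 16)
    return start, start + size - 1

def iomem_span(line):
    """Parse a /proc/iomem-style 'start-end : name' line."""
    span, name = line.split(":", 1)
    lo, hi = span.strip().split("-")
    return int(lo, 16), int(hi, 16), name.strip()

# Values as shown for region0 and /proc/iomem earlier in the thread.
start, end = region_span("0xc050000000", "0x3ca0000000")
lo, hi, name = iomem_span("c050000000-fcefffffff : Soft Reserved")
print(f"region0 {start:x}-{end:x} covers '{name}': {(start, end) == (lo, hi)}")
```

With a single device and no interleave the region and the window
coincide; with several regions per window the same comparison would
become a containment check instead of equality.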

> ---------------------------------------------------------------
> Step 6: DAX surfacing Memory Blocks - First bit of User Policy.
> ---------------------------------------------------------------
>
> The last step in surfacing memory to allocators is to convert a dax
> device into memory blocks. On most default kernel builds, dax devices
> are not automatically converted to SystemRAM.

I thought most distributions are shipping with
CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE, or the default online udev rule?
For example Fedora is CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE=y and RHEL is
CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE=n, but with the udev hotplug rule.

> Policy Choices
>    userland policy: daxctl
>    default-online : CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE
>                     or
>                     CONFIG_MHP_DEFAULT_ONLINE_TYPE_*
>                     or
>                     memhp_default_state=*
>
> To convert a dax device to SystemRAM utilizing daxctl:
>
> daxctl online-memory dax0.0 [--no-movable]

On RHEL at least it finds that udev already took care of it.
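
Whether udev or a default-online setting already took care of the
blocks can be read back from sysfs. A minimal sketch, assuming the
standard /sys/devices/system/memory layout (memoryN/state and
auto_online_blocks); the function names are made up for illustration:

```python
import os
from collections import Counter

def memory_block_states(sysfs_root="/sys/devices/system/memory"):
    """Count memory blocks by the contents of their 'state' file."""
    counts = Counter()
    for entry in os.listdir(sysfs_root):
        state_path = os.path.join(sysfs_root, entry, "state")
        # Only memoryN directories carry a 'state' file.
        if entry.startswith("memory") and os.path.isfile(state_path):
            with open(state_path) as f:
                counts[f.read().strip()] += 1
    return counts

def auto_online_policy(sysfs_root="/sys/devices/system/memory"):
    """Return the kernel default-online policy, or None if not exposed."""
    path = os.path.join(sysfs_root, "auto_online_blocks")
    if not os.path.isfile(path):
        return None
    with open(path) as f:
        return f.read().strip()
```

If auto_online_policy() reports "offline" while every block counted by
memory_block_states() is online, something in userland (a udev rule or
daxctl) did the onlining rather than the kernel default.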

>
> By default the memory will online into ZONE_MOVABLE
> The --no-movable option will online the memory in ZONE_NORMAL
>
>
> Alternatively, this can be done at Build or Boot time using
> CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE (v6.13 or below)
> CONFIG_MHP_DEFAULT_ONLINE_TYPE_* (v6.14 or above)
> memhp_default_state=* (boot param predating cxl)

Oh, TIL the new CONFIG_MHP_DEFAULT_ONLINE_TYPE_* option.

>
> I will save the discussion of ZONE selection to the next section,
> which will cover more memory-hotplug specifics.
>
> At this point, the memory blocks are exposed to the kernel mm allocators
> and may be used as normal System RAM.
>
>
> ---------------------------------------------------------
> Second bit of nuanced complexity: Memory Block Alignment.
> ---------------------------------------------------------
> In section 1, we introduced CEDT / CFMW and how they map to iomem
> resources. In this section we discussed how we surface memory blocks
> to the kernel allocators.
>
> However, at no time did platform, arch code, and driver communicate
> about the expected size of a memory block. In most cases, the size
> of a memory block is defined by the architecture - unaware of CXL.
>
> On x86, for example, the heuristic for memory block size is:
> 1) user boot-arg value
> 2) Maximize size (up to 2GB) if operating on bare metal
> 3) Use smallest value that aligns with the end of memory
>
> The problem is that [SOFT RESERVED] memory is not considered in the
> alignment calculation - and not all [SOFT RESERVED] memory *should*
> be considered for alignment.
>
> In the case of our working example (real system, btw):
>
> Subtable Type : 01 [CXL Fixed Memory Window Structure]
> Window base address : 000000C050000000
> Window size : 0000003CA0000000
>
> The base is 256MB aligned (the minimum for the CXL Spec), and the
> window size is 512MB aligned. This results in a loss of almost a full memory
> block worth of memory (~1280MB on the front, and ~512MB on the back).
>
> This is a loss of ~0.7% of capacity (1.5GB) for that region (121.25GB).

This feels like an example of "hey platform vendors, I understand
that the spec grants you the freedom to misalign, please refrain from
taking advantage of that freedom".
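
For anyone wanting to redo the alignment arithmetic on their own
platform, here is a minimal sketch. It assumes memory can only be
onlined in whole, naturally aligned blocks, and takes the block size as
a parameter since the real value depends on the arch heuristic (it can
be read from /sys/devices/system/memory/block_size_bytes). The function
name alignment_loss() is hypothetical:

```python
def alignment_loss(base, size, block_size):
    """Return (front, back) bytes that cannot form whole memory blocks."""
    end = base + size
    first_block = -(-base // block_size) * block_size  # round base up
    last_block = (end // block_size) * block_size      # round end down
    if first_block >= last_block:
        return size, 0  # window too small to hold one aligned block
    return first_block - base, end - last_block

MiB = 1 << 20
# Window base and size from the CFMW structure quoted above.
base, size = 0xC050000000, 0x3CA0000000
for block_mib in (1024, 2048):  # candidate block sizes, an assumption
    front, back = alignment_loss(base, size, block_mib * MiB)
    print(f"block {block_mib} MiB: front {front // MiB} MiB, "
          f"back {back // MiB} MiB")
```

The loss grows with the block size, so a window that is only 256MB
aligned is cheapest when the block size is small and most expensive at
the 2GB maximum.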

>
> [1] has been proposed to allow for drivers (specifically ACPI) to advise
> the memory hotplug system on the suggested alignment, and for arch code
> to choose how to utilize this advisement.
>
> [1] https://lore.kernel.org/linux-mm/20250127153405.3379117-1-gourry@xxxxxxxxxx/
>
>
> --------------------------------------------------------------------
> The Complexity story up til now (what's likely to show up in slides)
> --------------------------------------------------------------------
> Platform and BIOS:
>     May configure all the devices prior to kernel hand-off.
>     May or may not support reconfiguring / hotplug.
> BIOS and EFI:
>     EFI_MEMORY_SP - used to defer management to drivers
> Kernel Build and Boot:
>     CONFIG_EFI_SOFT_RESERVE=n - Will always result in CXL as SystemRAM
>     nosoftreserve             - Will always result in CXL as SystemRAM
>     kexec                     - SystemRAM configs carry over to target
> Driver Build Options Required
>     CONFIG_CXL_ACPI
>     CONFIG_CXL_BUS
>     CONFIG_CXL_MEM
>     CONFIG_CXL_PCI
>     CONFIG_CXL_PORT
>     CONFIG_CXL_REGION
>     CONFIG_DEV_DAX
>     CONFIG_DEV_DAX_CXL
>     CONFIG_DEV_DAX_KMEM
> User Policy
>     CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE (<=v6.13)
>     CONFIG_MHP_DEFAULT_ONLINE_TYPE       (>=v6.14)
>     memhp_default_state                  (boot param)
>     daxctl online-memory daxN.Y          (userland)

memory hotplug udev rule (userland)

> Nuances
>     Early-boot resource re-use
>     Memory Block Alignment
>
> --------------------------------------------------------------------
> Next Up:
> Memory (Block) Hotplug - Zones and Kernel Use of CXL
> RAS - Poison, MCE, and why you probably want CXL=ZONE_MOVABLE
> Interleave - RAS and Region Management (Hotplug-ability)

Really appreciate you organizing all of this information.