[LSF/MM] CXL Boot to Bash - Section 0a: CFMWS and NUMA Flexibility

From: Gregory Price
Date: Fri Mar 07 2025 - 22:23:18 EST

In reply to: Gregory Price: "Re: [LSF/MM] CXL Boot to Bash - Section 0: ACPI and Linux Resources"

In the last section we discussed how the CEDT CFMWS and SRAT Memory
Affinity structures are used by Linux to "create" NUMA nodes (or at
least mark them as possible). However, the examples I used suggested
that there was a 1-to-1 relationship between CFMWS and devices or
host bridges.

This is not true - in fact, CFMWS are simply a carve-out of System
Physical Address space which may be used to map any number of endpoint
devices behind the associated Host Bridge(s).

The limiting factor is what your platform vendor BIOS supports.

This section describes a handful of *possible* configurations, what NUMA
structure they will create, and what flexibility this provides.

All of these CFMWS configurations are made up, and may or may not exist
in real machines. They are a conceptual teaching tool, not a roadmap.

(When discussing interleave in this section, please note that I am
intentionally omitting details about decoder programming, as this
will be covered later.)

-------------------------------
One 2GB Device, Multiple CFMWS.
-------------------------------
Let's imagine we have one 2GB device attached to a host bridge.

In this example, the device hosts 2GB of persistent memory - but we
might want the flexibility to map capacity as volatile or persistent.

The platform vendor may decide that they want to reserve two entirely
separate system physical address ranges to represent the capacity.

```
Subtable Type : 01 [CXL Fixed Memory Window Structure]
Reserved : 00
Length : 002C
Reserved : 00000000
Window base address : 0000000100000000 <- Memory Region
Window size : 0000000080000000 <- 2GB
Interleave Members (2^n) : 00
Interleave Arithmetic : 00
Reserved : 0000
Granularity : 00000000
Restrictions : 0006 <- Bit(2) - Volatile
QtgId : 0001
First Target : 00000007 <- Host Bridge _UID

Subtable Type : 01 [CXL Fixed Memory Window Structure]
Reserved : 00
Length : 002C
Reserved : 00000000
Window base address : 0000000200000000 <- Memory Region
Window size : 0000000080000000 <- 2GB
Interleave Members (2^n) : 00
Interleave Arithmetic : 00
Reserved : 0000
Granularity : 00000000
Restrictions : 000A <- Bit(3) - Persistent
QtgId : 0001
First Target : 00000007 <- Host Bridge _UID

NUMA effect: 2 nodes marked POSSIBLE (1 for each CFMWS)
```
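
The Restrictions values in the two tables above are bitmasks. As a
rough sketch (not kernel code), they can be decoded like this. Bits 2
and 3 match the annotations above; the names for the other bits are
from my reading of the CXL specification and should be treated as
assumptions to verify there. `RESTRICTION_BITS` and
`decode_restrictions` are hypothetical names used only for illustration.

```python
# Bit positions in the CFMWS "Restrictions" field. Bits 2 and 3 match the
# annotations in the tables above; the remaining names are assumptions
# recalled from the CXL specification and should be checked against it.
RESTRICTION_BITS = {
    0: "device-coherent (Type 2) memory",
    1: "host-only coherent (Type 3) memory",
    2: "volatile",
    3: "persistent",
    4: "fixed device configuration",
}


def decode_restrictions(value):
    """Return the names of all restriction bits set in `value`."""
    return [name for bit, name in RESTRICTION_BITS.items() if value & (1 << bit)]


print(decode_restrictions(0x0006))  # first window
print(decode_restrictions(0x000A))  # second window
```

Both windows share bit 1 and differ only in bit 2 versus bit 3, which
is what separates the volatile window from the persistent one.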

You might have a CEDT with two CFMWS as above, where the base addresses
are `0x100000000` and `0x200000000` respectively, and where each window
is large enough to cover the entire 2GB capacity of the device. This
gives the user flexibility in where the memory is mapped, depending on
whether it is mapped as volatile or persistent, while keeping the two
SPA ranges separate.

This is allowed because the endpoint decoders commit device physical
address space *in order*, meaning no region of device physical address
space can be mapped to more than one system physical address range.

i.e.: DPA(0) can only map to SPA(0x200000000) xor SPA(0x100000000)

(See Section 2a - decoder programming).
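
To make the "in order" rule concrete, here is a toy model. It is only a
sketch under the assumption stated above (DPA is handed out from the
bottom up, and each byte is committed once). The `Endpoint` class and
its `commit` method are made up for illustration and do not correspond
to any kernel interface.

```python
GiB = 1 << 30


class Endpoint:
    """Toy model of an endpoint whose decoders commit DPA strictly in order."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.next_dpa = 0    # first DPA that has not been committed yet
        self.mappings = []   # list of (dpa_start, size, spa_base)

    def commit(self, size, spa_base):
        """Commit the next `size` bytes of DPA at system address `spa_base`."""
        if self.next_dpa + size > self.capacity:
            raise ValueError("no uncommitted DPA left on this device")
        self.mappings.append((self.next_dpa, size, spa_base))
        self.next_dpa += size


dev = Endpoint(capacity=2 * GiB)
dev.commit(2 * GiB, 0x200000000)      # whole device in the persistent window
try:
    dev.commit(2 * GiB, 0x100000000)  # the same DPA cannot also appear here
except ValueError as err:
    print("rejected:", err)
```

Because every commit consumes the next free DPA, a given DPA range ends
up in exactly one window, which is why two full-size windows for one
device do not conflict.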

-------------------------------
Two Devices On One Host Bridge.
-------------------------------
Let's say we have two CXL 2GB devices behind a single host bridge, and we
may or may not want to interleave some or all of those devices.

There are (at least) 2 ways to provide this flexibility.

First, we might simply have two CFMWS.
```
Subtable Type : 01 [CXL Fixed Memory Window Structure]
Reserved : 00
Length : 002C
Reserved : 00000000
Window base address : 0000000100000000 <- Memory Region
Window size : 0000000080000000 <- 2GB
Interleave Members (2^n) : 00
Interleave Arithmetic : 00
Reserved : 0000
Granularity : 00000000
Restrictions : 0006 <- Bit(2) - Volatile
QtgId : 0001
First Target : 00000007 <- Host Bridge _UID

Subtable Type : 01 [CXL Fixed Memory Window Structure]
Reserved : 00
Length : 002C
Reserved : 00000000
Window base address : 0000000200000000 <- Memory Region
Window size : 0000000080000000 <- 2GB
Interleave Members (2^n) : 00
Interleave Arithmetic : 00
Reserved : 0000
Granularity : 00000000
Restrictions : 0006 <- Bit(2) - Volatile
QtgId : 0001
First Target : 00000007 <- Host Bridge _UID

NUMA effect: 2 nodes marked POSSIBLE (1 for each CFMWS)
```

These CFMWS target the same host bridge, but are NOT necessarily limited
to mapping memory from any one device. We could program decoders in
either of the following ways.

Example: Host bridge and endpoints are programmed WITHOUT interleave.
```
Decoders
                             CXL Root
                           /          \
                 decoder0.0            decoder1.0
  [0x100000000, 0x17FFFFFFF]          [0x200000000, 0x27FFFFFFF]
                           \          /
                            Host Bridge
                           /          \
                 decoder2.0            decoder2.1
  [0x100000000, 0x17FFFFFFF]          [0x200000000, 0x27FFFFFFF]
                     |                      |
                 Endpoint 0            Endpoint 1
                     |                      |
                 decoder4.0            decoder5.0
  [0x100000000, 0x17FFFFFFF]          [0x200000000, 0x27FFFFFFF]

NUMA effect:
All of Endpoint 0 memory is on NUMA node A
All of Endpoint 1 memory is on NUMA node B
```

Alternatively, these decoders could be programmed to interleave memory
accesses across endpoints. We'll cover this configuration in-depth
later. For now, just know the above structure means that each endpoint
has its own NUMA node - but this is not required.
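
The non-interleaved decoder ranges in the diagram above follow directly
from each window's base address and size. A quick sketch of the
arithmetic (`window_range` is just an illustrative helper name):

```python
def window_range(base, size):
    """Return the inclusive (start, end) SPA range covered by a window."""
    return base, base + size - 1


# (base, size) pairs copied from the two CFMWS entries above.
windows = [
    (0x100000000, 0x80000000),
    (0x200000000, 0x80000000),
]

for base, size in windows:
    start, end = window_range(base, size)
    print(f"[0x{start:X}, 0x{end:X}]")
```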

-------------------------------------------------------------
Two Devices On One Host Bridge - With and Without Interleave.
-------------------------------------------------------------
What if we wanted some capacity on each endpoint hosted on its own NUMA
node, and wanted to interleave a portion of each device capacity?

We could produce the following CFMWS configuration.
```
Subtable Type : 01 [CXL Fixed Memory Window Structure]
Reserved : 00
Length : 002C
Reserved : 00000000
Window base address : 0000000100000000 <- Memory Region 1
Window size : 0000000080000000 <- 2GB
Interleave Members (2^n) : 00
Interleave Arithmetic : 00
Reserved : 0000
Granularity : 00000000
Restrictions : 0006 <- Bit(2) - Volatile
QtgId : 0001
First Target : 00000007 <- Host Bridge _UID

Subtable Type : 01 [CXL Fixed Memory Window Structure]
Reserved : 00
Length : 002C
Reserved : 00000000
Window base address : 0000000200000000 <- Memory Region 2
Window size : 0000000080000000 <- 2GB
Interleave Members (2^n) : 00
Interleave Arithmetic : 00
Reserved : 0000
Granularity : 00000000
Restrictions : 0006 <- Bit(2) - Volatile
QtgId : 0001
First Target : 00000007 <- Host Bridge _UID

Subtable Type : 01 [CXL Fixed Memory Window Structure]
Reserved : 00
Length : 002C
Reserved : 00000000
Window base address : 0000000300000000 <- Memory Region 3
Window size : 0000000100000000 <- 4GB
Interleave Members (2^n) : 00
Interleave Arithmetic : 00
Reserved : 0000
Granularity : 00000000
Restrictions : 0006 <- Bit(2) - Volatile
QtgId : 0001
First Target : 00000007 <- Host Bridge _UID

NUMA effect: 3 nodes marked POSSIBLE (1 for each CFMWS)
```

In this configuration, we could still do what we did with the prior
configuration (2 CFMWS), but we could also use the third root decoder
to simplify decoder programming of interleave.

Since the third region has sufficient capacity (4GB) to cover both
devices (2GB each), we can map the entire capacity of both devices
into that region.

We'll discuss this decoder structure in-depth in Section 4.
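
As a sanity check on the capacity reasoning, here is a small sketch
that tests whether a given split of device capacity fits in the three
windows. The window sizes come from the tables above; the 1GB/1GB split
in `mixed` is a made-up example, and `plan_fits` is a hypothetical
helper, not something the kernel provides.

```python
GiB = 1 << 30

# Window sizes taken from the three CFMWS entries above.
WINDOW_SIZES = {"region1": 2 * GiB, "region2": 2 * GiB, "region3": 4 * GiB}


def plan_fits(plan):
    """`plan` maps a window name to the per-device byte counts placed in it."""
    return all(sum(parts) <= WINDOW_SIZES[name] for name, parts in plan.items())


# Hypothetical split: 1GB of each device on its own node, the rest interleaved.
mixed = {"region1": [1 * GiB], "region2": [1 * GiB], "region3": [1 * GiB, 1 * GiB]}

# Everything from both devices interleaved in the third window.
all_interleaved = {"region3": [2 * GiB, 2 * GiB]}

print(plan_fits(mixed), plan_fits(all_interleaved))
```

Both plans fit, which is the flexibility the third window buys. Trying
to place both full devices into one of the 2GB windows would not.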

-------------------------------------
Two devices on separate host bridges.
-------------------------------------
We may have placed the devices on separate host bridges.

In this case we may naturally have one CFMWS per host bridge.

```
Subtable Type : 01 [CXL Fixed Memory Window Structure]
Reserved : 00
Length : 002C
Reserved : 00000000
Window base address : 0000000100000000 <- Memory Region 1
Window size : 0000000080000000 <- 2GB
Interleave Members (2^n) : 00
Interleave Arithmetic : 00
Reserved : 0000
Granularity : 00000000
Restrictions : 0006 <- Bit(2) - Volatile
QtgId : 0001
First Target : 00000007 <- Host Bridge _UID

Subtable Type : 01 [CXL Fixed Memory Window Structure]
Reserved : 00
Length : 002C
Reserved : 00000000
Window base address : 0000000200000000 <- Memory Region 2
Window size : 0000000080000000 <- 2GB
Interleave Members (2^n) : 00
Interleave Arithmetic : 00
Reserved : 0000
Granularity : 00000000
Restrictions : 0006 <- Bit(2) - Volatile
QtgId : 0001
First Target : 00000006 <- Host Bridge _UID

NUMA effect: 2 nodes marked POSSIBLE (1 for each CFMWS)
```

But we may also want to interleave *across* host bridges. To do this,
the platform vendor may add the following CFMWS (either by itself if
done statically, or in addition to the above two for flexibility).

```
Subtable Type : 01 [CXL Fixed Memory Window Structure]
Reserved : 00
Length : 002C
Reserved : 00000000
Window base address : 0000000300000000 <- Memory Region
Window size : 0000000100000000 <- 4GB
Interleave Members (2^n) : 01 <- 2-way interleave
Interleave Arithmetic : 00
Reserved : 0000
Granularity : 00000000
Restrictions : 0006 <- Bit(2) - Volatile
QtgId : 0001
First Target : 00000007 <- Host Bridge 7
Next Target : 00000006 <- Host Bridge 6

NUMA effect: an additional NUMA node marked POSSIBLE
```

This greatly simplifies the decoder programming structure, and allows
us to aggregate bandwidth across host bridges. The decoder programming
might look as follows in this setup.

```
Decoders:
                             CXL Root
                                |
                            decoder0.0
                    [0x300000000, 0x3FFFFFFFF]
                           /          \
              Host Bridge 7            Host Bridge 6
                   /                        \
              decoder1.0                 decoder2.0
  [0x300000000, 0x3FFFFFFFF]          [0x300000000, 0x3FFFFFFFF]
                   |                         |
              Endpoint 0                 Endpoint 1
                   |                         |
              decoder3.0                 decoder4.0
  [0x300000000, 0x3FFFFFFFF]          [0x300000000, 0x3FFFFFFFF]
```

We'll discuss this more in-depth in Section 4 - but you can see how
straightforward this is. All the decoders are programmed the same.
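
For intuition only, here is a sketch of how the interleaved CFMWS above
spreads accesses across the two host bridges. It assumes the field
encodings as I understand them from the CXL specification: Interleave
Arithmetic 00 is plain modulo, the member count is 2^n, and a
Granularity value of 0 means 256 bytes. It deliberately ignores decoder
programming, and both function names are made up.

```python
def decode_interleave(members_field, granularity_field):
    """Decode the CFMWS interleave fields (modulo arithmetic assumed)."""
    ways = 1 << members_field               # "Interleave Members (2^n)"
    granularity = 256 << granularity_field  # assumption: 0 encodes 256 bytes
    return ways, granularity


def target_for(spa, base, ways, granularity, targets):
    """Pick the host bridge that owns `spa` under modulo interleave."""
    index = ((spa - base) // granularity) % ways
    return targets[index]


ways, gran = decode_interleave(0x01, 0x00000000)
targets = [7, 6]  # First Target, Next Target from the table above

for spa in (0x300000000, 0x300000100, 0x300000200):
    bridge = target_for(spa, 0x300000000, ways, gran, targets)
    print(f"0x{spa:X} -> host bridge {bridge}")
```

Consecutive 256-byte chunks alternate between host bridge 7 and host
bridge 6, which is where the aggregated bandwidth comes from.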

----------
SRAT Note.
----------
If you remember from the first portion of Section 0, the SRAT may be
used to statically assign memory regions to specific proximity domains.

```
Subtable Type : 01 [Memory Affinity]
Length : 28
Proximity Domain : 00000001 <- NUMA Node 1
Reserved1 : 0000
Base Address : 000000C050000000 <- Physical Memory Region
Address Length : 0000003CA0000000
```

There is a careful dance between the CEDT and SRAT tables and how NUMA
nodes are created. If things don't look quite the way you expect - check
the SRAT Memory Affinity entries and CEDT CFMWS to determine what your
platform actually supports in terms of flexible topologies.
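
When debugging this, one quick check is whether any SRAT Memory
Affinity range overlaps a CFMWS window. The sketch below assumes (based
on my understanding of the kernel's behavior, so treat it as an
assumption) that a window with no SRAT description is the case where a
new node gets marked possible. The helper names are hypothetical, and
the third window is invented to show an overlapping case.

```python
def overlaps(a_base, a_size, b_base, b_size):
    """True if the two [base, base + size) ranges share any address."""
    return a_base < b_base + b_size and b_base < a_base + a_size


def windows_without_srat(cfmws, srat):
    """Return the CFMWS (base, size) pairs that no SRAT entry touches."""
    return [w for w in cfmws if not any(overlaps(*w, *m) for m in srat)]


# SRAT Memory Affinity entry from the example above: (base, length).
srat = [(0xC050000000, 0x3CA0000000)]

cfmws = [
    (0x100000000, 0x80000000),    # window from the earlier examples
    (0x200000000, 0x80000000),    # window from the earlier examples
    (0xC050000000, 0x80000000),   # invented window that SRAT does describe
]

for base, size in windows_without_srat(cfmws, srat):
    print(f"no SRAT description for window at 0x{base:X}")
```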

--------
Summary.
--------
In the first part of Section 0 we showed how CFMWS and SRAT affect how
Linux creates NUMA nodes. Here we demonstrated that CFMWS do not have a
1-to-1 relationship with either CXL devices or Host Bridges.

Instead, CFMWS are simply a System Physical Address carve-out which can
be used in a number of ways to define your memory topology in software.

This is a core piece of the "Software Defined Memory" puzzle.

How your platform vendor decides to program the CEDT will dictate how
flexibly you can manage CXL devices in software.

~Gregory
