Re: [LSF/MM] CXL Boot to Bash - Section 0a: CFMWS and NUMA Flexibility

From: Jonathan Cameron
Date: Fri Mar 14 2025 - 07:09:54 EST


On Thu, 13 Mar 2025 14:17:57 -0400
Gregory Price <gourry@xxxxxxxxxx> wrote:

> On Thu, Mar 13, 2025 at 05:20:04PM +0000, Jonathan Cameron wrote:
> > Gregory Price <gourry@xxxxxxxxxx> wrote:
> >
> > > -------------------------------
> > > One 2GB Device, Multiple CFMWS.
> > > -------------------------------
> > > Let's imagine we have one 2GB device attached to a host bridge.
> > >
> > > In this example, the device hosts 2GB of persistent memory - but we
> > > might want the flexibility to map capacity as volatile or persistent.
> >
> > Fairly sure we block persistent in a volatile CFMWS in the kernel.
> > Any bios actually does this?
> >
> > You might have a variable partition device but I thought in kernel at
> > least we decided that no one was building that crazy?
> >
>
> This was an example I pulled from Dan's notes elsewhere (I think).
>
> I was unaware that we blocked mapping persistent as volatile. I was
> working off the assumption that it could be flexibly mapped similar to...
> er... older, non-CXL hardware... cough.

You can use it as volatile, but that doesn't mean we allow it in a CFMWS
that says the host PA range is not suitable for persistent.
A BIOS might do this, though, I think.
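
To make the distinction concrete: the gate is the Window Restrictions field
in the CEDT CFMWS entry. A minimal standalone sketch of the check (the bit
encoding mirrors the CFMWS Window Restrictions field that ACPICA exposes as
ACPI_CEDT_CFMWS_RESTRICT_*; the helper names are made up for illustration,
not the actual cxl_acpi code):

/*
 * Rough sketch, not kernel code: which region types can a given CFMWS
 * window host?  Bit values follow the CEDT CFMWS Window Restrictions
 * encoding.
 */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define CFMWS_RESTRICT_TYPE2    (1u << 0)
#define CFMWS_RESTRICT_TYPE3    (1u << 1)
#define CFMWS_RESTRICT_VOLATILE (1u << 2)
#define CFMWS_RESTRICT_PMEM     (1u << 3)
#define CFMWS_RESTRICT_FIXED    (1u << 4)

static bool window_allows_pmem(uint16_t restrictions)
{
	/* A persistent region only lands in a window flagged for pmem. */
	return restrictions & CFMWS_RESTRICT_PMEM;
}

static bool window_allows_volatile(uint16_t restrictions)
{
	/* A volatile region only lands in a window flagged volatile. */
	return restrictions & CFMWS_RESTRICT_VOLATILE;
}

int main(void)
{
	/*
	 * Volatile-only window: persistent *capacity* can still be used
	 * as volatile through it, but a pmem region can't be created there.
	 */
	uint16_t volatile_only = CFMWS_RESTRICT_TYPE3 | CFMWS_RESTRICT_VOLATILE;

	printf("pmem region allowed:     %d\n", window_allows_pmem(volatile_only));
	printf("volatile region allowed: %d\n", window_allows_volatile(volatile_only));
	return 0;
}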

>
> > Maybe a QoS split is a better example to motivate one range, two places?
> >
>
> That probably makes sense?
>
> > > -------------------------------------------------------------
> > > Two Devices On One Host Bridge - With and Without Interleave.
> > > -------------------------------------------------------------
> > > What if we wanted some capacity on each endpoint hosted on its own NUMA
> > > node, and wanted to interleave a portion of each device capacity?
> >
> > If anyone hits the lock on commit (i.e. annoying BIOS) the ordering
> > checks on HPA kick in here and restrict flexibility a lot
> > (assuming I understand them correctly that is)
> >
> > This is a good illustration of why we should at some point revisit
> > multiple NUMA nodes per CFMWS. We have to burn SPA space just
> > to get nodes. From a spec point of view all that is needed here
> > is a single CFMWS.
> >
>
> Along with the above note, and as mentioned on discord, I think this
> whole section naturally evolves into a library of "Sane configurations"
> and "We promise nothing for `reasons`" configurations.

:) The snag is that, as Dan pointed out on discord, we assume those
ordering checks apply even without the lock. So it is possible to have
device and host hardware combinations where things are forced to be
very non-intuitive.
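
To make that constraint concrete, my reading of the ordering rule is that
committed HDM decoders on a device must cover increasing, non-overlapping
HPA ranges by decoder index, so anything already committed pins where later
regions can land. A rough standalone sketch (struct and helper names made
up for illustration, not kernel code):

/*
 * HPA ordering as I understand it: decoder N may only be committed at an
 * HPA range above every committed lower-index decoder and below every
 * committed higher-index decoder.
 */
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct hdm_decoder {
	uint64_t hpa_base;
	uint64_t hpa_size;
	bool committed;
};

/* Can decoder[idx] be committed at [base, base + size) without
 * violating the ordering of already-committed decoders? */
static bool hpa_order_ok(const struct hdm_decoder *dec, size_t nr,
			 size_t idx, uint64_t base, uint64_t size)
{
	for (size_t i = 0; i < idx; i++)
		if (dec[i].committed &&
		    dec[i].hpa_base + dec[i].hpa_size > base)
			return false;	/* would sit below a lower-index decoder */

	for (size_t i = idx + 1; i < nr; i++)
		if (dec[i].committed && dec[i].hpa_base < base + size)
			return false;	/* would sit above a higher-index decoder */

	return true;
}

So if decoder 0 ends up committed into the higher CFMWS window, any lower
window becomes unreachable for that device, which is where the
non-intuitive combinations come from.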


>
> Maybe that turns into a kernel doc section that requires updating if
> a platform disagrees / comes up with new sane configurations. This is
> certainly the most difficult area to lock down because we have no idea
> who is going to `innovate` and how.

Yup. It gets much more 'fun' once DCD partitions/regions enter the game,
as there are many more types of memory.

Jonathan

>
> ~Gregory