Re: CXL Boot to Bash - Section 2: The Drivers
From: Gregory Price
Date: Thu Feb 06 2025 - 10:59:57 EST
On Wed, Feb 05, 2025 at 04:47:17PM -0800, Dan Williams wrote:
> Gregory Price wrote:
> > [/sys/bus/cxl/devices]# ls
> > dax_region0 decoder0.0 decoder1.0 decoder2.0 .....
> > dax_region1 decoder0.1 decoder1.1 decoder3.0 .....
> >
> > ^^^ These dax regions require `CONFIG_DEV_DAX_CXL` enabled to fully
> > surface as dax devices, which can then be converted to system ram.
>
> At least for this problem the plan is to fall back to
> CONFIG_DEV_DAX_HMEM [1] which skips all of the RAS and device
> enumeration benefits and just shunts EFI_MEMORY_SP over to device_dax.
>
Hm, would this actually happen in the scenario where CONFIG_DEV_DAX_CXL
is not enabled but everything else is? The region0 still gets created
and associated with the resource, but the dax_region0 never gets
created.

On one system I have, I see the following:

c050000000-fcefffffff : Soft Reserved
  c050000000-fcefffffff : CXL Window 0
    c050000000-fcefffffff : region0
      c050000000-fcefffffff : dax0.0
        c050000000-fcefffffff : System RAM (kmem)
fcf0000000-ffffffffff : Reserved
10000000000-1035fffffff : Soft Reserved
  10000000000-1035fffffff : CXL Window 1
    10000000000-1035fffffff : region1
      10000000000-1035fffffff : dax1.0
        10000000000-1035fffffff : System RAM (kmem)

I would expect the above HMEM/shunt to only work if everything down
through CXL Window 0 is torn down.

But if CONFIG_DEV_DAX_CXL is not enabled, everything "succeeds"; it just
doesn't "Do what you want"(TM) - the dax0.0 and System RAM (kmem)
entries are absent.

It makes me wonder whether the driver build is over-componentized.
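
As a rough illustration of the "succeeds but does nothing" case, the
sketch below scans /proc/iomem-style text for regionN resources that
have no daxX.Y resource nested under them. It assumes the usual
two-space-per-level indentation of /proc/iomem; the helper names and the
SAMPLE text (the listing above with the region1 children removed) are
hypothetical and for illustration only.

```python
import re

# One /proc/iomem line: optional indent, "start-end : name" (hex addresses).
LINE_RE = re.compile(r"^(\s*)([0-9a-fA-F]+)-([0-9a-fA-F]+) : (.+)$")
REGION_RE = re.compile(r"region\d+")
DAX_RE = re.compile(r"dax\d+\.\d+")


def parse_iomem(text):
    """Return (depth, start, end, name) tuples; depth assumes 2-space indents."""
    entries = []
    for line in text.splitlines():
        m = LINE_RE.match(line)
        if m:
            indent, start, end, name = m.groups()
            entries.append(
                (len(indent) // 2, int(start, 16), int(end, 16), name.strip())
            )
    return entries


def regions_without_dax(entries):
    """Names of regionN resources with no daxX.Y resource nested below them."""
    missing = []
    for i, (depth, _start, _end, name) in enumerate(entries):
        if not REGION_RE.fullmatch(name):
            continue
        has_dax = False
        for child_depth, _s, _e, child_name in entries[i + 1:]:
            if child_depth <= depth:
                break  # walked out of this region's subtree
            if DAX_RE.fullmatch(child_name):
                has_dax = True
                break
        if not has_dax:
            missing.append(name)
    return missing


# Hypothetical input: region0 is fully surfaced, region1 stops at the region.
SAMPLE = """\
c050000000-fcefffffff : Soft Reserved
  c050000000-fcefffffff : CXL Window 0
    c050000000-fcefffffff : region0
      c050000000-fcefffffff : dax0.0
        c050000000-fcefffffff : System RAM (kmem)
fcf0000000-ffffffffff : Reserved
10000000000-1035fffffff : Soft Reserved
  10000000000-1035fffffff : CXL Window 1
    10000000000-1035fffffff : region1
"""

if __name__ == "__main__":
    print(regions_without_dax(parse_iomem(SAMPLE)))
```

On a live system the same functions could be fed the contents of
/proc/iomem instead of SAMPLE.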
> I am otherwise open to suggestions about a better model for how to
> handle a type of memory capacity that elicits diverging opinions on
> whether it should be treated as System RAM, dedicated application
> memory, or some kind of cold-memory swap target.
>
My gut tells me there's no "elegant solution" here given that user
intent is fairly unknowable - i.e. the best we can do is make the build
and boot options easier to understand.
> > ---------------------------------------------------------------
> > Step 6: DAX surfacing Memory Blocks - First bit of User Policy.
> > ---------------------------------------------------------------
> >
> > The last step in surfacing memory to allocators is to convert a dax
> > device into memory blocks. On most default kernel builds, dax devices
> > are not automatically converted to SystemRAM.
>
> I thought most distributions are shipping with
> CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE, or the default online udev rule?
> For example Fedora is CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE=y and RHEL is
> CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE=n, but with the udev hotplug rule.
>
Good point, my biased take is showing up in the notes here. I didn't
know RHEL had gotten as far as a udev rule already. I'll adjust my notes.

But this hides some nuance as well - with DEFAULT_ONLINE, the default
behavior onlines memory into ZONE_NORMAL (next section).
> > Alternatively, this can be done at Build or Boot time using
> > CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE (v6.13 or below)
> > CONFIG_MHP_DEFAULT_ONLINE_TYPE_* (v6.14 or above)
> > memhp_default_state=* (boot param predating cxl)
>
> Oh, TIL the new CONFIG_MHP_DEFAULT_ONLINE_TYPE_* option.
>
It was only just added:
https://lore.kernel.org/linux-mm/20241226182918.648799-1-gourry@xxxxxxxxxx/
Basically, it creates parity between the memhp_default_state boot
parameter and the build options.
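
A minimal sketch of how one might check which policy is in effect: it
pulls memhp_default_state= out of a kernel command line string and reads
the runtime value from /sys/devices/system/memory/auto_online_blocks.
The function names are hypothetical, and picking the last valid
occurrence on the command line is an assumption of this sketch, not a
claim about how the kernel parses the parameter.

```python
from pathlib import Path

# Values accepted by memhp_default_state= and auto_online_blocks.
VALID_STATES = ("offline", "online", "online_kernel", "online_movable")

AUTO_ONLINE = Path("/sys/devices/system/memory/auto_online_blocks")


def memhp_default_state(cmdline):
    """Return the last valid memhp_default_state= value in cmdline, or None.

    Assumption: unknown values are skipped and the last valid one wins.
    """
    state = None
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        if key == "memhp_default_state" and sep and value in VALID_STATES:
            state = value
    return state


def current_policy(path=AUTO_ONLINE):
    """Return the runtime auto-online policy, or None if it cannot be read."""
    try:
        return path.read_text().strip()
    except OSError:
        return None


if __name__ == "__main__":
    try:
        cmdline = Path("/proc/cmdline").read_text()
    except OSError:
        cmdline = ""
    print("boot param:", memhp_default_state(cmdline))
    print("runtime   :", current_policy())
```

If the boot parameter is absent, the runtime value reflects whatever the
build-time default (or a later write to the sysfs file) selected.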
> > The base is 256MB aligned (the minimum for the CXL Spec), and the
> > window size is 512MB. This results in a loss of almost a full memory
> > block worth of memory (~1280MB on the front, and ~512MB on the back).
> >
> > This is a loss of ~0.7% of capacity (1.5GB) for that region (121.25GB).
>
> This feels like an example, of "hey platform vendors, I understand
> that spec grants you the freedom to misalign, please refrain from taking
> advantage of that freedom".
>
Only x86 appears to actually do this (presently) - so is this a real
constraint or just a quirk of how the x86 arch code has chosen to
"optimize memory block size"?

Granted I'm a platform consumer, not a vendor - but I wouldn't even know
where to look to see where this constraint is defined (if it is).

All I'd know is "CXL Says I can align to 256MB, and minimum memory block
size on linux is 256MB so allons-y!"

On the linux side - these platforms are now out there, in the wild.
So the surface impression now appears to be that linux just throws
away ~0.5% of your CXL capacity for no reason on these platforms.
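
To make the arithmetic concrete, here is a small sketch that computes
how much of a physical range falls outside whole memory blocks at the
front and back. It assumes only fully covered blocks can be onlined, and
the 2GiB block size in the example is an assumption (the real value on a
given system is in /sys/devices/system/memory/block_size_bytes). The
function name is hypothetical; the range is the region0 window from the
/proc/iomem listing earlier in this mail.

```python
def alignment_loss(start, end_inclusive, block_size):
    """Return (front, back) bytes of a range that lie outside whole blocks.

    start/end_inclusive follow the /proc/iomem convention (inclusive end).
    Assumption: only blocks fully contained in the range are usable.
    """
    end = end_inclusive + 1                              # exclusive end
    first_block = -(-start // block_size) * block_size   # round start up
    last_block = (end // block_size) * block_size        # round end down
    if last_block <= first_block:
        return end - start, 0                            # no whole block fits
    return first_block - start, end - last_block


if __name__ == "__main__":
    MiB = 1 << 20
    GiB = 1 << 30
    start, end = 0xC050000000, 0xFCEFFFFFFF              # region0 from the listing
    size = end + 1 - start
    for block in (256 * MiB, 2 * GiB):                   # 2 GiB is an assumed size
        front, back = alignment_loss(start, end, block)
        lost = front + back
        print(f"block={block // MiB}MiB front={front // MiB}MiB "
              f"back={back // MiB}MiB lost={100 * lost / size:.2f}%")
```

With 256MiB blocks this range loses nothing, since both ends are 256MiB
aligned; the loss only appears once the block size grows past the
alignment of the window.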

That said, I also understand that more memory blocks might affect
allocation performance when the system is under pressure - but losing
gigabytes of memory can also reduce performance.
(Preview of one of my next nuance additions in section 3)

If this (advisement) change is unwelcome, then we should be spewing
a really loud warning somewhere so vendors get signal for consumers.

~Gregory