Re: [PATCH 1/2] x86/numa: Introduce numa_fill_memblks()
From: Dan Williams
Date: Thu May 18 2023 - 21:56:19 EST
Dave Hansen wrote:
> On 5/18/23 17:26, Alison Schofield wrote:
> > On Thu, May 18, 2023 at 05:08:16PM -0700, Dave Hansen wrote:
> >> On 5/18/23 17:04, alison.schofield@xxxxxxxxx wrote:
> >>> The initial use case is the ACPI driver that needs to extend
> >>> SRAT defined proximity domains to an entire CXL CFMWS Window[1].
> >>
> >> Dumb question time: Why didn't the SRAT just cover this sucker in the
> >> first place? Are we fixing up a BIOS bug or is there a legitimate
> >> reason that the SRAT didn't cover it up front?
> >>
> > There is no requirement that the BIOS describe (in the SRAT) all the
> > HPA assigned to a CFMWS Window. The HPA range may not actually map to
> > any memory at boot time. It can be persistent capacity or may be there
> > to enable hot-plug. IIUC BIOS can pick and choose and define volatile
> > regions wherever it pleases.
>
> I understand that it _can_ do this. I'm trying to get to the reasoning
> of why.
>
> Is this essentially so that the physical address space doesn't have to
> be *committed* to a single use up front? For RAM, I guess this wasn't a
> problem because there was only a finite amount of RAM that could get
> hotplugged into a single node.

Right, for RAM the hotplug degrees of freedom were predetermined by the
platform definition.

> But with these fancy schmancy new devices, it's really hard to figure
> out how much space will show up and what performance it will have until
> you actually start poking at it.

It's less "until you actually start poking at it" and more that the BIOS
simply declines to poke at some CXL topologies at boot, and does not poke
at them post-boot.

> The firmware wasn't _quite_ sure how
> it wanted to burn the physical address space at the time the SRAT was
> created. But, now it knows, and this is handling the case where the
> firmware only expands an adjacent chunk of physical address space.

For devices that are present at boot, the BIOS mostly does the right
thing: it maps them into the EFI memory map and produces all the other
ACPI collateral. For devices that are added after boot, or devices that
fall outside of a configuration the BIOS is prepared to handle, it just
creates a CXL Window with empty capacity and says "OS, you take it from
here. Here's some physical address space where you can map things, good
luck!"

Compare that to ACPI hotplug, where the platform knows about a
preconfigured amount of memory that might come online later, and can
produce all the relevant ACPI collateral upfront.

In other forums I have advocated against the SRAT covering the unmapped
capacity of a CXL window because of the lies that firmware would need to
convey in the HMAT and SLIT for those empty proximity domains. The CXL
specification provides an architectural way to get all the information
about a memory range that previously had to be packaged up into an ACPI
table.
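
To make the use case from the patch description concrete (extending
SRAT-defined proximity domains to cover an entire CFMWS Window), here is a
minimal C sketch of the gap-filling idea. It is hypothetical and
simplified: struct memblk, fill_memblks_sketch() and MAX_SKETCH_BLKS are
illustrative names, not the patch's code or the kernel's numa_meminfo
API. The sketch only moves range boundaries so that the blocks overlapping
the window cover all of it; node ids are left untouched, so each gap ends
up owned by an adjacent block's node.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical cap on how many blocks the sketch considers. */
#define MAX_SKETCH_BLKS 64

/* Hypothetical stand-in for a NUMA memory block: [start, end) on node nid. */
struct memblk {
	uint64_t start;
	uint64_t end;
	int nid;
};

/* qsort() comparator: order block pointers by ascending start address. */
static int cmp_start(const void *a, const void *b)
{
	const struct memblk *x = *(struct memblk *const *)a;
	const struct memblk *y = *(struct memblk *const *)b;

	return (x->start > y->start) - (x->start < y->start);
}

/*
 * Illustrative only: stretch the blocks that overlap [start, end) so that
 * together they cover the whole range. The lowest block is extended down
 * to start, each block is extended up to the next block's start, and the
 * highest-starting block is extended up to end. Returns the number of
 * blocks considered, or -1 if no block overlaps the range.
 */
static int fill_memblks_sketch(struct memblk *blks, size_t n,
			       uint64_t start, uint64_t end)
{
	struct memblk *hits[MAX_SKETCH_BLKS];
	size_t count = 0;

	assert(start < end);
	assert(n <= MAX_SKETCH_BLKS);

	/* Collect every block that overlaps the window. */
	for (size_t i = 0; i < n; i++)
		if (blks[i].start < end && blks[i].end > start)
			hits[count++] = &blks[i];
	if (count == 0)
		return -1;

	qsort(hits, count, sizeof(hits[0]), cmp_start);

	/* Cover the head of the window. */
	if (hits[0]->start > start)
		hits[0]->start = start;
	/* Close each gap by extending the lower block to the next start. */
	for (size_t i = 0; i + 1 < count; i++)
		if (hits[i]->end < hits[i + 1]->start)
			hits[i]->end = hits[i + 1]->start;
	/* Cover the tail of the window. */
	if (hits[count - 1]->end < end)
		hits[count - 1]->end = end;

	return (int)count;
}
```

For example, with blocks at [0x1000, 0x2000) and [0x4000, 0x5000) and a
window of [0x0800, 0x6000), the first block becomes [0x0800, 0x4000) and
the second [0x4000, 0x6000), while a block outside the window is left
alone.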