Re: [PATCH 2/2 v6] mm/mempolicy: Don't create weight sysfs for memoryless nodes

From: Gregory Price
Date: Tue Mar 11 2025 - 00:43:04 EST

Next message: Naman Jain: "Re: [PATCH] uio_hv_generic: Fix sysfs creation path for ring buffer"
Previous message: Al Viro: "Re: Catching use-after-free easily in linux kernel."
In reply to: Yunjeong Mun: "Re: [PATCH 2/2 v6] mm/mempolicy: Don't create weight sysfs for memoryless nodes"
Next in thread: Yunjeong Mun: "Re: [PATCH 2/2 v6] mm/mempolicy: Don't create weight sysfs for memoryless nodes"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

On Tue, Mar 11, 2025 at 01:02:07PM +0900, Yunjeong Mun wrote:

forenote - Hi Andrew, please hold off on the auto-configuration patch
for now, the sk group has identified a hotplug issue we need to work out
and we'll likely need to merge these two patch set together. I really
appreciate your patience with this feature.

> Hi Gregory,
>
> In my understanding, the reason we are seeing 12 NUMA node is because
> it loops through node_states[N_POSSIBLE] and its value is 4095 (twelves ones)
> in the code [1] below:
>
... snip ...

Appreciated, so yes this confirms what i thought was going on. There's
4 host bridges, 2 devices on each host bridge, and an extra CFMWS per
socket that is intended to interleave across the host bridges.

As you mention below, the code in acpi/numa/srat.c will create 1 NUMA
node per SRAT Memory Affinity Entry - and then also 1 NUMA node per
CFMWS that doesn't have a matching SRAT entry (with a known corner case
for a missing SRAT which doesn't apply here).

So essentialy what the system is doing is marking that it's absolutely
possible to create 1 region per device and also 1 region that
interleaves across host each pair of host bridges (I presume this is a
dual socket system?).

So, tl;dr: All these nodes are valid and this configuration is correct.

Weighted interleave presently works fine as intended, but with the
inclusion of the auto-configuration, there will be issues for your
system configuration. This means we probably need to consider
merging these as a group.

During boot, the following will occur

1) drivers/acpi/numa/srat.c marks 12 nodes as possible
0-1) Socket nodes
2-3) Cross-host-bridge interleave nodes
4-11) single region nodes

2) drivers/cxl/* will probe the various devices and create
a root decoder for each CXL Fixed Memory Window
decoder0.0 - decoder11.0 (or maybe decoder0.0 - decoder0.11)

3) during probe auto-configuration of wieghted interleave occurs as a
result of this code being called with hmat or cdat data:

void node_set_perf_attrs() {
...
/* When setting CPU access coordinates, update mempolicy */
if (access == ACCESS_COORDINATE_CPU) {
if (mempolicy_set_node_perf(nid, coord)) {
pr_info("failed to set mempolicy attrs for node %d\n",
nid);
}
}
...
}

under the current system, since we calculate with N_POSSIBLE, all nodes
will be assigned weights (assuming HMAT or CDAT data is available for
all of them).

We actually have a few issues here

1) If all nodes are included in the weighting reduction, we're actually
over-representing a particular set of hardware. The interleave node
and the individual device nodes would actually over-represent the
bandwidth available (comparative to the CPU nodes).

2) As stated on this patch line, just switching to N_MEMORY causes
issues with hotplug - where the bandwidth can be reported, but if
memory hasn't been added yet then we'll end up with wrong weights
because it wasn't included in the calculation.

3) However, not exposing the nodes because N_MEMORY isn't set yet
a) prevents pre-configuration before memory is onlined, and
b) hides the implications of hotplugging memory into a node from the
user (adding memory causes a re-weight and may affect an
interleave-all configuration).

but - i think it's reasonable that anyone using weighted-interleave is
*probably* not going to have nodes come and go. It just seems like a
corner case that isn't reasonable to spend time supporting.

So coming back around to the hotplug patch line, I do think it's
reasonable hide nodes marked !N_MEMORY, but consider two issues:

1) In auto mode, we need to re-weight on hotplug to only include
onlined nodes. This is because the reduction may be sensitive
to the available bandwidth changes.

This behavior needs to be clearly documented.

2) We need to clearly define what the weight of a node will be when
in manual mode and a node goes (memory -> no memory -> memory)
a) does it retain it's old, manually set weight?
b) does it revert to 1?

Sorry for the long email, just working through all the implications.

I think the proposed hotplug patch is a requirement for the
auto-configuration patch set.

~Gregory

Next message: Naman Jain: "Re: [PATCH] uio_hv_generic: Fix sysfs creation path for ring buffer"
Previous message: Al Viro: "Re: Catching use-after-free easily in linux kernel."
In reply to: Yunjeong Mun: "Re: [PATCH 2/2 v6] mm/mempolicy: Don't create weight sysfs for memoryless nodes"
Next in thread: Yunjeong Mun: "Re: [PATCH 2/2 v6] mm/mempolicy: Don't create weight sysfs for memoryless nodes"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]