Re: [PATCH v2 2/4] mm/mempolicy: Support memory hotplug in weighted interleave

From: Rakie Kim
Date: Thu Mar 13 2025 - 02:34:10 EST


On Wed, 12 Mar 2025 12:03:01 -0400 Gregory Price <gourry@xxxxxxxxxx> wrote:
> On Wed, Mar 12, 2025 at 04:56:25PM +0900, Rakie Kim wrote:
> > The weighted interleave policy distributes page allocations across multiple
> > NUMA nodes based on their performance weight, thereby optimizing memory
> > bandwidth utilization. The weight values for each node are configured
> > through sysfs.
> >
> > Previously, the sysfs entries for configuring weighted interleave were only
> > created during initialization. This approach had several limitations:
> > - Sysfs entries were generated for all possible nodes at boot time,
> > including nodes without memory, leading to unnecessary sysfs creation.
>
> It's not that it's unnecessary, it's that it allowed for configuration
> of nodes which may not have memory now but may have memory in the
> future. This was not well documented.

I will update the commit message to reflect your feedback.

>
> > - Some memory devices transition to an online state after initialization,
> > but the existing implementation failed to create sysfs entries for
> > these dynamically added nodes. As a result, memory hotplugged nodes
> > were not properly recognized by the weighed interleave mechanism.
> >
>
> The current system creates 1 node per N_POSSIBLE nodes, and since nodes
> can't transition between possible and not-possible your claims here are
> contradictory.
>
> I think you mean that simply switching from N_POSSIBLE to N_MEMORY is
> insufficient since nodes may transition in and out of the N_MEMORY
> state. Therefore this patch utilizes a hotplug callback to add and
> remove sysfs entries based on whether a node is in the N_MEMORY set.

I will update the commit message to reflect your feedback.

>
> > To resolve these issues, this patch introduces two key improvements:
> > 1) At initialization, only nodes that are online and have memory are
> > recognized, preventing the creation of unnecessary sysfs entries.
> > 2) Nodes that become available after initialization are dynamically
> > detected and integrated through the memory hotplug mechanism.
> >
> > With this enhancement, the weighted interleave policy now properly supports
> > memory hotplug, ensuring that newly added nodes are recognized and sysfs
> > entries are created accordingly.
> >
>
> It doesn't "support memory hotplug" so much as it "Minimizes weighted
> interleave to exclude memoryless nodes".

I will update the commit message to reflect your feedback.

>
> > Signed-off-by: Rakie Kim <rakie.kim@xxxxxx>
> > ---
> > mm/mempolicy.c | 47 ++++++++++++++++++++++++++++++++++++++++++-----
> > 1 file changed, 42 insertions(+), 5 deletions(-)
> >
> > diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> > index 1691748badb2..94efff89e0be 100644
> > --- a/mm/mempolicy.c
> > +++ b/mm/mempolicy.c
> > @@ -113,6 +113,7 @@
> > #include <asm/tlbflush.h>
> > #include <asm/tlb.h>
> > #include <linux/uaccess.h>
> > +#include <linux/memory.h>
> >
> > #include "internal.h"
> >
> > @@ -3489,9 +3490,38 @@ static int add_weight_node(int nid, struct kobject *wi_kobj)
> > return 0;
> > }
> >
> > +struct kobject *wi_kobj;
> > +
> > +static int wi_node_notifier(struct notifier_block *nb,
> > + unsigned long action, void *data)
> > +{
> > + int err;
> > + struct memory_notify *arg = data;
> > + int nid = arg->status_change_nid;
> > +
> > + if (nid < 0)
> > + goto notifier_end;
> > +
> > + switch(action) {
> > + case MEM_ONLINE:
> > + err = add_weight_node(nid, wi_kobj);
> > + if (err) {
> > + pr_err("failed to add sysfs [node%d]\n", nid);
> > + kobject_put(wi_kobj);
> > + return NOTIFY_BAD;
> > + }
> > + break;
> > + case MEM_OFFLINE:
> > + sysfs_wi_node_release(node_attrs[nid], wi_kobj);
> > + break;
> > + }
>
> I'm fairly certain this logic is wrong. If I add two memory blocks and
> then remove one, would this logic not remove the sysfs entries despite
> there being a block remaining?

Regarding the assumption about node configuration:
Are you assuming that a node has two memory blocks and that
MEM_OFFLINE is triggered when one of them is offlined? If so, then
you are correct that this logic would need modification.

I performed a simple test by offlining a single memory block:
# echo 0 > /sys/devices/system/node/node2/memory100/online

In this case, MEM_OFFLINE was not triggered. However, I need to
conduct further analysis to confirm this behavior under different
conditions. I will review this in more detail and share my
findings, including the test methodology and results.

>
> > +
> > +notifier_end:
> > + return NOTIFY_OK;
> > +}
> > +
> > static int add_weighted_interleave_group(struct kobject *root_kobj)
> > {
> > - struct kobject *wi_kobj;
> > int nid, err;
> >
> > wi_kobj = kzalloc(sizeof(struct kobject), GFP_KERNEL);
> > @@ -3505,16 +3535,23 @@ static int add_weighted_interleave_group(struct kobject *root_kobj)
> > return err;
> > }
> >
> > - for_each_node_state(nid, N_POSSIBLE) {
> > + for_each_online_node(nid) {
> > + if (!node_state(nid, N_MEMORY))
>
> Rather than online node, why not just add for each N_MEMORY node -
> regardless of if its memory is online or not? If the memory is offline,
> then it will be excluded from the weighted interleave mechanism by
> nature of the node being invalid for allocations anyway.

Regarding the decision to check both N_MEMORY and N_ONLINE:
This was done to ensure consistency with the conditions under which
`wi_node_notifier` is triggered. Specifically, `MEM_ONLINE` is called
only when a node is in both the N_MEMORY and N_ONLINE states.

I will review this logic further. If my understanding is correct,
keeping the current implementation is the appropriate approach.
However, I will conduct additional testing to validate this and
provide further updates accordingly.

>
> > + continue;
> > +
> > err = add_weight_node(nid, wi_kobj);
> > if (err) {
> > pr_err("failed to add sysfs [node%d]\n", nid);
> > - break;
> > + goto err_out;
> > }
> > }
> > - if (err)
> > - kobject_put(wi_kobj);
> > +
> > + hotplug_memory_notifier(wi_node_notifier, DEFAULT_CALLBACK_PRI);
> > return 0;
> > +
> > +err_out:
> > + kobject_put(wi_kobj);
> > + return err;
> > }
> >
> > static void mempolicy_kobj_release(struct kobject *kobj)
> > --
> > 2.34.1
> >