Re: [PATCH v3 3/3] mm/mempolicy: Support memory hotplug in weighted interleave
From: Rakie Kim
Date: Mon Mar 24 2025 - 04:49:39 EST
On Fri, 21 Mar 2025 10:24:46 -0400 Gregory Price <gourry@xxxxxxxxxx> wrote:
> On Thu, Mar 20, 2025 at 01:17:48PM +0900, Rakie Kim wrote:
> ... snip ...
> > + mutex_lock(&sgrp->kobj_lock);
> > + if (sgrp->nattrs[nid]) {
> > + mutex_unlock(&sgrp->kobj_lock);
> > + pr_info("Node [%d] already exists\n", nid);
> > + kfree(new_attr);
> > + kfree(name);
> > + return 0;
> > + }
> >
> > - if (sysfs_create_file(&sgrp->wi_kobj, &node_attr->kobj_attr.attr)) {
> > - kfree(node_attr->kobj_attr.attr.name);
> > - kfree(node_attr);
> > - pr_err("failed to add attribute to weighted_interleave\n");
> > - return -ENOMEM;
> > + sgrp->nattrs[nid] = new_attr;
> > + mutex_unlock(&sgrp->kobj_lock);
> > +
> > + sysfs_attr_init(&sgrp->nattrs[nid]->kobj_attr.attr);
> > + sgrp->nattrs[nid]->kobj_attr.attr.name = name;
> > + sgrp->nattrs[nid]->kobj_attr.attr.mode = 0644;
> > + sgrp->nattrs[nid]->kobj_attr.show = node_show;
> > + sgrp->nattrs[nid]->kobj_attr.store = node_store;
> > + sgrp->nattrs[nid]->nid = nid;
>
> These accesses need to be inside the lock as well. Probably we can't
> get here concurrently, but I can't say so definitively that I'm
> comfortable blind-accessing it outside the lock.
You're right, and I appreciate you pointing it out. Applying your
suggestion is straightforward, so I plan to update the code as follows:
	sgrp->nattrs[nid] = new_attr;
	sysfs_attr_init(&sgrp->nattrs[nid]->kobj_attr.attr);
	sgrp->nattrs[nid]->kobj_attr.attr.name = name;
	sgrp->nattrs[nid]->kobj_attr.attr.mode = 0644;
	sgrp->nattrs[nid]->kobj_attr.show = node_show;
	sgrp->nattrs[nid]->kobj_attr.store = node_store;
	sgrp->nattrs[nid]->nid = nid;

	ret = sysfs_create_file(&sgrp->wi_kobj,
				&sgrp->nattrs[nid]->kobj_attr.attr);
	if (ret) {
		mutex_unlock(&sgrp->kobj_lock);
		...
	}
	mutex_unlock(&sgrp->kobj_lock);
>
> > +static int wi_node_notifier(struct notifier_block *nb,
> > + unsigned long action, void *data)
> > +{
> ... snip ...
> > + case MEM_OFFLINE:
> > + sysfs_wi_node_release(nid);
>
> I'm still not convinced this is correct. `offline_pages()` says this:
>
> /*
> * {on,off}lining is constrained to full memory sections (or more
> * precisely to memory blocks from the user space POV).
> */
>
> And that is the function calling:
> memory_notify(MEM_OFFLINE, &arg);
>
> David pointed out that this should be called when offlining each memory
> block. This is not the same as simply doing `echo 0 > online`, you need
> to remove the dax device associated with the memory.
>
> For example:
>
> node1
> / \
> dax0.0 dax1.0
> | |
> mb1 mb2
>
>
> With this code, if I `daxctl reconfigure-device devmem dax0.0` it will
> remove the first memory block, causing MEM_OFFLINE event to fire and
> removing the node - despite the fact that dax1.0 is still present.
>
> This matters for systems with memory holes in CXL hotplug memory and
> also for systems with Dynamic Capacity Devices surfacing capacity as
> separate dax devices.
>
> ~Gregory
A node loses its `N_MEMORY` state only once all of its memory blocks have
been offlined, and that happens before the notifier callback is invoked.
So the MEM_OFFLINE handler can check the node's `N_MEMORY` state and skip
the release while any memory block still backs the node; in your example,
offlining dax0.0 would leave `N_MEMORY` set because dax1.0 is still
present, and the sysfs entry would survive. This should avoid the issue
you mentioned.
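Concretely, the MEM_OFFLINE case could look something like the sketch
below. This is illustrative only, not the final patch: the nid extraction
and sysfs_wi_node_release() follow the earlier hunk, and the
node_state() check is the new part.

	case MEM_OFFLINE:
		/*
		 * Only drop the sysfs entry once the node has no memory
		 * left. Offlining a single memory block does not clear
		 * N_MEMORY while other blocks on the node remain online,
		 * so removing one dax device leaves the entry intact.
		 */
		if (!node_state(nid, N_MEMORY))
			sysfs_wi_node_release(nid);
		break;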
Please let me know your thoughts on this approach.
Rakie