Re: [PATCH v2 1/3] mm,slub: Do not special case N_NORMAL nodes for slab_nodes

From: Harry Yoo
Date: Tue Apr 08 2025 - 10:59:03 EST


On Tue, Apr 08, 2025 at 04:25:33PM +0200, David Hildenbrand wrote:
> On 08.04.25 16:18, Harry Yoo wrote:
> > On Tue, Apr 08, 2025 at 12:17:52PM +0200, David Hildenbrand wrote:
> > > On 08.04.25 10:41, Oscar Salvador wrote:
> > > > Currently, slab_mem_going_online_callback() checks whether the node has
> > > > N_NORMAL_MEMORY in order to be set in slab_nodes.
> > > > While it is true that getting rid of that enforcement would mean
> > > > ending up with movable nodes in slab_nodes, the memory waste that comes
> > > > with that is negligible.
> > > >
> > > > So stop checking for status_change_nid_normal and just use status_change_nid
> > > > instead, which works for both types of memory.
> > > >
> > > > Also, once we allocate the kmem_cache_node structures for the node in
> > > > slab_mem_going_online_callback(), we never deallocate them in
> > > > slab_mem_offline_callback() when the node goes memoryless, so we can just
> > > > get rid of that callback.
> > > >
> > > > The only side effect is that we will stop clearing the node from slab_nodes.
> > > >
> > >
> > > Feel free to add a Suggested-by: if you think it applies.
> > >
> > >
> > > Do we have to take care of the N_NORMAL_MEMORY check in kmem_cache_init()? Likely it
> > > would have to be an N_MEMORY check.
> > >
> > >
> > > But, I was wondering if we could get rid of the "slab_nodes" thingy as a first step?
> >
> > The following commit says that SLUB has the slab_nodes thingy for a reason...
> > a cache's kmem_cache_node might not be ready yet even when the N_NORMAL_MEMORY
> > check says the node now has normal memory.
>
> node_states_set_node() is called from memory hotplug code after
> MEM_GOING_ONLINE and after online_pages_range().
>
> Pages might be isolated at that point, but node_states_set_node() is called
> only after the memory notifier (MEM_GOING_ONLINE) was triggered.

Oh right, didn't realize that. Thanks.
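
FWIW, my (possibly simplified) reading of the ordering in online_pages()
is roughly:

	memory_notify(MEM_GOING_ONLINE, &arg);	/* SLUB allocates kmem_cache_node here */
	...
	online_pages_range(pfn, nr_pages);
	node_states_set_node(nid, &arg);	/* N_NORMAL_MEMORY / N_MEMORY set only now */
	...
	memory_notify(MEM_ONLINE, &arg);

so by the time node_state(node, N_NORMAL_MEMORY) returns true, the
MEM_GOING_ONLINE callback has already run for that node.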

> So I don't immediately see the problem assuming that we never free the
> structures.
>
> But yeah, this is what I raised below: "Not sure if there are any races to
> consider" :)

At least on the slab side I don't see any races that need to be
addressed. Hopefully I didn't overlook anything :)

> > @Vlastimil maybe a dumb question, but why not check s->node[nid]
> > instead of having the slab_nodes bitmask?
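
(To spell out what I meant: something like the check below in
___slab_alloc() instead of the nodemask lookup -- completely untested,
just to illustrate the question.)

	if (unlikely(node != NUMA_NO_NODE && !get_node(s, node)))
		node = NUMA_NO_NODE;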
> >
> > commit 7e1fa93deff44677a94dfc323ff629bbf5cf9360
> > Author: Vlastimil Babka <vbabka@xxxxxxx>
> > Date: Wed Feb 24 12:01:12 2021 -0800
> >
> > mm, slab, slub: stop taking memory hotplug lock
> >
> > Since commit 03afc0e25f7f ("slab: get_online_mems for
> > kmem_cache_{create,destroy,shrink}") we are taking memory hotplug lock for
> > SLAB and SLUB when creating, destroying or shrinking a cache. It is quite
> > a heavy lock and it's best to avoid it if possible, as we had several
> > issues with lockdep complaining about ordering in the past, see e.g.
> > e4f8e513c3d3 ("mm/slub: fix a deadlock in show_slab_objects()").
> >
> > The problem scenario in 03afc0e25f7f (solved by the memory hotplug lock)
> > can be summarized as follows: while there's slab_mutex synchronizing new
> > kmem cache creation and SLUB's MEM_GOING_ONLINE callback
> > slab_mem_going_online_callback(), we may miss creation of kmem_cache_node
> > for the hotplugged node in the new kmem cache, because the hotplug
> > callback doesn't yet see the new cache, and cache creation in
> > init_kmem_cache_nodes() only inits kmem_cache_node for nodes in the
> > N_NORMAL_MEMORY nodemask, which however may not yet include the new node,
> > as that happens only later after the MEM_GOING_ONLINE callback.
> >
> > Instead of using get/put_online_mems(), the problem can be solved by SLUB
> > maintaining its own nodemask of nodes for which it has allocated the
> > per-node kmem_cache_node structures. This nodemask would generally mirror
> > the N_NORMAL_MEMORY nodemask, but would be updated only under SLUB's
> > control in its memory hotplug callbacks under the slab_mutex. This patch
> > adds such nodemask and its handling.
> >
> > Commit 03afc0e25f7f mentions "issues like [the one above]", but there
> > don't appear to be further issues. All the paths (shared for SLAB and
> > SLUB) taking the memory hotplug locks are also taking the slab_mutex,
> > except kmem_cache_shrink() where 03afc0e25f7f replaced slab_mutex with
> > get/put_online_mems().
> >
> > We however cannot simply restore slab_mutex in kmem_cache_shrink(), as
> > SLUB can enter the function from a write to sysfs 'shrink' file, thus
> > holding kernfs lock, and in kmem_cache_create() the kernfs lock is nested
> > within slab_mutex. But on closer inspection we don't actually need to
> > protect kmem_cache_shrink() from hotplug callbacks: While SLUB's
> > __kmem_cache_shrink() does for_each_kmem_cache_node(), missing a new node
> > added in parallel hotplug is not fatal, and parallel hotremove does not
> > free kmem_cache_node's anymore after the previous patch, so use-after-free
> > cannot happen. The per-node shrinking itself is protected by
> > n->list_lock. Same is true for SLAB, and SLOB is a no-op.
> >
> > SLAB also doesn't need the memory hotplug locking, which it only gained by
> > 03afc0e25f7f through the shared paths in slab_common.c. Its memory
> > hotplug callbacks are also protected by slab_mutex against races with
> > these paths. The problem of SLUB relying on N_NORMAL_MEMORY doesn't apply
> > to SLAB, as its setup_kmem_cache_nodes relies on N_ONLINE, and the new
> > node is already set there during the MEM_GOING_ONLINE callback, so no
> > special care is needed for SLAB.
> >
> > As such, this patch removes all get/put_online_mems() usage by the slab
> > subsystem.
> >
> > Link: https://lkml.kernel.org/r/20210113131634.3671-3-vbabka@xxxxxxx
> > Signed-off-by: Vlastimil Babka <vbabka@xxxxxxx>
> > Cc: Christoph Lameter <cl@xxxxxxxxx>
> > Cc: David Hildenbrand <david@xxxxxxxxxx>
> > Cc: David Rientjes <rientjes@xxxxxxxxxx>
> > Cc: Joonsoo Kim <iamjoonsoo.kim@xxxxxxx>
> > Cc: Michal Hocko <mhocko@xxxxxxxxxx>
> > Cc: Pekka Enberg <penberg@xxxxxxxxxx>
> > Cc: Qian Cai <cai@xxxxxxxxxx>
> > Cc: Vladimir Davydov <vdavydov.dev@xxxxxxxxx>
> > Signed-off-by: Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx>
> > Signed-off-by: Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx>
> >
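
(For context, my reading of the race that commit closed, roughly:

    kmem_cache_create()                    memory hotplug
    -------------------                    --------------
                                           memory_notify(MEM_GOING_ONLINE)
                                             slab_mem_going_online_callback()
                                             only sees caches already on the
                                             slab_caches list
    mutex_lock(&slab_mutex);
    init_kmem_cache_nodes()
      iterates N_NORMAL_MEMORY, which
      does not yet include the new node
    mutex_unlock(&slab_mutex);
                                           node_states_set_node()
                                             new node now in N_NORMAL_MEMORY,
                                             but the new cache has no
                                             kmem_cache_node for it

so the nodemask was added so that cache creation and the hotplug callback
agree, under slab_mutex, on which nodes have kmem_cache_node allocated.)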
> > >
> > > From 518a2b83a9c5bd85d74ddabbc36ce5d181a88ed6 Mon Sep 17 00:00:00 2001
> > > From: David Hildenbrand <david@xxxxxxxxxx>
> > > Date: Tue, 8 Apr 2025 12:16:13 +0200
> > > Subject: [PATCH] tmp
> > >
> > > Signed-off-by: David Hildenbrand <david@xxxxxxxxxx>
> > > ---
> > > mm/slub.c | 56 ++++---------------------------------------------------
> > > 1 file changed, 4 insertions(+), 52 deletions(-)
> > >
> > > diff --git a/mm/slub.c b/mm/slub.c
> > > index b46f87662e71d..afe31149e7f4e 100644
> > > --- a/mm/slub.c
> > > +++ b/mm/slub.c
> > > @@ -445,14 +445,6 @@ static inline struct kmem_cache_node *get_node(struct kmem_cache *s, int node)
> > > for (__node = 0; __node < nr_node_ids; __node++) \
> > > if ((__n = get_node(__s, __node)))
> > > -/*
> > > - * Tracks for which NUMA nodes we have kmem_cache_nodes allocated.
> > > - * Corresponds to node_state[N_NORMAL_MEMORY], but can temporarily
> > > - * differ during memory hotplug/hotremove operations.
> > > - * Protected by slab_mutex.
> > > - */
> > > -static nodemask_t slab_nodes;
> > > -
> > > #ifndef CONFIG_SLUB_TINY
> > > /*
> > > * Workqueue used for flush_cpu_slab().
> > > @@ -3706,10 +3698,9 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
> > > if (!slab) {
> > > /*
> > > * if the node is not online or has no normal memory, just
> > > - * ignore the node constraint
> > > + * ignore the node constraint.
> > > */
> > > - if (unlikely(node != NUMA_NO_NODE &&
> > > - !node_isset(node, slab_nodes)))
> > > + if (unlikely(node != NUMA_NO_NODE && !node_state(node, N_NORMAL_MEMORY)))
> > > node = NUMA_NO_NODE;
> > > goto new_slab;
> > > }
> > > @@ -3719,7 +3710,7 @@ static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
> > > * same as above but node_match() being false already
> > > * implies node != NUMA_NO_NODE
> > > */
> > > - if (!node_isset(node, slab_nodes)) {
> > > + if (!node_state(node, N_NORMAL_MEMORY)) {
> > > node = NUMA_NO_NODE;
> > > } else {
> > > stat(s, ALLOC_NODE_MISMATCH);
> > > @@ -5623,7 +5614,7 @@ static int init_kmem_cache_nodes(struct kmem_cache *s)
> > > {
> > > int node;
> > > - for_each_node_mask(node, slab_nodes) {
> > > + for_each_node_state(node, N_NORMAL_MEMORY) {
> > > struct kmem_cache_node *n;
> > > if (slab_state == DOWN) {
> > > @@ -6164,30 +6155,6 @@ static int slab_mem_going_offline_callback(void *arg)
> > > return 0;
> > > }
> > > -static void slab_mem_offline_callback(void *arg)
> > > -{
> > > - struct memory_notify *marg = arg;
> > > - int offline_node;
> > > -
> > > - offline_node = marg->status_change_nid_normal;
> > > -
> > > - /*
> > > - * If the node still has available memory. we need kmem_cache_node
> > > - * for it yet.
> > > - */
> > > - if (offline_node < 0)
> > > - return;
> > > -
> > > - mutex_lock(&slab_mutex);
> > > - node_clear(offline_node, slab_nodes);
> > > - /*
> > > - * We no longer free kmem_cache_node structures here, as it would be
> > > - * racy with all get_node() users, and infeasible to protect them with
> > > - * slab_mutex.
> > > - */
> > > - mutex_unlock(&slab_mutex);
> > > -}
> > > -
> > > static int slab_mem_going_online_callback(void *arg)
> > > {
> > > struct kmem_cache_node *n;
> > > @@ -6229,11 +6196,6 @@ static int slab_mem_going_online_callback(void *arg)
> > > init_kmem_cache_node(n);
> > > s->node[nid] = n;
> > > }
> > > - /*
> > > - * Any cache created after this point will also have kmem_cache_node
> > > - * initialized for the new node.
> > > - */
> > > - node_set(nid, slab_nodes);
> > > out:
> > > mutex_unlock(&slab_mutex);
> > > return ret;
> > > @@ -6253,8 +6215,6 @@ static int slab_memory_callback(struct notifier_block *self,
> > > break;
> > > case MEM_OFFLINE:
> > > case MEM_CANCEL_ONLINE:
> > > - slab_mem_offline_callback(arg);
> > > - break;
> > > case MEM_ONLINE:
> > > case MEM_CANCEL_OFFLINE:
> > > break;
> > > @@ -6309,7 +6269,6 @@ void __init kmem_cache_init(void)
> > > {
> > > static __initdata struct kmem_cache boot_kmem_cache,
> > > boot_kmem_cache_node;
> > > - int node;
> > > if (debug_guardpage_minorder())
> > > slub_max_order = 0;
> > > @@ -6321,13 +6280,6 @@ void __init kmem_cache_init(void)
> > > kmem_cache_node = &boot_kmem_cache_node;
> > > kmem_cache = &boot_kmem_cache;
> > > - /*
> > > - * Initialize the nodemask for which we will allocate per node
> > > - * structures. Here we don't need taking slab_mutex yet.
> > > - */
> > > - for_each_node_state(node, N_NORMAL_MEMORY)
> > > - node_set(node, slab_nodes);
> > > -
> > > create_boot_cache(kmem_cache_node, "kmem_cache_node",
> > > sizeof(struct kmem_cache_node),
> > > SLAB_HWCACHE_ALIGN | SLAB_NO_OBJ_EXT, 0, 0);
> > > --
> > > 2.48.1
> > >
> > >
> > > Not sure if there are any races to consider ... just an idea.
> > >
> > > --
> > > Cheers,
> > >
> > > David / dhildenb
> > >
> >
> --
> Cheers,
>
> David / dhildenb
>
>

--
Cheers,
Harry (Hyeonggon is my Korean name)