Re: [patch] slab: add memory hotplug support

From: Nick Piggin
Date: Tue Mar 09 2010 - 08:46:50 EST


On Mon, Mar 08, 2010 at 03:19:48PM -0800, David Rientjes wrote:
> On Fri, 5 Mar 2010, Nick Piggin wrote:
>
> > > +#if defined(CONFIG_NUMA) && defined(CONFIG_MEMORY_HOTPLUG)
> > > +/*
> > > + * Drains and frees nodelists for a node on each slab cache, used for memory
> > > + * hotplug. Returns -EBUSY if all objects cannot be drained on memory
> > > + * hot-remove so that the node is not removed. When used because memory
> > > + * hot-add is canceled, the only result is the freed kmem_list3.
> > > + *
> > > + * Must hold cache_chain_mutex.
> > > + */
> > > +static int __meminit free_cache_nodelists_node(int node)
> > > +{
> > > + struct kmem_cache *cachep;
> > > + int ret = 0;
> > > +
> > > + list_for_each_entry(cachep, &cache_chain, next) {
> > > + struct array_cache *shared;
> > > + struct array_cache **alien;
> > > + struct kmem_list3 *l3;
> > > +
> > > + l3 = cachep->nodelists[node];
> > > + if (!l3)
> > > + continue;
> > > +
> > > + spin_lock_irq(&l3->list_lock);
> > > + shared = l3->shared;
> > > + if (shared) {
> > > + free_block(cachep, shared->entry, shared->avail, node);
> > > + l3->shared = NULL;
> > > + }
> > > + alien = l3->alien;
> > > + l3->alien = NULL;
> > > + spin_unlock_irq(&l3->list_lock);
> > > +
> > > + if (alien) {
> > > + drain_alien_cache(cachep, alien);
> > > + free_alien_cache(alien);
> > > + }
> > > + kfree(shared);
> > > +
> > > + drain_freelist(cachep, l3, l3->free_objects);
> > > + if (!list_empty(&l3->slabs_full) ||
> > > + !list_empty(&l3->slabs_partial)) {
> > > + /*
> > > + * Continue to iterate through each slab cache to free
> > > + * as many nodelists as possible even though the
> > > + * offline will be canceled.
> > > + */
> > > + ret = -EBUSY;
> > > + continue;
> > > + }
> > > + kfree(l3);
> > > + cachep->nodelists[node] = NULL;
> >
> > What's stopping races of other CPUs trying to access l3 and array
> > caches while they're being freed?
> >
>
> numa_node_id() will not return an offlined nodeid and cache_alloc_node()
> already does a fallback to other onlined nodes in case a nodeid is passed
> to kmalloc_node() that does not have a nodelist. l3->shared and l3->alien
> cannot be accessed without l3->list_lock (drain, cache_alloc_refill,
> cache_flusharray) or cache_chain_mutex (kmem_cache_destroy, cache_reap).

Yeah, but can't it _have_ a nodelist (ie. before it is set to NULL here)
while it is being accessed by another CPU and concurrently being freed
on this one?


> > > + }
> > > + return ret;
> > > +}
> > > +
> > > +/*
> > > + * Onlines nid either as the result of memory hot-add or canceled hot-remove.
> > > + */
> > > +static int __meminit slab_node_online(int nid)
> > > +{
> > > + int ret;
> > > + mutex_lock(&cache_chain_mutex);
> > > + ret = init_cache_nodelists_node(nid);
> > > + mutex_unlock(&cache_chain_mutex);
> > > + return ret;
> > > +}
> > > +
> > > +/*
> > > + * Offlines nid either as the result of memory hot-remove or canceled hot-add.
> > > + */
> > > +static int __meminit slab_node_offline(int nid)
> > > +{
> > > + int ret;
> > > + mutex_lock(&cache_chain_mutex);
> > > + ret = free_cache_nodelists_node(nid);
> > > + mutex_unlock(&cache_chain_mutex);
> > > + return ret;
> > > +}
> > > +
> > > +static int __meminit slab_memory_callback(struct notifier_block *self,
> > > + unsigned long action, void *arg)
> > > +{
> > > + struct memory_notify *mnb = arg;
> > > + int ret = 0;
> > > + int nid;
> > > +
> > > + nid = mnb->status_change_nid;
> > > + if (nid < 0)
> > > + goto out;
> > > +
> > > + switch (action) {
> > > + case MEM_GOING_ONLINE:
> > > + case MEM_CANCEL_OFFLINE:
> > > + ret = slab_node_online(nid);
> > > + break;
> >
> > This would explode if CANCEL_OFFLINE fails. Call it theoretical and
> > put a panic() in here and I don't mind. Otherwise you get corruption
> > somewhere in the slab code.
> >
>
> MEM_CANCEL_ONLINE would only fail here if a struct kmem_list3 couldn't be
> allocated anywhere on the system and if that happens then the node simply
> couldn't be allocated from (numa_node_id() would never return it as the
> cpu's node, so it's possible to fallback in this scenario).

Why would it never return the CPU's node? It's CANCEL_OFFLINE that is
the problem.


> Instead of doing this all at MEM_GOING_OFFLINE, we could delay freeing of
> the array caches and the nodelist until MEM_OFFLINE. We're guaranteed
> that all pages are freed at that point so there are no existing objects
> that we need to track and then if the offline fails from a different
> callback it would be possible to reset the l3->nodelists[node] pointers
> since they haven't been freed yet.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/