Re: [PATCH v4 5/7] mm: rework non-root kmem_cache lifecycle management

From: Roman Gushchin
Date: Tue May 21 2019 - 15:26:53 EST


On Tue, May 21, 2019 at 02:39:50PM -0400, Waiman Long wrote:
> On 5/14/19 8:06 PM, Shakeel Butt wrote:
> >> @@ -2651,20 +2652,35 @@ struct kmem_cache *memcg_kmem_get_cache(struct kmem_cache *cachep)
> >> struct mem_cgroup *memcg;
> >> struct kmem_cache *memcg_cachep;
> >> int kmemcg_id;
> >> + struct memcg_cache_array *arr;
> >>
> >> VM_BUG_ON(!is_root_cache(cachep));
> >>
> >> if (memcg_kmem_bypass())
> >> return cachep;
> >>
> >> - memcg = get_mem_cgroup_from_current();
> >> + rcu_read_lock();
> >> +
> >> + if (unlikely(current->active_memcg))
> >> + memcg = current->active_memcg;
> >> + else
> >> + memcg = mem_cgroup_from_task(current);
> >> +
> >> + if (!memcg || memcg == root_mem_cgroup)
> >> + goto out_unlock;
> >> +
> >> kmemcg_id = READ_ONCE(memcg->kmemcg_id);
> >> if (kmemcg_id < 0)
> >> - goto out;
> >> + goto out_unlock;
> >>
> >> - memcg_cachep = cache_from_memcg_idx(cachep, kmemcg_id);
> >> - if (likely(memcg_cachep))
> >> - return memcg_cachep;
> >> + arr = rcu_dereference(cachep->memcg_params.memcg_caches);
> >> +
> >> + /*
> >> + * Make sure we will access the up-to-date value. The code updating
> >> + * memcg_caches issues a write barrier to match this (see
> >> + * memcg_create_kmem_cache()).
> >> + */
> >> + memcg_cachep = READ_ONCE(arr->entries[kmemcg_id]);
> >>
> >> /*
> >> * If we are in a safe context (can wait, and not in interrupt
> >> @@ -2677,10 +2693,20 @@ struct kmem_cache *memcg_kmem_get_cache(struct kmem_cache *cachep)
> >> * memcg_create_kmem_cache, this means no further allocation
> >> * could happen with the slab_mutex held. So it's better to
> >> * defer everything.
> >> + *
> >> + * If the memcg is dying or memcg_cache is about to be released,
> >> + * don't bother creating new kmem_caches. Because memcg_cachep
> >> + is ZEROed as the first step of kmem offlining, we don't need
> >> + * percpu_ref_tryget() here. css_tryget_online() check in
> > *percpu_ref_tryget_live()
> >
> >> + * memcg_schedule_kmem_cache_create() will prevent us from
> >> + * creation of a new kmem_cache.
> >> */
> >> - memcg_schedule_kmem_cache_create(memcg, cachep);
> >> -out:
> >> - css_put(&memcg->css);
> >> + if (unlikely(!memcg_cachep))
> >> + memcg_schedule_kmem_cache_create(memcg, cachep);
> >> + else if (percpu_ref_tryget(&memcg_cachep->memcg_params.refcnt))
> >> + cachep = memcg_cachep;
> >> +out_unlock:
> >> + rcu_read_lock();
>
> There is one more bug that causes the kernel to panic on bootup when I
> turned on debugging options.
>
> [   49.871437] =============================
> [   49.875452] WARNING: suspicious RCU usage
> [   49.879476] 5.2.0-rc1.bz1699202_memcg_test+ #2 Not tainted
> [   49.884967] -----------------------------
> [   49.888991] include/linux/rcupdate.h:268 Illegal context switch in
> RCU read-side critical section!
> [   49.897950]
> [   49.897950] other info that might help us debug this:
> [   49.897950]
> [   49.905958]
> [   49.905958] rcu_scheduler_active = 2, debug_locks = 1
> [   49.912492] 3 locks held by systemd/1:
> [   49.916252]  #0: 00000000633673c5 (&type->i_mutex_dir_key#5){.+.+},
> at: lookup_slow+0x42/0x70
> [   49.924788]  #1: 0000000029fa8c75 (rcu_read_lock){....}, at:
> memcg_kmem_get_cache+0x12b/0x910
> [   49.933316]  #2: 0000000029fa8c75 (rcu_read_lock){....}, at:
> memcg_kmem_get_cache+0x3da/0x910
>
> It should be "rcu_read_unlock();" at the end.

Oops. Good catch, thanks Waiman!

I'm somewhat surprised it didn't show up in my tests, nor did any of the
test bots catch it. Anyway, I'll fix it and send v5.
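
For the record, with Waiman's fix the exit path of memcg_kmem_get_cache()
in this patch would read roughly like this (just a sketch for illustration,
not the final v5 hunk; the trailing "return cachep;" is from the existing
function body and isn't visible in the diff context above):

	/*
	 * If there is no per-memcg cache yet, schedule its creation;
	 * otherwise take a reference on it so it can't be released
	 * while the allocation proceeds.
	 */
	if (unlikely(!memcg_cachep))
		memcg_schedule_kmem_cache_create(memcg, cachep);
	else if (percpu_ref_tryget(&memcg_cachep->memcg_params.refcnt))
		cachep = memcg_cachep;
out_unlock:
	rcu_read_unlock();
	return cachep;
}

That keeps the rcu_read_lock()/rcu_read_unlock() pair balanced, which is
exactly what the lockdep splat above is complaining about (two read-side
critical sections held across a context switch).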

Does the rest of the patchset look sane to you?

Thank you!

Roman