Re: [PATCH v2] mm, memcg: Add a memcg_slabinfo debugfs file

From: Waiman Long
Date: Thu Jun 20 2019 - 10:54:21 EST


On 6/20/19 10:39 AM, Shakeel Butt wrote:
> On Thu, Jun 20, 2019 at 7:24 AM Waiman Long <longman@xxxxxxxxxx> wrote:
>> On 6/19/19 7:48 PM, Shakeel Butt wrote:
>>> Hi Waiman,
>>>
>>> On Wed, Jun 19, 2019 at 10:16 AM Waiman Long <longman@xxxxxxxxxx> wrote:
>>>> There are concerns about memory leaks from extensive use of memory
>>>> cgroups as each memory cgroup creates its own set of kmem caches. There
>>>> is a possibility that the memcg kmem caches may remain even after the
>>>> memory cgroups have been offlined. Therefore, it will be useful to show
>>>> the status of each of memcg kmem caches.
>>>>
>>>> This patch introduces a new <debugfs>/memcg_slabinfo file which is
>>>> somewhat similar to /proc/slabinfo in format, but lists only information
>>>> about kmem caches that have child memcg kmem caches. Information
>>>> available in /proc/slabinfo is not repeated in memcg_slabinfo.
>>>>
>>>> A portion of a sample output of the file was:
>>>>
>>>> # <name> <css_id[:dead]> <active_objs> <num_objs> <active_slabs> <num_slabs>
>>>> rpc_inode_cache root 13 51 1 1
>>>> rpc_inode_cache 48 0 0 0 0
>>>> fat_inode_cache root 1 45 1 1
>>>> fat_inode_cache 41 2 45 1 1
>>>> xfs_inode root 770 816 24 24
>>>> xfs_inode 92 22 34 1 1
>>>> xfs_inode 88:dead 1 34 1 1
>>>> xfs_inode 89:dead 23 34 1 1
>>>> xfs_inode 85 4 34 1 1
>>>> xfs_inode 84 9 34 1 1
>>>>
>>>> The css id of the memcg is also listed. If a memcg is not online,
>>>> the tag ":dead" will be attached as shown above.
>>>>
>>>> Suggested-by: Shakeel Butt <shakeelb@xxxxxxxxxx>
>>>> Signed-off-by: Waiman Long <longman@xxxxxxxxxx>
>>>> ---
>>>> mm/slab_common.c | 57 ++++++++++++++++++++++++++++++++++++++++++++++++
>>>> 1 file changed, 57 insertions(+)
>>>>
>>>> diff --git a/mm/slab_common.c b/mm/slab_common.c
>>>> index 58251ba63e4a..2bca1558a722 100644
>>>> --- a/mm/slab_common.c
>>>> +++ b/mm/slab_common.c
>>>> @@ -17,6 +17,7 @@
>>>> #include <linux/uaccess.h>
>>>> #include <linux/seq_file.h>
>>>> #include <linux/proc_fs.h>
>>>> +#include <linux/debugfs.h>
>>>> #include <asm/cacheflush.h>
>>>> #include <asm/tlbflush.h>
>>>> #include <asm/page.h>
>>>> @@ -1498,6 +1499,62 @@ static int __init slab_proc_init(void)
>>>> return 0;
>>>> }
>>>> module_init(slab_proc_init);
>>>> +
>>>> +#if defined(CONFIG_DEBUG_FS) && defined(CONFIG_MEMCG_KMEM)
>>>> +/*
>>>> + * Display information about kmem caches that have child memcg caches.
>>>> + */
>>>> +static int memcg_slabinfo_show(struct seq_file *m, void *unused)
>>>> +{
>>>> + struct kmem_cache *s, *c;
>>>> + struct slabinfo sinfo;
>>>> +
>>>> + mutex_lock(&slab_mutex);
>>> On large machines there can be thousands of memcgs and potentially
>>> each memcg can have hundreds of kmem caches. So, the slab_mutex can be
>>> held for a very long time.
>> But that is also what /proc/slabinfo does by doing mutex_lock() at
>> slab_start() and mutex_unlock() at slab_stop(). So the same problem will
>> happen when /proc/slabinfo is being read.
>>
>> When you are in a situation where reading /proc/slabinfo takes a long
>> time because of the large number of memcgs, the system is in some kind
>> of trouble anyway. I am not saying that we should not improve the
>> scalability of this patch. It is just that some nasty race conditions
>> may pop up if we release the lock and re-acquire it later. That will
>> greatly complicate the code needed to handle all those edge cases.
>>
> We have been using that interface and implementation for a couple of
> years and have not seen any race condition. However I am fine with
> what you have here for now. We can always come back if we think we
> need to improve it.
>
>>> Our internal implementation traverses the memcg tree and then
>>> traverses 'memcg->kmem_caches' within the slab_mutex (and
>>> cond_resched() after unlock).
>> For cgroup v1, the setting of the CONFIG_SLUB_DEBUG option will allow
>> you to iterate and display slabinfo just for that particular memcg. I am
>> thinking of extending the debug controller to do a similar thing for
>> cgroup v2.
> I was also planning to look into that and it seems like you are
> already on it. Do CC me the patches.
>
Sure.


>>>> + seq_puts(m, "# <name> <css_id[:dead]> <active_objs> <num_objs>");
>>>> + seq_puts(m, " <active_slabs> <num_slabs>\n");
>>>> + list_for_each_entry(s, &slab_root_caches, root_caches_node) {
>>>> + /*
>>>> + * Skip kmem caches that don't have any memcg children.
>>>> + */
>>>> + if (list_empty(&s->memcg_params.children))
>>>> + continue;
>>>> +
>>>> + memset(&sinfo, 0, sizeof(sinfo));
>>>> + get_slabinfo(s, &sinfo);
>>>> + seq_printf(m, "%-17s root %6lu %6lu %6lu %6lu\n",
>>>> + cache_name(s), sinfo.active_objs, sinfo.num_objs,
>>>> + sinfo.active_slabs, sinfo.num_slabs);
>>>> +
>>>> + for_each_memcg_cache(c, s) {
>>>> + struct cgroup_subsys_state *css;
>>>> + char *dead = "";
>>>> +
>>>> + css = &c->memcg_params.memcg->css;
>>>> + if (!(css->flags & CSS_ONLINE))
>>>> + dead = ":dead";
>>> Please note that Roman's kmem cache reparenting patch series has made
>>> kmem caches of zombie memcgs a bit tricky. On memcg offlining the
>>> memcg kmem caches are reparented and the css->id can get recycled. So,
>>> we want to know that a kmem cache is reparented and which memcg it
>>> belonged to initially. To determine if a kmem cache is reparented, we
>>> can store a flag on the kmem cache, and for the previous memcg we can
>>> use fhandle. However, to not make this more complicated, for now we
>>> can just have the info that the kmem cache was reparented, i.e. belongs
>>> to an offlined memcg.
>> I need to play with Roman's kmem cache reparenting patch a bit more to
>> see how to properly recognize a reparent'ed kmem cache. What I have
>> noticed is that the dead kmem caches that I saw at boot up were gone
>> after applying his patch. So that is a good thing.
>>
> By gone, do you mean the kmem cache got freed or the kmem cache is not
> part of online parent memcg and thus no more dead kmem cache?
I just look at the online flag of the memcg's css. All of them are
online when the iteration is being done after Roman's patch. I will
probably need to check if reparenting has happened.
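
For anyone who wants to tally the ":dead" entries from user space, here is
a minimal sketch of parsing one line of the sample output quoted above. It
is not part of the patch. The struct and function names are hypothetical,
and it assumes the six whitespace-separated columns shown in the header
line, with the second column being "root", "<id>" or "<id>:dead".

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical record for one parsed memcg_slabinfo line. */
struct memcg_slab_line {
	char name[64];
	char css_id[32];	/* "root", "<id>" or "<id>:dead" */
	unsigned long active_objs;
	unsigned long num_objs;
	unsigned long active_slabs;
	unsigned long num_slabs;
	int is_root;		/* second column is "root" */
	int is_dead;		/* second column ends in ":dead" */
};

/* Returns 0 on success, -1 for the header line or a malformed line. */
static int parse_memcg_slab_line(const char *line, struct memcg_slab_line *out)
{
	const char *tag;

	assert(line && out);

	if (line[0] == '#')
		return -1;
	if (sscanf(line, "%63s %31s %lu %lu %lu %lu",
		   out->name, out->css_id,
		   &out->active_objs, &out->num_objs,
		   &out->active_slabs, &out->num_slabs) != 6)
		return -1;

	out->is_root = strcmp(out->css_id, "root") == 0;
	tag = strchr(out->css_id, ':');
	out->is_dead = tag != NULL && strcmp(tag, ":dead") == 0;
	return 0;
}

/* Count the lines tagged ":dead" in a NULL-terminated array of lines. */
static unsigned int count_dead_caches(const char *const *lines)
{
	struct memcg_slab_line rec;
	unsigned int dead = 0;

	for (; *lines; lines++)
		if (parse_memcg_slab_line(*lines, &rec) == 0 && rec.is_dead)
			dead++;
	return dead;
}
```

Note that this only reflects the online flag as printed by the file. It
says nothing about whether a cache has been reparented.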
>
>> For now, I think the current patch is good enough for its purpose. I may
>> send follow-up if I see something that can be improved.
>>
> I would like to see the recognition of reparent'ed kmem cache in this
> patch. However if others are ok with the current status of the patch
> then I will not stand in the way.

As I said, I will work on a follow-up patch to recognize reparenting.
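
To make the idea concrete, below is a rough user-space sketch of how the
css_id column could be chosen once a cache carries a reparented marker. It
is only an illustration, not the follow-up patch. The struct, its fields
and format_css_tag() are hypothetical stand-ins, and it follows the simpler
option discussed above of reporting a reparented cache as belonging to an
offlined memcg.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* User-space stand-in for the per-cache state; all names are hypothetical. */
struct cache_state {
	int css_id;
	bool online;		/* stands in for css->flags & CSS_ONLINE */
	bool reparented;	/* hypothetical flag set when the cache is reparented */
};

/*
 * Write "<id>" or "<id>:dead" into buf. A cache is reported as dead if
 * its memcg is offline or if the cache has been reparented.
 * Returns the snprintf() result.
 */
static int format_css_tag(const struct cache_state *c, char *buf, size_t len)
{
	assert(c && buf);

	return snprintf(buf, len, "%d%s", c->css_id,
			(!c->online || c->reparented) ? ":dead" : "");
}
```

With this, a cache whose css is online but which has been reparented would
still show up with the ":dead" tag.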

Cheers,
Longman