Re: [PATCH 2/2] mm, slab: Extend vm/drop_caches to shrink kmem slabs

From: Roman Gushchin
Date: Wed Jun 26 2019 - 16:20:17 EST


On Mon, Jun 24, 2019 at 01:42:19PM -0400, Waiman Long wrote:
> With the slub memory allocator, the number of active slab objects
> reported in /proc/slabinfo is not accurate because it includes objects
> that are held in the per-cpu slab structures whether they are actually
> in use or not. The problem gets worse the more CPUs a system has. For
> instance, looking at the reported number of active task_struct objects,
> one may wonder where all the missing tasks have gone.
>
> I know it is hard and costly to get a real count of active objects, so
> I am not advocating for that. Instead, this patch extends the
> /proc/sys/vm/drop_caches sysctl parameter with a new bit (bit 3)
> to shrink all the kmem slabs, which flushes out all the slabs in the
> per-cpu structures and gives a more accurate view of how much memory is
> really used up by the active slab objects. This is a costly operation,
> of course, but it gives a way to get a clearer picture of the actual
> number of slab objects in use, if the need arises.
>
> The upper range of the drop_caches sysctl parameter is increased to 15
> to allow all possible combinations of the lowest 4 bits.
>
> On a 2-socket 64-core 256-thread ARM64 system with 64k page size after
> a parallel kernel build, the amount of memory occupied by slabs before
> and after echoing to drop_caches were:
>
> # grep task_struct /proc/slabinfo
> task_struct 48376 48434 4288 61 4 : tunables 0 0
> 0 : slabdata 794 794 0
> # grep "^S[lRU]" /proc/meminfo
> Slab: 3419072 kB
> SReclaimable: 354688 kB
> SUnreclaim: 3064384 kB
> # echo 3 > /proc/sys/vm/drop_caches
> # grep "^S[lRU]" /proc/meminfo
> Slab: 3351680 kB
> SReclaimable: 316096 kB
> SUnreclaim: 3035584 kB
> # echo 8 > /proc/sys/vm/drop_caches
> # grep "^S[lRU]" /proc/meminfo
> Slab: 1008192 kB
> SReclaimable: 126912 kB
> SUnreclaim: 881280 kB
> # grep task_struct /proc/slabinfo
> task_struct 2601 6588 4288 61 4 : tunables 0 0
> 0 : slabdata 108 108 0
>
> Shrinking the slabs saves more than 2GB of memory in this case. This
> new feature certainly fulfills the promise of dropping caches.
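[For readers following along: since the drop_caches bits combine, a single
write can request several of these actions at once. A quick illustrative
sketch, not taken from the patch itself (requires root, and the sysctl must
accept values up to 15 as this patch allows):

```shell
# Combine the bits: 1 = free pagecache, 2 = free reclaimable slab
# objects, 8 = shrink the per-cpu kmem slabs (added by this patch).
val=$((1 | 2 | 8))
echo "$val"
# echo "$val" > /proc/sys/vm/drop_caches   # would perform all three
```

Writing the combined value 11 performs all three actions in one go.]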
>
> Unlike counting objects in the per-node caches, as done by
> /proc/slabinfo, which is rather lightweight, iterating over all the
> per-cpu caches and shrinking them is much more heavyweight.
>
> For this particular instance, the time taken to shrink all the root
> caches was about 30.2ms. There were 73 memory cgroups, and the longest
> time taken to shrink the largest one was about 16.4ms. The total
> shrinking time was about 101ms.
>
> Because of the potentially long time needed to shrink all the caches,
> the slab_mutex is taken multiple times - once for all the root caches
> and once for each memory cgroup. This reduces the slab_mutex hold
> time and minimizes the impact on other running applications that may
> need to acquire the mutex.
>
> The slab shrinking feature is only available when CONFIG_MEMCG_KMEM is
> defined, as the code needs to access slab_root_caches to iterate over
> all the root caches.
>
> Signed-off-by: Waiman Long <longman@xxxxxxxxxx>
> ---
> Documentation/sysctl/vm.txt | 11 ++++++++--
> fs/drop_caches.c | 4 ++++
> include/linux/slab.h | 1 +
> kernel/sysctl.c | 4 ++--
> mm/slab_common.c | 44 +++++++++++++++++++++++++++++++++++++
> 5 files changed, 60 insertions(+), 4 deletions(-)
>
> diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt
> index 749322060f10..b643ac8968d2 100644
> --- a/Documentation/sysctl/vm.txt
> +++ b/Documentation/sysctl/vm.txt
> @@ -207,8 +207,8 @@ Setting this to zero disables periodic writeback altogether.
> drop_caches
>
> Writing to this will cause the kernel to drop clean caches, as well as
> -reclaimable slab objects like dentries and inodes. Once dropped, their
> -memory becomes free.
> +reclaimable slab objects like dentries and inodes. It can also be used
> +to shrink the slabs. Once dropped, their memory becomes free.
>
> To free pagecache:
> echo 1 > /proc/sys/vm/drop_caches
> @@ -216,6 +216,8 @@ To free reclaimable slab objects (includes dentries and inodes):
> echo 2 > /proc/sys/vm/drop_caches
> To free slab objects and pagecache:
> echo 3 > /proc/sys/vm/drop_caches
> +To shrink the slabs:
> + echo 8 > /proc/sys/vm/drop_caches
>
> This is a non-destructive operation and will not free any dirty objects.
> To increase the number of objects freed by this operation, the user may run
> @@ -223,6 +225,11 @@ To increase the number of objects freed by this operation, the user may run
> number of dirty objects on the system and create more candidates to be
> dropped.
>
> +Shrinking the slabs can reduce the memory footprint used by the slabs.
> +It also makes the number of active objects reported in /proc/slabinfo
> +more representative of the actual number of objects used for the slub
> +memory allocator.
> +
> This file is not a means to control the growth of the various kernel caches
> (inodes, dentries, pagecache, etc...) These objects are automatically
> reclaimed by the kernel when memory is needed elsewhere on the system.
> diff --git a/fs/drop_caches.c b/fs/drop_caches.c
> index d31b6c72b476..633b99e25dab 100644
> --- a/fs/drop_caches.c
> +++ b/fs/drop_caches.c
> @@ -9,6 +9,7 @@
> #include <linux/writeback.h>
> #include <linux/sysctl.h>
> #include <linux/gfp.h>
> +#include <linux/slab.h>
> #include "internal.h"
>
> /* A global variable is a bit ugly, but it keeps the code simple */
> @@ -65,6 +66,9 @@ int drop_caches_sysctl_handler(struct ctl_table *table, int write,
> drop_slab();
> count_vm_event(DROP_SLAB);
> }
> + if (sysctl_drop_caches & 8) {
> + kmem_cache_shrink_all();
> + }
> if (!stfu) {
> pr_info("%s (%d): drop_caches: %d\n",
> current->comm, task_pid_nr(current),
> diff --git a/include/linux/slab.h b/include/linux/slab.h
> index 9449b19c5f10..f7c1626b2aa6 100644
> --- a/include/linux/slab.h
> +++ b/include/linux/slab.h
> @@ -149,6 +149,7 @@ struct kmem_cache *kmem_cache_create_usercopy(const char *name,
> void (*ctor)(void *));
> void kmem_cache_destroy(struct kmem_cache *);
> int kmem_cache_shrink(struct kmem_cache *);
> +void kmem_cache_shrink_all(void);
>
> void memcg_create_kmem_cache(struct mem_cgroup *, struct kmem_cache *);
> void memcg_deactivate_kmem_caches(struct mem_cgroup *);
> diff --git a/kernel/sysctl.c b/kernel/sysctl.c
> index 1beca96fb625..feeb867dabd7 100644
> --- a/kernel/sysctl.c
> +++ b/kernel/sysctl.c
> @@ -129,7 +129,7 @@ static int __maybe_unused neg_one = -1;
> static int zero;
> static int __maybe_unused one = 1;
> static int __maybe_unused two = 2;
> -static int __maybe_unused four = 4;
> +static int __maybe_unused fifteen = 15;
> static unsigned long zero_ul;
> static unsigned long one_ul = 1;
> static unsigned long long_max = LONG_MAX;
> @@ -1455,7 +1455,7 @@ static struct ctl_table vm_table[] = {
> .mode = 0644,
> .proc_handler = drop_caches_sysctl_handler,
> .extra1 = &one,
> - .extra2 = &four,
> + .extra2 = &fifteen,
> },
> #ifdef CONFIG_COMPACTION
> {
> diff --git a/mm/slab_common.c b/mm/slab_common.c
> index 58251ba63e4a..b3c5b64f9bfb 100644
> --- a/mm/slab_common.c
> +++ b/mm/slab_common.c
> @@ -956,6 +956,50 @@ int kmem_cache_shrink(struct kmem_cache *cachep)
> }
> EXPORT_SYMBOL(kmem_cache_shrink);

Hi Waiman!

>
> +#ifdef CONFIG_MEMCG_KMEM
> +static void kmem_cache_shrink_memcg(struct mem_cgroup *memcg,
> + void __maybe_unused *arg)
> +{
> + struct kmem_cache *s;
> +
> + if (memcg == root_mem_cgroup)
> + return;
> + mutex_lock(&slab_mutex);
> + list_for_each_entry(s, &memcg->kmem_caches,
> + memcg_params.kmem_caches_node) {
> + kmem_cache_shrink(s);
> + }
> + mutex_unlock(&slab_mutex);
> + cond_resched();
> +}

A couple of questions:
1) How about skipping already offlined kmem_caches? They are already
shrunk, so you probably won't get much out of them. Or isn't that true?
2) What's your long-term vision here? Do you think that we need to shrink
kmem_caches periodically, depending on memory pressure? How will a user
use this new sysctl?

What's the problem you're trying to solve in general?

Thanks!

Roman