[PATCH v2 0/6] mm: reduce the memory footprint of dying memory cgroups

From: Roman Gushchin
Date: Tue Mar 12 2019 - 18:34:11 EST


A cgroup can remain in the dying state for a long time, pinned in memory
by any kernel object that still references it. It can be pinned by a page
shared with another cgroup (e.g. mlocked by a process in that cgroup), by
a vfs cache object, etc.

Mostly because of percpu data, a memcg structure takes up quite a lot of
kernel memory. Depending on the machine size and the kernel config, it can
easily reach hundreds of kilobytes per cgroup.

Depending on the memory pressure and the reclaim approach (which is a separate
topic), several hundred (if not a few thousand) dying cgroups appears to be
a typical number. On a moderately sized machine their overall memory
footprint is measured in hundreds of megabytes.

So if we can't completely get rid of dying cgroups, let's make them smaller.
This patchset aims to reduce the size of a dying memory cgroup by releasing
its percpu data prematurely, during cgroup removal, and using atomic
counterparts instead. Currently it covers the per-memcg vmstats_percpu and
the per-memcg per-node lruvec_stat_cpu. The same approach can be further
applied to other percpu data.
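
To illustrate the approach described above, here is a minimal userspace
sketch (not code from the patches): a counter keeps per-CPU slots while the
cgroup is alive, and on removal the slots are folded into a shared atomic
value and freed. The names counter_sketch, counter_add(), counter_offline(),
counter_read() and NR_CPUS_SKETCH are hypothetical, and the sketch is
single-threaded: it leaves out the synchronization against concurrent
updaters that real kernel code would need.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>

#define NR_CPUS_SKETCH 4	/* hypothetical CPU count for this sketch */

/* Hypothetical stand-in for one memcg counter; not the kernel's layout. */
struct counter_sketch {
	long *percpu;		/* one slot per CPU; NULL once released */
	atomic_long total;	/* shared atomic counterpart */
};

static int counter_init(struct counter_sketch *c)
{
	c->percpu = calloc(NR_CPUS_SKETCH, sizeof(*c->percpu));
	atomic_init(&c->total, 0);
	return c->percpu ? 0 : -1;
}

/* Update path: use the per-CPU slot while it exists, else the atomic. */
static void counter_add(struct counter_sketch *c, int cpu, long val)
{
	assert(cpu >= 0 && cpu < NR_CPUS_SKETCH);
	if (c->percpu)
		c->percpu[cpu] += val;
	else
		atomic_fetch_add(&c->total, val);
}

/* Removal path: flush every per-CPU slot into the atomic, then free. */
static void counter_offline(struct counter_sketch *c)
{
	if (!c->percpu)
		return;
	for (int cpu = 0; cpu < NR_CPUS_SKETCH; cpu++)
		atomic_fetch_add(&c->total, c->percpu[cpu]);
	free(c->percpu);
	c->percpu = NULL;
}

/* Read path: atomic part plus whatever is still held per CPU. */
static long counter_read(struct counter_sketch *c)
{
	long sum = atomic_load(&c->total);

	if (c->percpu)
		for (int cpu = 0; cpu < NR_CPUS_SKETCH; cpu++)
			sum += c->percpu[cpu];
	return sum;
}
```

After counter_offline() the counter still reads the same value and still
accepts updates, but it no longer holds NR_CPUS_SKETCH slots of memory,
which is the saving the patchset is after.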

Results on my test machine (32 CPUs, single node):

                         With the patchset:    Originally:

nr_dying_descendants 0
Slab:                    66640 kB              67644 kB
Percpu:                   6912 kB               6912 kB

nr_dying_descendants 1000
Slab:                    85912 kB              84704 kB
Percpu:                  26880 kB              64128 kB

So one dying cgroup went from 75 kB to 39 kB, i.e. to almost half the size.
The difference will be even bigger on a bigger machine (especially with
NUMA).
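
The per-cgroup figures follow from the growth of Slab and Percpu between
the two measurements, divided by the number of dying cgroups. A small
sketch of that arithmetic, with per_cgroup_kb() being a hypothetical helper
name:

```c
#include <assert.h>

/*
 * Hypothetical helper: average footprint of one dying cgroup in kB,
 * computed as the growth of Slab + Percpu (both in kB, as reported by
 * /proc/meminfo) divided by the number of dying cgroups.
 */
static double per_cgroup_kb(long slab_before, long percpu_before,
			    long slab_after, long percpu_after,
			    long nr_dying)
{
	long growth;

	assert(nr_dying > 0);
	growth = (slab_after - slab_before) + (percpu_after - percpu_before);
	return (double)growth / nr_dying;
}
```

With the figures from the table, per_cgroup_kb(66640, 6912, 85912, 26880,
1000) gives about 39.2 kB with the patchset, and per_cgroup_kb(67644, 6912,
84704, 64128, 1000) gives about 74.3 kB without it, in line with the
rounded 39 kB and 75 kB quoted above. The saving comes from the Percpu
line; Slab growth is slightly higher with the patchset.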

To test the patchset, I used the following script:

	CG=/sys/fs/cgroup/percpu_test/

	mkdir ${CG}
	echo "+memory" > ${CG}/cgroup.subtree_control

	# Baseline: no dying cgroups yet
	grep nr_dying_descendants ${CG}/cgroup.stat
	grep -e Percpu -e Slab /proc/meminfo

	for i in $(seq 1 1000); do
		mkdir ${CG}/${i}
		# Move this shell into the new cgroup and touch a file, so
		# that a page cache page is charged to the cgroup
		echo $$ > ${CG}/${i}/cgroup.procs
		dd if=/dev/urandom of=/tmp/test-${i} count=1 2> /dev/null
		# Move back to the root cgroup and remove the cgroup; the
		# charged page keeps it around in the dying state
		echo $$ > /sys/fs/cgroup/cgroup.procs
		rmdir ${CG}/${i}
	done

	grep nr_dying_descendants /sys/fs/cgroup/cgroup.stat
	grep -e Percpu -e Slab /proc/meminfo

	rmdir ${CG}


v2:
- several renamings suggested by Johannes Weiner
- added a patch that merges the cpu offlining and percpu flush code


Roman Gushchin (6):
mm: prepare to premature release of memcg->vmstats_percpu
mm: prepare to premature release of per-node lruvec_stat_cpu
mm: release memcg percpu data prematurely
mm: release per-node memcg percpu data prematurely
mm: flush memcg percpu stats and events before releasing
mm: refactor memcg_hotplug_cpu_dead() to use
memcg_flush_offline_percpu()

include/linux/memcontrol.h | 66 ++++++++++----
mm/memcontrol.c | 179 ++++++++++++++++++++++++++++---------
2 files changed, 186 insertions(+), 59 deletions(-)

--
2.20.1