[PATCH mm-unstable v2 0/8] mm: multi-gen LRU: memcg LRU

From: Yu Zhao
Date: Tue Dec 20 2022 - 19:12:54 EST


What's new
==========
1. Rebased to the latest mm-unstable.
2. Added two comprehensive benchmarks:
https://lore.kernel.org/r/20221220214923.1229538-1-yuzhao@xxxxxxxxxx/
https://lore.kernel.org/r/20221221000748.1374772-1-yuzhao@xxxxxxxxxx/

Overview
========
A memcg LRU is a per-node LRU of memcgs. It is also an LRU of LRUs,
since each node and memcg combination has an LRU of folios (see
mem_cgroup_lruvec()).
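
For orientation, here is a minimal C sketch of that structure. It is
not the patch's exact layout: the names (memcg_lru, memcg_lruvec) and
the generation/bin counts below are illustrative assumptions; patch 6
has the actual definitions.

  #include <linux/list_nulls.h>
  #include <linux/spinlock.h>

  #define MEMCG_NR_GENS 2 /* the young and the old */
  #define MEMCG_NR_BINS 8 /* shards within each generation */

  /* A stand-in for the real lruvec, i.e., what the memcg LRU links
     together. The list node below is the pointer-sized cost added to
     each lruvec, mentioned below. */
  struct memcg_lruvec {
          struct hlist_nulls_node list;
          /* ... the LRU of folios for one node and memcg combination,
             i.e., what mem_cgroup_lruvec() returns ... */
  };

  /* A per-node LRU of memcgs, i.e., an LRU of LRUs. */
  struct memcg_lru {
          unsigned long seq; /* its increment promotes, like max_seq */
          unsigned long nr_memcgs[MEMCG_NR_GENS];
          struct hlist_nulls_head fifo[MEMCG_NR_GENS][MEMCG_NR_BINS];
          spinlock_t lock; /* protects all of the above */
  };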

Its goal is to improve the scalability of global reclaim, which is
critical to system-wide memory overcommit in data centers. Note that
memcg reclaim is currently out of scope.

Its memory bloat is one pointer per lruvec and a negligible amount
per pglist_data. In terms of traversing memcgs during global reclaim,
it improves the best-case complexity from O(n) to O(1) and does not
affect the worst-case complexity, which remains O(n). Therefore, on
average, it has sublinear complexity, in contrast to the current
linear complexity.

The basic structure of a memcg LRU can be understood by an analogy
to the active/inactive LRU (of folios):
1. It has the young and the old (generations), the counterparts to
   the active and the inactive;
2. The increment of max_seq triggers promotion, the counterpart to
   activation;
3. Other events, e.g., offlining a memcg, trigger similar operations,
   e.g., demotion, the counterpart to deactivation (see the sketch
   after this list).
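
As a sketch only, building on the hypothetical types above (the real
patch centralizes such moves in one place; see patch 6), promotion
and demotion amount to moving an lruvec between the two generations,
which are derived from seq much like folio gens are derived from
max_seq:

  #include <linux/random.h>
  #include <linux/rculist_nulls.h>

  /* Move an lruvec to the young or the old generation. With two
     generations, the young one tracks seq and the old one is the
     other; per-gen counts and irq safety are omitted for brevity. */
  static void memcg_lru_move(struct memcg_lru *lru,
                             struct memcg_lruvec *lruvec, bool young)
  {
          int gen = (lru->seq + (young ? 0 : 1)) % MEMCG_NR_GENS;
          int bin = get_random_u32_below(MEMCG_NR_BINS);

          spin_lock(&lru->lock);
          hlist_nulls_del_init_rcu(&lruvec->list);
          hlist_nulls_add_head_rcu(&lruvec->list, &lru->fifo[gen][bin]);
          spin_unlock(&lru->lock);
  }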

In terms of global reclaim, it has two distinct features:
1. Sharding, which allows each thread to start at a random memcg (in
   the old generation) and improves parallelism;
2. Eventual fairness, which allows direct reclaim to bail out and
   reduces latency without affecting fairness over a longer time
   window (see the sketch after this list).
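
Under the same assumptions as above, with evict_folios() as a
hypothetical stand-in for per-memcg eviction, the two features look
roughly like this (the real code drops RCU around eviction and also
handles nulls-marker restarts, rotation, etc.; see patch 6):

  /* Walk the old generation, starting at a random bin so that
     concurrent reclaimers spread out (sharding). Return early once
     enough has been reclaimed (bail out); skipped memcgs stay in the
     old generation, where later passes find them (eventual
     fairness). */
  static unsigned long shrink_memcgs(struct memcg_lru *lru,
                                     unsigned long nr_to_reclaim)
  {
          unsigned long nr_reclaimed = 0;
          int gen = (lru->seq + 1) % MEMCG_NR_GENS; /* the old gen */
          int first = get_random_u32_below(MEMCG_NR_BINS);

          rcu_read_lock();

          for (int i = 0; i < MEMCG_NR_BINS; i++) {
                  struct memcg_lruvec *lruvec;
                  struct hlist_nulls_node *pos;
                  int bin = (first + i) % MEMCG_NR_BINS;

                  hlist_nulls_for_each_entry_rcu(lruvec, pos,
                                  &lru->fifo[gen][bin], list) {
                          nr_reclaimed += evict_folios(lruvec,
                                          nr_to_reclaim - nr_reclaimed);
                          if (nr_reclaimed >= nr_to_reclaim)
                                  goto done;
                  }
          }
  done:
          rcu_read_unlock();

          return nr_reclaimed;
  }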

The commit message in patch 6 details the workflow:
https://lore.kernel.org/r/20221221001207.1376119-7-yuzhao@xxxxxxxxxx/

The following is a simple test to quickly verify its effectiveness.
More benchmarks are coming soon.

Test design:
1. Create multiple memcgs.
2. Each memcg contains a job (fio).
3. All jobs access the same amount of memory randomly.
4. The system does not experience global memory pressure.
5. Periodically write to the root memory.reclaim.

Desired outcome:
1. All memcgs have similar pgsteal counts, i.e.,
   stddev(pgsteal)/mean(pgsteal) is close to 0%.
2. The total pgsteal is close to the total requested through
   memory.reclaim, i.e., sum(pgsteal)/sum(requested) is close to
   100%.

Actual outcome [1]:
                                   MGLRU off   MGLRU on
  stddev(pgsteal) / mean(pgsteal)     75%         20%
  sum(pgsteal) / sum(requested)      425%         95%
####################################################################
MEMCGS=128

# Create one memcg per job.
for ((memcg = 0; memcg < $MEMCGS; memcg++)); do
    mkdir /sys/fs/cgroup/memcg$memcg
done

# Run one fio job in its own memcg; all jobs access the same amount
# of memory randomly.
start() {
    echo $BASHPID > /sys/fs/cgroup/memcg$memcg/cgroup.procs

    fio --name=memcg$memcg --numjobs=1 --ioengine=mmap \
        --filename=/dev/zero --size=1920M --rw=randrw \
        --rate=64m,64m --random_distribution=random \
        --fadvise_hint=0 --time_based --runtime=10h \
        --group_reporting --minimal
}

for ((memcg = 0; memcg < $MEMCGS; memcg++)); do
    start &
done

# Let the jobs warm up, then periodically write to the root
# memory.reclaim for an hour.
sleep 600

for ((i = 0; i < 600; i++)); do
    echo 256m >/sys/fs/cgroup/memory.reclaim
    sleep 6
done

# Collect the per-memcg pgsteal counts.
for ((memcg = 0; memcg < $MEMCGS; memcg++)); do
    grep "pgsteal " /sys/fs/cgroup/memcg$memcg/memory.stat
done
####################################################################

[1]: This was obtained from running the above script (which touches
     less than 256GB of memory in total: 128 jobs x 1920MB = 240GB)
     on an EPYC 7B13 with 512GB DRAM for over an hour.

Yu Zhao (8):
mm: multi-gen LRU: rename lru_gen_struct to lru_gen_folio
mm: multi-gen LRU: rename lrugen->lists[] to lrugen->folios[]
mm: multi-gen LRU: remove eviction fairness safeguard
mm: multi-gen LRU: remove aging fairness safeguard
mm: multi-gen LRU: shuffle should_run_aging()
mm: multi-gen LRU: per-node lru_gen_folio lists
mm: multi-gen LRU: clarify scan_control flags
mm: multi-gen LRU: simplify arch_has_hw_pte_young() check

Documentation/mm/multigen_lru.rst | 8 +-
include/linux/memcontrol.h | 10 +
include/linux/mm_inline.h | 25 +-
include/linux/mmzone.h | 131 ++++-
mm/memcontrol.c | 16 +
mm/page_alloc.c | 1 +
mm/vmscan.c | 768 ++++++++++++++++++++----------
mm/workingset.c | 4 +-
8 files changed, 692 insertions(+), 271 deletions(-)

--
2.39.0.314.g84b9a713c41-goog