[PATCH mm-unstable v1 0/8] mm: multi-gen LRU: memcg LRU
From: Yu Zhao
Date: Thu Dec 01 2022 - 17:39:45 EST
An memcg LRU is a per-node LRU of memcgs. It is also an LRU of LRUs,
since each node and memcg combination has an LRU of folios (see
mem_cgroup_lruvec()).
Its goal is to improve the scalability of global reclaim, which is
critical to systemwide memory overcommit in data centers. Note that
memcg reclaim is currently out of scope.
Its memory overhead is one pointer per LRU vector, which is negligible
for each node. In terms of traversing memcgs during global reclaim, it
improves the best-case complexity from O(n) to O(1), where n is the
number of memcgs, and does not affect the worst-case complexity of
O(n). Therefore, on average, it has sublinear complexity, in contrast
to the current linear complexity.
The basic structure of an memcg LRU can be understood by an analogy
to the active/inactive LRU (of folios):
1. It has the young and the old (generations);
2. Its linked lists have the head and the tail;
3. The increment of max_seq triggers promotion;
4. Other events, e.g., offlining an memcg, trigger similar
operations.
In terms of global reclaim, it has two distinct features:
1. Sharding, which allows each thread to start at a random memcg (in
the old generation) and improves parallelism;
2. Eventual fairness, which allows direct reclaim to bail out and
reduces latency without affecting fairness over some time.
The commit message in patch 6 details the workflow:
https://lore.kernel.org/r/20221201223923.873696-7-yuzhao@xxxxxxxxxx/
The following is a simple test to quickly verify its effectiveness.
More benchmarks are coming soon.
Test design:
1. Create multiple memcgs.
2. Each memcg contains a job (fio).
3. All jobs access the same amount of memory randomly.
4. The system does not experience global memory pressure.
5. Periodically write to the root memory.reclaim.
Desired outcome:
1. All memcgs have similar pgsteal, i.e.,
stddev(pgsteal)/mean(pgsteal) is close to 0%.
2. The total pgsteal is close to the total requested through
memory.reclaim, i.e., sum(pgsteal)/sum(requested) is close to
100%.
Actual outcome [1]:
             stddev(pgsteal)/mean(pgsteal)   sum(pgsteal)/sum(requested)
  MGLRU off  75%                             425%
  MGLRU on   20%                             95%
####################################################################
MEMCGS=128

# Create one memcg per job.
for ((memcg = 0; memcg < $MEMCGS; memcg++)); do
    mkdir /sys/fs/cgroup/memcg$memcg
done

# Move the current (sub)shell into its memcg, then run fio there.
start() {
    echo $BASHPID > /sys/fs/cgroup/memcg$memcg/cgroup.procs

    fio --name=memcg$memcg --numjobs=1 --ioengine=mmap \
        --filename=/dev/zero --size=1920M --rw=randrw \
        --rate=64m,64m --random_distribution=random \
        --fadvise_hint=0 --time_based --runtime=10h \
        --group_reporting --minimal
}

for ((memcg = 0; memcg < $MEMCGS; memcg++)); do
    start &
done

# Let the jobs run for 10 minutes before reclaiming.
sleep 600

# Request 256 MiB of reclaim from the root memcg, 600 times, 6 seconds apart.
for ((i = 0; i < 600; i++)); do
    echo 256m >/sys/fs/cgroup/memory.reclaim
    sleep 6
done

# Report pgsteal for each memcg.
for ((memcg = 0; memcg < $MEMCGS; memcg++)); do
    grep "pgsteal " /sys/fs/cgroup/memcg$memcg/memory.stat
done
####################################################################
[1]: This was obtained by running the above script (which touches
less than 256GB of memory) on an EPYC 7B13 with 512GB of DRAM for
over an hour.
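The two ratios in the table can be derived from the script's output
along the following lines. This is only a sketch and not part of the
series: summarize_pgsteal is a hypothetical helper, and it assumes
4 KiB pages and a population (not sample) standard deviation, so the
numbers above may have been computed slightly differently.

```shell
# Hypothetical helper: summarize per-memcg pgsteal counts.
#   stdin: one pgsteal value (in pages) per line
#   $1:    total number of pages requested through memory.reclaim
summarize_pgsteal() {
    awk -v requested="$1" '
        { sum += $1; sumsq += $1 * $1; n++ }
        END {
            if (n == 0 || sum == 0 || requested == 0) exit 1
            mean = sum / n
            var = sumsq / n - mean * mean   # population variance
            if (var < 0) var = 0            # guard against rounding
            printf "stddev/mean: %.0f%%\n", 100 * sqrt(var) / mean
            printf "sum/requested: %.0f%%\n", 100 * sum / requested
        }'
}

# 600 writes of 256 MiB each; assumes 4 KiB pages, i.e., 256 pages per MiB.
requested_pages=$((600 * 256 * 256))

# Usage, replacing the final loop of the test script:
#   for ((memcg = 0; memcg < $MEMCGS; memcg++)); do
#       awk '$1 == "pgsteal" { print $2 }' \
#           /sys/fs/cgroup/memcg$memcg/memory.stat
#   done | summarize_pgsteal "$requested_pages"
```

Both ratios are unitless, so the page-size assumption only matters for
converting the requested bytes into pages.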
Yu Zhao (8):
  mm: multi-gen LRU: rename lru_gen_struct to lru_gen_folio
  mm: multi-gen LRU: rename lrugen->lists[] to lrugen->folios[]
  mm: multi-gen LRU: remove eviction fairness safeguard
  mm: multi-gen LRU: remove aging fairness safeguard
  mm: multi-gen LRU: shuffle should_run_aging()
  mm: multi-gen LRU: per-node lru_gen_folio lists
  mm: multi-gen LRU: clarify scan_control flags
  mm: multi-gen LRU: simplify arch_has_hw_pte_young() check
 Documentation/mm/multigen_lru.rst |   8 +-
 include/linux/memcontrol.h        |  10 +
 include/linux/mm_inline.h         |  25 +-
 include/linux/mmzone.h            | 127 ++++-
 mm/memcontrol.c                   |  16 +
 mm/page_alloc.c                   |   1 +
 mm/vmscan.c                       | 765 ++++++++++++++++++++----------
 mm/workingset.c                   |   4 +-
 8 files changed, 687 insertions(+), 269 deletions(-)
--
2.39.0.rc0.267.gcb52ba06e7-goog