Re: [PATCH v2 00/16] Multigenerational LRU Framework

From: Konstantin Kharlamov
Date: Thu Apr 29 2021 - 20:03:18 EST


In case you need it yet, this series is:

Tested-by: Konstantin Kharlamov <Hi-Angel@xxxxxxxxx>

My success story: I have Archlinux with 8G RAM + zswap + swap. While developing,
I have lots of apps opened such as multiple LSP-servers for different langs,
chats, two browsers, etc… Usually, my system gets quickly to a point of SWAP-
storms, where I have to kill LSP-servers, restart browsers to free memory, etc,
otherwise the system lags heavily and is barely usable.

1.5 day ago I migrated from 5.11.15 kernel to 5.12 + the LRU patchset, and I
started up by opening lots of apps to create memory pressure, and worked for a
day like this. Till now I had *not a single SWAP-storm*, and mind you I got 3.4G
in SWAP. I was never getting to the point of 3G in SWAP before without a single
SWAP-storm.

Right now my gf on Fedora 33 also suffers from SWAP-storms on her old Macbook
2013 with 4G RAM + zswap + swap, I think the next week I'll build for her 5.12 +
LRU patchset as well. Will see how it goes, I expect it will improve her
experience by a lot too.

P.S.: upon replying please keep me CCed, I'm not subscribed to the list

On Tue, 2021-04-13 at 00:56 -0600, Yu Zhao wrote:
> What's new in v2
> ================
> Special thanks to Jens Axboe for reporting a regression in buffered
> I/O and helping test the fix.
>
> This version includes the support of tiers, which represent levels of
> usage from file descriptors only. Pages accessed N times via file
> descriptors belong to tier order_base_2(N). Each generation contains
> at most MAX_NR_TIERS tiers, and they require additional MAX_NR_TIERS-2
> bits in page->flags. In contrast to moving across generations which
> requires the lru lock, moving across tiers only involves an atomic
> operation on page->flags and therefore has a negligible cost. A
> feedback loop modeled after the well-known PID controller monitors the
> refault rates across all tiers and decides when to activate pages from
> which tiers, on the reclaim path.
>
> This feedback model has a few advantages over the current feedforward
> model:
> 1) It has a negligible overhead in the buffered I/O access path
>    because activations are done in the reclaim path.
> 2) It takes mapped pages into account and avoids overprotecting pages
>    accessed multiple times via file descriptors.
> 3) More tiers offer better protection to pages accessed more than
>    twice when buffered-I/O-intensive workloads are under memory
>    pressure.
>
> The fio/io_uring benchmark shows 14% improvement in IOPS when randomly
> accessing Samsung PM981a in the buffered I/O mode.
>
> Highlights from the discussions on v1
> =====================================
> Thanks to Ying Huang and Dave Hansen for the comments and suggestions
> on page table scanning.
>
> A simple worst-case scenario test did not find page table scanning
> underperforms the rmap because of the following optimizations:
> 1) It will not scan page tables from processes that have been sleeping
>    since the last scan.
> 2) It will not scan PTE tables under non-leaf PMD entries that do not
>    have the accessed bit set, when
>    CONFIG_HAVE_ARCH_PARENT_PMD_YOUNG=y.
> 3) It will not zigzag between the PGD table and the same PMD or PTE
>    table spanning multiple VMAs. In other words, it finishes all the
>    VMAs with the range of the same PMD or PTE table before it returns
>    to the PGD table. This optimizes workloads that have large numbers
>    of tiny VMAs, especially when CONFIG_PGTABLE_LEVELS=5.
>
> TLDR
> ====
> The current page reclaim is too expensive in terms of CPU usage and
> often making poor choices about what to evict. We would like to offer
> an alternative framework that is performant, versatile and
> straightforward.
>
> Repo
> ====
> git fetch https://linux-mm.googlesource.com/page-reclaim refs/changes/73/1173/1
>
> Gerrit https://linux-mm-review.googlesource.com/c/page-reclaim/+/1173
>
> Background
> ==========
> DRAM is a major factor in total cost of ownership, and improving
> memory overcommit brings a high return on investment. Over the past
> decade of research and experimentation in memory overcommit, we
> observed a distinct trend across millions of servers and clients: the
> size of page cache has been decreasing because of the growing
> popularity of cloud storage. Nowadays anon pages account for more than
> 90% of our memory consumption and page cache contains mostly
> executable pages.
>
> Problems
> ========
> Notion of active/inactive
> -------------------------
> For servers equipped with hundreds of gigabytes of memory, the
> granularity of the active/inactive is too coarse to be useful for job
> scheduling. False active/inactive rates are relatively high, and thus
> the assumed savings may not materialize.
>
> For phones and laptops, executable pages are frequently evicted
> despite the fact that there are many less recently used anon pages.
> Major faults on executable pages cause "janks" (slow UI renderings)
> and negatively impact user experience.
>
> For lruvecs from different memcgs or nodes, comparisons are impossible
> due to the lack of a common frame of reference.
>
> Incremental scans via rmap
> --------------------------
> Each incremental scan picks up at where the last scan left off and
> stops after it has found a handful of unreferenced pages. For
> workloads using a large amount of anon memory, incremental scans lose
> the advantage under sustained memory pressure due to high ratios of
> the number of scanned pages to the number of reclaimed pages. In our
> case, the average ratio of pgscan to pgsteal is above 7.
>
> On top of that, the rmap has poor memory locality due to its complex
> data structures. The combined effects typically result in a high
> amount of CPU usage in the reclaim path. For example, with zram, a
> typical kswapd profile on v5.11 looks like:
>   31.03%  page_vma_mapped_walk
>   25.59%  lzo1x_1_do_compress
>    4.63%  do_raw_spin_lock
>    3.89%  vma_interval_tree_iter_next
>    3.33%  vma_interval_tree_subtree_search
>
> And with real swap, it looks like:
>   45.16%  page_vma_mapped_walk
>    7.61%  do_raw_spin_lock
>    5.69%  vma_interval_tree_iter_next
>    4.91%  vma_interval_tree_subtree_search
>    3.71%  page_referenced_one
>
> Solutions
> =========
> Notion of generation numbers
> ----------------------------
> The notion of generation numbers introduces a quantitative approach to
> memory overcommit. A larger number of pages can be spread out across
> a configurable number of generations, and each generation includes all
> pages that have been referenced since the last generation. This
> improved granularity yields relatively low false active/inactive
> rates.
>
> Given an lruvec, scans of anon and file types and selections between
> them are all based on direct comparisons of generation numbers, which
> are simple and yet effective. For different lruvecs, comparisons are
> still possible based on birth times of generations.
>
> Differential scans via page tables
> ----------------------------------
> Each differential scan discovers all pages that have been referenced
> since the last scan. Specifically, it walks the mm_struct list
> associated with an lruvec to scan page tables of processes that have
> been scheduled since the last scan. The cost of each differential scan
> is roughly proportional to the number of referenced pages it
> discovers. Unless address spaces are extremely sparse, page tables
> usually have better memory locality than the rmap. The end result is
> generally a significant reduction in CPU usage, for workloads using a
> large amount of anon memory.
>
> Our real-world benchmark that browses popular websites in multiple
> Chrome tabs demonstrates 51% less CPU usage from kswapd and 52% (full)
> less PSI on v5.11. With this patchset, kswapd profile looks like:
>   49.36%  lzo1x_1_do_compress
>    4.54%  page_vma_mapped_walk
>    4.45%  memset_erms
>    3.47%  walk_pte_range
>    2.88%  zram_bvec_rw
>
> In addition, direct reclaim latency is reduced by 22% at 99th
> percentile and the number of refaults is reduced by 7%. Both metrics
> are important to phones and laptops as they are correlated to user
> experience.
>
> Framework
> =========
> For each lruvec, evictable pages are divided into multiple
> generations. The youngest generation number is stored in
> lruvec->evictable.max_seq for both anon and file types as they are
> aged on an equal footing. The oldest generation numbers are stored in
> lruvec->evictable.min_seq[2] separately for anon and file types as
> clean file pages can be evicted regardless of may_swap or
> may_writepage. Generation numbers are truncated into
> order_base_2(MAX_NR_GENS+1) bits in order to fit into page->flags. The
> sliding window technique is used to prevent truncated generation
> numbers from overlapping. Each truncated generation number is an inde
> to lruvec->evictable.lists[MAX_NR_GENS][ANON_AND_FILE][MAX_NR_ZONES].
> Evictable pages are added to the per-zone lists indexed by max_seq or
> min_seq[2] (modulo MAX_NR_GENS), depending on whether they are being
> faulted in.
>
> Each generation is then divided into multiple tiers. Tiers represent
> levels of usage from file descriptors only. Pages accessed N times via
> file descriptors belong to tier order_base_2(N). In contrast to moving
> across generations which requires the lru lock, moving across tiers
> only involves an atomic operation on page->flags and therefore has a
> lower cost. A feedback loop modeled after the well-known PID
> controller monitors the refault rates across all tiers and decides
> when to activate pages from which tiers on the reclaim path.
>
> The framework comprises two conceptually independent components: the
> aging and the eviction, which can be invoked separately from user
> space.
>
> Aging
> -----
> The aging produces young generations. Given an lruvec, the aging scans
> page tables for referenced pages of this lruvec. Upon finding one, the
> aging updates its generation number to max_seq. After each round of
> scan, the aging increments max_seq.
>
> The aging maintains either a system-wide mm_struct list or per-memcg
> mm_struct lists and tracks whether an mm_struct is being used or has
> been used since the last scan. Multiple threads can concurrently work
> on the same mm_struct list, and each of them will be given a different
> mm_struct belonging to a process that has been scheduled since the
> last scan.
>
> The aging is due when both of min_seq[2] reaches max_seq-1, assuming
> both anon and file types are reclaimable.
>
> Eviction
> --------
> The eviction consumes old generations. Given an lruvec, the eviction
> scans the pages on the per-zone lists indexed by either of min_seq[2].
> It first tries to select a type based on the values of min_seq[2].
> When anon and file types are both available from the same generation,
> it selects the one that has a lower refault rate.
>
> During a scan, the eviction sorts pages according to their generation
> numbers, if the aging has found them referenced. It also moves pages
> from the tiers that have higher refault rates than tier 0 to the next
> generation.
>
> When it finds all the per-zone lists of a selected type are empty, the
> eviction increments min_seq[2] indexed by this selected type.
>
> Use cases
> =========
> On Android, our most advanced simulation that generates memory
> pressure from realistic user behavior shows 18% fewer low-memory
> kills, which in turn reduces cold starts by 16%.
>
> On Borg, a similar approach enables us to identify jobs that
> underutilize their memory and downsize them considerably without
> compromising any of our service level indicators.
>
> On Chrome OS, our field telemetry reports 96% fewer low-memory tab
> discards and 59% fewer OOM kills from fully-utilized devices and no
> regressions in monitored user experience from underutilized devices.
>
> Working set estimation
> ----------------------
> User space can invoke the aging by writing "+ memcg_id node_id gen
> [swappiness]" to /sys/kernel/debug/lru_gen. This debugfs interface
> also provides the birth time and the size of each generation.
>
> Proactive reclaim
> -----------------
> User space can invoke the eviction by writing "- memcg_id node_id gen
> [swappiness] [nr_to_reclaim]" to /sys/kernel/debug/lru_gen. Multiple
> command lines are supported, so does concatenation with delimiters.
>
> Intensive buffered I/O
> ----------------------
> Tiers are specifically designed to improve the performance of
> intensive buffered I/O under memory pressure. The fio/io_uring
> benchmark shows 14% improvement in IOPS when randomly accessing
> Samsung PM981a in buffered I/O mode.
>
> For far memory tiering and NUMA-aware job scheduling, please refer to
> the reference section.
>
> FAQ
> ===
> Why not try to improve the existing code?
> -----------------------------------------
> We have tried but concluded the aforementioned problems are
> fundamental, and therefore changes made on top of them will not result
> in substantial gains.
>
> What particular workloads does it help?
> ---------------------------------------
> This framework is designed to improve the performance of the page
> reclaim under any types of workloads.
>
> How would it benefit the community?
> -----------------------------------
> Google is committed to promoting sustainable development of the
> community. We hope successful adoptions of this framework will
> steadily climb over time. To that end, we would be happy to learn your
> workloads and work with you case by case, and we will do our best to
> keep the repo fully maintained. For those whose workloads rely on the
> existing code, we will make sure you will not be affected in any way.
>
> References
> ==========
> 1. Long-term SLOs for reclaimed cloud computing resources
>    https://research.google/pubs/pub43017/
> 2. Profiling a warehouse-scale computer
>    https://research.google/pubs/pub44271/
> 3. Evaluation of NUMA-Aware Scheduling in Warehouse-Scale Clusters
>    https://research.google/pubs/pub48329/
> 4. Software-defined far memory in warehouse-scale computers
>    https://research.google/pubs/pub48551/
> 5. Borg: the Next Generation
>    https://research.google/pubs/pub49065/
>
> Yu Zhao (16):
>   include/linux/memcontrol.h: do not warn in page_memcg_rcu() if
>     !CONFIG_MEMCG
>   include/linux/nodemask.h: define next_memory_node() if !CONFIG_NUMA
>   include/linux/huge_mm.h: define is_huge_zero_pmd() if
>     !CONFIG_TRANSPARENT_HUGEPAGE
>   include/linux/cgroup.h: export cgroup_mutex
>   mm/swap.c: export activate_page()
>   mm, x86: support the access bit on non-leaf PMD entries
>   mm/vmscan.c: refactor shrink_node()
>   mm: multigenerational lru: groundwork
>   mm: multigenerational lru: activation
>   mm: multigenerational lru: mm_struct list
>   mm: multigenerational lru: aging
>   mm: multigenerational lru: eviction
>   mm: multigenerational lru: page reclaim
>   mm: multigenerational lru: user interface
>   mm: multigenerational lru: Kconfig
>   mm: multigenerational lru: documentation
>
>  Documentation/vm/index.rst        |    1 +
>  Documentation/vm/multigen_lru.rst |  192 +++
>  arch/Kconfig                      |    9 +
>  arch/x86/Kconfig                  |    1 +
>  arch/x86/include/asm/pgtable.h    |    2 +-
>  arch/x86/mm/pgtable.c             |    5 +-
>  fs/exec.c                         |    2 +
>  fs/fuse/dev.c                     |    3 +-
>  fs/proc/task_mmu.c                |    3 +-
>  include/linux/cgroup.h            |   15 +-
>  include/linux/huge_mm.h           |    5 +
>  include/linux/memcontrol.h        |    7 +-
>  include/linux/mm.h                |    2 +
>  include/linux/mm_inline.h         |  294 ++++
>  include/linux/mm_types.h          |  117 ++
>  include/linux/mmzone.h            |  118 +-
>  include/linux/nodemask.h          |    1 +
>  include/linux/page-flags-layout.h |   20 +-
>  include/linux/page-flags.h        |    4 +-
>  include/linux/pgtable.h           |    4 +-
>  include/linux/swap.h              |    5 +-
>  kernel/bounds.c                   |    6 +
>  kernel/events/uprobes.c           |    2 +-
>  kernel/exit.c                     |    1 +
>  kernel/fork.c                     |   10 +
>  kernel/kthread.c                  |    1 +
>  kernel/sched/core.c               |    2 +
>  mm/Kconfig                        |   55 +
>  mm/huge_memory.c                  |    5 +-
>  mm/khugepaged.c                   |    2 +-
>  mm/memcontrol.c                   |   28 +
>  mm/memory.c                       |   14 +-
>  mm/migrate.c                      |    2 +-
>  mm/mm_init.c                      |   16 +-
>  mm/mmzone.c                       |    2 +
>  mm/rmap.c                         |    6 +
>  mm/swap.c                         |   54 +-
>  mm/swapfile.c                     |    6 +-
>  mm/userfaultfd.c                  |    2 +-
>  mm/vmscan.c                       | 2580 ++++++++++++++++++++++++++++-
>  mm/workingset.c                   |  179 +-
>  41 files changed, 3603 insertions(+), 180 deletions(-)
>  create mode 100644 Documentation/vm/multigen_lru.rst
>