Re: [PATCH mm-unstable v2 6/6] mm/mglru: rework workingset protection

From: Yu Zhao
Date: Fri Dec 06 2024 - 23:44:29 EST


On Thu, Dec 05, 2024 at 05:31:26PM -0700, Yu Zhao wrote:
> With the aging feedback no longer considering the distribution of
> folios in each generation, rework workingset protection to better
> distribute folios across MAX_NR_GENS. This is achieved by reusing
> PG_workingset and PG_referenced/LRU_REFS_FLAGS in a slightly different
> way.
>
> For folios accessed multiple times through file descriptors, make
> lru_gen_inc_refs() set additional bits of LRU_REFS_WIDTH in
> folio->flags after PG_referenced, then PG_workingset after
> LRU_REFS_WIDTH. After all its bits are set, i.e.,
> LRU_REFS_FLAGS|BIT(PG_workingset), a folio is lazily promoted into the
> second oldest generation in the eviction path. And when
> folio_inc_gen() does that, it clears LRU_REFS_FLAGS so that
> lru_gen_inc_refs() can start over. For this case, LRU_REFS_MASK is
> only valid when PG_referenced is set.
>
> For folios accessed multiple times through page tables,
> folio_update_gen() from a page table walk or lru_gen_set_refs() from a
> rmap walk sets PG_referenced after the accessed bit is cleared for the
> first time. Thereafter, those two paths set PG_workingset and promote
> folios to the youngest generation. Like folio_inc_gen(), when
> folio_update_gen() does that, it also clears PG_referenced. For this
> case, LRU_REFS_MASK is not used.
>
> For both of the cases, after PG_workingset is set on a folio, it
> remains until this folio is either reclaimed, or "deactivated" by
> lru_gen_clear_refs(). It can be set again if lru_gen_test_recent()
> returns true upon a refault.
>
> When adding folios to the LRU lists, lru_gen_distance() distributes
> them as follows:
> +---------------------------------+---------------------------------+
> |    Accessed thru page tables    | Accessed thru file descriptors  |
> +---------------------------------+---------------------------------+
> | PG_active (set while isolated)  |                                 |
> +----------------+----------------+----------------+----------------+
> |  PG_workingset |  PG_referenced |  PG_workingset | LRU_REFS_FLAGS |
> +---------------------------------+---------------------------------+
> |<--------- MIN_NR_GENS --------->|                                 |
> |<-------------------------- MAX_NR_GENS -------------------------->|
>
> After this patch, some typical client and server workloads showed
> improvements under heavy memory pressure. For example, Python TPC-C,
> which was used to benchmark a different approach [1] to better detect
> refault distances, showed a significant decrease in total refaults:
>                             Before        After        Change
>   Time (seconds)            10801         10801        0%
>   Executed (transactions)   41472         43663        +5%
>   workingset_nodes          109070        120244       +10%
>   workingset_refault_anon   5019627       7281831      +45%
>   workingset_refault_file   1294678786    554855564    -57%
>   workingset_refault_total  1299698413    562137395    -57%
>
> [1] https://lore.kernel.org/20230920190244.16839-1-ryncsn@xxxxxxxxx/
>
> Reported-by: Kairui Song <kasong@xxxxxxxxxxx>
> Closes: https://lore.kernel.org/CAOUHufahuWcKf5f1Sg3emnqX+cODuR=2TQo7T4Gr-QYLujn4RA@xxxxxxxxxxxxxx/
> Signed-off-by: Yu Zhao <yuzhao@xxxxxxxxxx>
> Tested-by: Kalesh Singh <kaleshsingh@xxxxxxxxxx>
> ---
>  include/linux/mm_inline.h |  94 +++++++++++++------------
>  include/linux/mmzone.h    |  82 +++++++++++++---------
>  mm/swap.c                 |  23 +++---
>  mm/vmscan.c               | 142 +++++++++++++++++++++++---------------
>  mm/workingset.c           |  29 ++++----
>  5 files changed, 209 insertions(+), 161 deletions(-)

Some outlier results from LULESH (Livermore Unstructured Lagrangian
Explicit Shock Hydrodynamics) [1] caught my eye. The following fix
made the benchmark a lot happier (128GB DRAM + Optane swap):
                          Before    After    Change
  Average (z/s)           6894      7574     +10%
  Deviation (10 samples)  12.96%    1.76%    -86%

[1] https://asc.llnl.gov/codes/proxy-apps/lulesh

Andrew, can you please fold it in? Thanks!

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 90bbc2b3be8b..5e03a61c894f 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -916,8 +916,7 @@ static enum folio_references folio_check_references(struct folio *folio,
 		if (!referenced_ptes)
 			return FOLIOREF_RECLAIM;

-		lru_gen_set_refs(folio);
-		return FOLIOREF_ACTIVATE;
+		return lru_gen_set_refs(folio) ? FOLIOREF_ACTIVATE : FOLIOREF_KEEP;
 	}

 	referenced_folio = folio_test_clear_referenced(folio);
@@ -4173,11 +4172,7 @@ bool lru_gen_look_around(struct page_vma_mapped_walk *pvmw)
 			old_gen = folio_update_gen(folio, new_gen);
 			if (old_gen >= 0 && old_gen != new_gen)
 				update_batch_size(walk, folio, old_gen, new_gen);
-
-			continue;
-		}
-
-		if (lru_gen_set_refs(folio)) {
+		} else if (lru_gen_set_refs(folio)) {
 			old_gen = folio_lru_gen(folio);
 			if (old_gen >= 0 && old_gen != new_gen)
 				folio_activate(folio);
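
For anyone following along, here is a rough user-space sketch of what the first hunk changes. It assumes, per the commit message, that lru_gen_set_refs() only sets PG_referenced the first time it sees a folio and sets PG_workingset on later calls, returning whether it did the latter. Every model_* name, the MODEL_* bit positions and the enum values are made up for illustration and are not the kernel's definitions; the model also leaves the referenced bit set for simplicity.

```c
#include <stdbool.h>

/* Hypothetical bit positions; the kernel's real PG_* values differ. */
#define MODEL_PG_REFERENCED	(1u << 0)
#define MODEL_PG_WORKINGSET	(1u << 1)

/* Stand-ins for FOLIOREF_RECLAIM, FOLIOREF_KEEP and FOLIOREF_ACTIVATE. */
enum model_folioref {
	MODEL_FOLIOREF_RECLAIM,
	MODEL_FOLIOREF_KEEP,
	MODEL_FOLIOREF_ACTIVATE,
};

/* Stand-in for struct folio; only the flags word is modeled. */
struct model_folio {
	unsigned int flags;
};

/*
 * Assumed behavior of the page-table case: the first call records the
 * reference and returns false; any later call marks the folio as part
 * of the working set and returns true.
 */
static bool model_set_refs(struct model_folio *folio)
{
	if (!(folio->flags & (MODEL_PG_REFERENCED | MODEL_PG_WORKINGSET))) {
		folio->flags |= MODEL_PG_REFERENCED;
		return false;
	}

	folio->flags |= MODEL_PG_WORKINGSET;
	return true;
}

/*
 * Mirrors the fixed return statement: activate only when the helper
 * reports a repeated reference, otherwise keep the folio where it is.
 */
static enum model_folioref model_check_references(struct model_folio *folio,
						  int referenced_ptes)
{
	if (!referenced_ptes)
		return MODEL_FOLIOREF_RECLAIM;

	return model_set_refs(folio) ? MODEL_FOLIOREF_ACTIVATE
				     : MODEL_FOLIOREF_KEEP;
}
```

In this model, a zero-initialized folio that is found referenced on two consecutive scans gets MODEL_FOLIOREF_KEEP on the first scan and MODEL_FOLIOREF_ACTIVATE on the second, whereas the pre-fix code would have activated it on the first.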