Re: [PATCH mm-unstable v2 6/6] mm/mglru: rework workingset protection

From: Yu Zhao
Date: Sat Dec 07 2024 - 14:10:28 EST


On Fri, Dec 6, 2024 at 9:44 PM Yu Zhao <yuzhao@xxxxxxxxxx> wrote:
>
> On Thu, Dec 05, 2024 at 05:31:26PM -0700, Yu Zhao wrote:
> > With the aging feedback no longer considering the distribution of
> > folios in each generation, rework workingset protection to better
> > distribute folios across MAX_NR_GENS. This is achieved by reusing
> > PG_workingset and PG_referenced/LRU_REFS_FLAGS in a slightly different
> > way.
> >
> > For folios accessed multiple times through file descriptors, make
> > lru_gen_inc_refs() set additional bits within LRU_REFS_WIDTH in
> > folio->flags after PG_referenced, and then PG_workingset once all
> > LRU_REFS_WIDTH bits are set. After all of these bits are set, i.e.,
> > LRU_REFS_FLAGS|BIT(PG_workingset), a folio is lazily promoted into
> > the second oldest generation in the eviction path. When
> > folio_inc_gen() does that, it clears LRU_REFS_FLAGS so that
> > lru_gen_inc_refs() can start over. In this case, LRU_REFS_MASK is
> > only valid while PG_referenced is set.
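> >
> > As a reading aid, the ladder above can be modeled in plain C (a
> > minimal sketch, not the kernel code: LRU_REFS_WIDTH is assumed to
> > be 2 here, and struct folio_model with the *_model() helpers are
> > made-up stand-ins for folio->flags, lru_gen_inc_refs() and
> > folio_inc_gen()):
> >
> >   #include <stdbool.h>
> >   #include <stdio.h>
> >
> >   #define LRU_REFS_WIDTH 2                      /* assumed width */
> >   #define LRU_REFS_MASK  ((1u << LRU_REFS_WIDTH) - 1)
> >
> >   struct folio_model {
> >       bool referenced;      /* models PG_referenced */
> >       bool workingset;      /* models PG_workingset */
> >       unsigned int refs;    /* models the LRU_REFS_MASK counter */
> >   };
> >
> >   /* Each access through a file descriptor sets the next bit:
> >    * PG_referenced first, then the counter, then PG_workingset. */
> >   static void inc_refs_model(struct folio_model *f)
> >   {
> >       if (!f->referenced)
> >           f->referenced = true;
> >       else if (f->refs < LRU_REFS_MASK)
> >           f->refs++;
> >       else
> >           f->workingset = true;  /* eligible for lazy promotion */
> >   }
> >
> >   /* On promotion, folio_inc_gen() clears LRU_REFS_FLAGS so the
> >    * ladder restarts, but PG_workingset stays set. */
> >   static void promote_model(struct folio_model *f)
> >   {
> >       f->referenced = false;
> >       f->refs = 0;
> >   }
> >
> >   int main(void)
> >   {
> >       struct folio_model f = { 0 };
> >
> >       for (int i = 0; i < 5; i++)
> >           inc_refs_model(&f);
> >       /* prints: referenced=1 refs=3 workingset=1 */
> >       printf("referenced=%d refs=%u workingset=%d\n",
> >              f.referenced, f.refs, f.workingset);
> >       promote_model(&f);
> >       return 0;
> >   }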
> >
> > For folios accessed multiple times through page tables,
> > folio_update_gen() from a page table walk or lru_gen_set_refs() from
> > an rmap walk sets PG_referenced after the accessed bit is cleared for
> > the first time. Thereafter, those two paths set PG_workingset and
> > promote folios to the youngest generation. Like folio_inc_gen(), when
> > folio_update_gen() does that, it also clears PG_referenced. In this
> > case, LRU_REFS_MASK is not used.
> >
> > In both cases, once PG_workingset is set on a folio, it remains
> > until the folio is either reclaimed or "deactivated" by
> > lru_gen_clear_refs(). It can be set again if lru_gen_test_recent()
> > returns true upon a refault.
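> >
> > The page-table side can be sketched the same way (again a toy model
> > with made-up names; pt_access_model() stands in for the
> > folio_update_gen()/lru_gen_set_refs() behavior described above, and
> > generations are numbered so that higher means younger):
> >
> >   #include <stdbool.h>
> >   #include <stdio.h>
> >
> >   #define MAX_NR_GENS 4    /* kernel default */
> >
> >   struct folio_pt_model {
> >       bool referenced;     /* models PG_referenced */
> >       bool workingset;     /* models PG_workingset */
> >       int gen;             /* 0 = oldest, MAX_NR_GENS - 1 = youngest */
> >   };
> >
> >   /* The first cleared-accessed-bit event only sets PG_referenced;
> >    * any later one sets PG_workingset, clears PG_referenced and
> >    * promotes the folio to the youngest generation. */
> >   static void pt_access_model(struct folio_pt_model *f)
> >   {
> >       if (!f->referenced && !f->workingset) {
> >           f->referenced = true;
> >           return;
> >       }
> >       f->workingset = true;
> >       f->referenced = false;     /* folio_update_gen() clears it */
> >       f->gen = MAX_NR_GENS - 1;  /* youngest generation */
> >   }
> >
> >   int main(void)
> >   {
> >       struct folio_pt_model f = { .gen = 1 };
> >
> >       pt_access_model(&f);  /* 1st access: PG_referenced only */
> >       pt_access_model(&f);  /* 2nd access: PG_workingset, promotion */
> >       /* prints: referenced=0 workingset=1 gen=3 */
> >       printf("referenced=%d workingset=%d gen=%d\n",
> >              f.referenced, f.workingset, f.gen);
> >       return 0;
> >   }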
> >
> > When adding folios to the LRU lists, lru_gen_distance() distributes
> > them as follows:
> > +---------------------------------+---------------------------------+
> > |    Accessed thru page tables    | Accessed thru file descriptors  |
> > +---------------------------------+---------------------------------+
> > | PG_active (set while isolated)  |                                 |
> > +----------------+----------------+----------------+----------------+
> > |  PG_workingset | PG_referenced  |  PG_workingset | LRU_REFS_FLAGS |
> > +---------------------------------+---------------------------------+
> > |<--------- MIN_NR_GENS --------->|                                 |
> > |<-------------------------- MAX_NR_GENS -------------------------->|
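> >
> > Mapping the table onto generation positions gives, with the default
> > MIN_NR_GENS=2 and MAX_NR_GENS=4, the following toy function (only a
> > reading aid; gen_from_tail_model() is a hypothetical illustration,
> > not the actual lru_gen_distance()):
> >
> >   #include <stdbool.h>
> >   #include <stdio.h>
> >
> >   #define MIN_NR_GENS 2
> >   #define MAX_NR_GENS 4
> >
> >   /* Returns how many generations stand between a folio and
> >    * eviction: 0 = oldest, MAX_NR_GENS - 1 = youngest. */
> >   static int gen_from_tail_model(bool thru_page_tables, bool workingset)
> >   {
> >       if (thru_page_tables)  /* the MIN_NR_GENS youngest generations */
> >           return workingset ? MAX_NR_GENS - 1 : MAX_NR_GENS - 2;
> >       /* file-descriptor accesses land in the older generations */
> >       return workingset ? 1 : 0;  /* second oldest : oldest */
> >   }
> >
> >   int main(void)
> >   {
> >       /* prints 3 2 1 0: the four columns, youngest to oldest */
> >       printf("%d %d %d %d\n",
> >              gen_from_tail_model(true, true),    /* PG_workingset  */
> >              gen_from_tail_model(true, false),   /* PG_referenced  */
> >              gen_from_tail_model(false, true),   /* PG_workingset  */
> >              gen_from_tail_model(false, false)); /* LRU_REFS_FLAGS */
> >       return 0;
> >   }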
> >
> > After this patch, some typical client and server workloads showed
> > improvements under heavy memory pressure. For example, Python TPC-C,
> > which was used to benchmark a different approach [1] to better detect
> > refault distances, showed a significant decrease in total refaults:
> >                             Before      After      Change
> >   Time (seconds)            10801       10801          0%
> >   Executed (transactions)   41472       43663         +5%
> >   workingset_nodes          109070      120244       +10%
> >   workingset_refault_anon   5019627     7281831      +45%
> >   workingset_refault_file   1294678786  554855564    -57%
> >   workingset_refault_total  1299698413  562137395    -57%
> >
> > [1] https://lore.kernel.org/20230920190244.16839-1-ryncsn@xxxxxxxxx/
> >
> > Reported-by: Kairui Song <kasong@xxxxxxxxxxx>
> > Closes: https://lore.kernel.org/CAOUHufahuWcKf5f1Sg3emnqX+cODuR=2TQo7T4Gr-QYLujn4RA@xxxxxxxxxxxxxx/
> > Signed-off-by: Yu Zhao <yuzhao@xxxxxxxxxx>
> > Tested-by: Kalesh Singh <kaleshsingh@xxxxxxxxxx>
> > ---
> >  include/linux/mm_inline.h |  94 +++++++++++++------------
> >  include/linux/mmzone.h    |  82 +++++++++++++---------
> >  mm/swap.c                 |  23 +++---
> >  mm/vmscan.c               | 142 +++++++++++++++++++++++---------------
> >  mm/workingset.c           |  29 ++++----
> >  5 files changed, 209 insertions(+), 161 deletions(-)
>
> Some outlier results from LULESH (Livermore Unstructured Lagrangian
> Explicit Shock Hydrodynamics) [1] caught my eye. The following fix
> made the benchmark a lot happier (128GB DRAM + Optane swap):
>                           Before   After   Change
>   Average (z/s)           6894     7574      +10%
>   Deviation (10 samples)  12.96%   1.76%     -86%
>
> [1] https://asc.llnl.gov/codes/proxy-apps/lulesh
>
> Andrew, can you please fold it in? Thanks!

Never mind. syzbot found another warning. So let me fix that and post v3.

> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 90bbc2b3be8b..5e03a61c894f 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -916,8 +916,7 @@ static enum folio_references folio_check_references(struct folio *folio,
>  		if (!referenced_ptes)
>  			return FOLIOREF_RECLAIM;
>  
> -		lru_gen_set_refs(folio);
> -		return FOLIOREF_ACTIVATE;
> +		return lru_gen_set_refs(folio) ? FOLIOREF_ACTIVATE : FOLIOREF_KEEP;
>  	}
>  
>  	referenced_folio = folio_test_clear_referenced(folio);
> @@ -4173,11 +4172,7 @@ bool lru_gen_look_around(struct page_vma_mapped_walk *pvmw)
>  			old_gen = folio_update_gen(folio, new_gen);
>  			if (old_gen >= 0 && old_gen != new_gen)
>  				update_batch_size(walk, folio, old_gen, new_gen);
> -
> -			continue;
> -		}
> -
> -		if (lru_gen_set_refs(folio)) {
> +		} else if (lru_gen_set_refs(folio)) {
>  			old_gen = folio_lru_gen(folio);
>  			if (old_gen >= 0 && old_gen != new_gen)
>  				folio_activate(folio);