Re: [PATCH mm-unstable v4 3/7] mm/mglru: rework aging feedback

From: Kairui Song
Date: Wed Jan 15 2025 - 12:57:05 EST


On Mon, Jan 13, 2025 at 2:51 PM Yu Zhao <yuzhao@xxxxxxxxxx> wrote:
>
> On Wed, Jan 08, 2025 at 01:14:58AM +0800, Kairui Song wrote:
> > On Tue, Dec 31, 2024 at 12:36 PM Yu Zhao <yuzhao@xxxxxxxxxx> wrote:
> >
> > Hi Yu,
> >
> > >
> > > The aging feedback is based on both the number of generations and the
> > > distribution of folios in each generation. The number of generations
> > > is currently the distance between max_seq and anon min_seq. This is
> > > because anon min_seq is not allowed to move past file min_seq. The
> > > rationale for that is that file is always evictable whereas anon is
> > > not. However, for use cases where anon is a lot cheaper than file:
> > > 1. Anon in the second oldest generation can be a better choice than
> > > file in the oldest generation.
> > > 2. A large amount of file in the oldest generation can skew the
> > > distribution, making should_run_aging() return false negative.
> > >
> > > Allow anon and file min_seq to move independently, and use solely the
> > > number of generations as the feedback for aging. Specifically, when
> > > both anon and file are evictable, anon min_seq can now be greater than
> > > file min_seq, and therefore the number of generations becomes the
> > > distance between max_seq and min(min_seq[0],min_seq[1]). And
> > > should_run_aging() returns true if and only if the number of
> > > generations is less than MAX_NR_GENS.
> > >
> > > As the first step to the final optimization, this change by itself
> > > should not have userspace-visiable effects beyond performance. The
> > > next twos patch will take advantage of this change; the last patch in
> > > this series will better distribute folios across MAX_NR_GENS.
> > >
> > > Reported-by: David Stevens <stevensd@xxxxxxxxxxxx>
> > > Signed-off-by: Yu Zhao <yuzhao@xxxxxxxxxx>
> > > Tested-by: Kalesh Singh <kaleshsingh@xxxxxxxxxx>
> > > ---
> > > include/linux/mmzone.h | 17 ++--
> > > mm/vmscan.c | 200 ++++++++++++++++++-----------------------
> > > 2 files changed, 96 insertions(+), 121 deletions(-)
> > >
> > > diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> > > index b36124145a16..8245ecb0400b 100644
> > > --- a/include/linux/mmzone.h
> > > +++ b/include/linux/mmzone.h
> > > @@ -421,12 +421,11 @@ enum {
> > > /*
> > > * The youngest generation number is stored in max_seq for both anon and file
> > > * types as they are aged on an equal footing. The oldest generation numbers are
> > > - * stored in min_seq[] separately for anon and file types as clean file pages
> > > - * can be evicted regardless of swap constraints.
> > > - *
> > > - * Normally anon and file min_seq are in sync. But if swapping is constrained,
> > > - * e.g., out of swap space, file min_seq is allowed to advance and leave anon
> > > - * min_seq behind.
> > > + * stored in min_seq[] separately for anon and file types so that they can be
> > > + * incremented independently. Ideally min_seq[] are kept in sync when both anon
> > > + * and file types are evictable. However, to adapt to situations like extreme
> > > + * swappiness, they are allowed to be out of sync by at most
> > > + * MAX_NR_GENS-MIN_NR_GENS-1.
> > > *
> > > * The number of pages in each generation is eventually consistent and therefore
> > > * can be transiently negative when reset_batch_size() is pending.
> > > @@ -446,8 +445,8 @@ struct lru_gen_folio {
> > > unsigned long avg_refaulted[ANON_AND_FILE][MAX_NR_TIERS];
> > > /* the exponential moving average of evicted+protected */
> > > unsigned long avg_total[ANON_AND_FILE][MAX_NR_TIERS];
> > > - /* the first tier doesn't need protection, hence the minus one */
> > > - unsigned long protected[NR_HIST_GENS][ANON_AND_FILE][MAX_NR_TIERS - 1];
> > > + /* can only be modified under the LRU lock */
> > > + unsigned long protected[NR_HIST_GENS][ANON_AND_FILE][MAX_NR_TIERS];
> > > /* can be modified without holding the LRU lock */
> > > atomic_long_t evicted[NR_HIST_GENS][ANON_AND_FILE][MAX_NR_TIERS];
> > > atomic_long_t refaulted[NR_HIST_GENS][ANON_AND_FILE][MAX_NR_TIERS];
> > > @@ -498,7 +497,7 @@ struct lru_gen_mm_walk {
> > > int mm_stats[NR_MM_STATS];
> > > /* total batched items */
> > > int batched;
> > > - bool can_swap;
> > > + int swappiness;
> > > bool force_scan;
> > > };
> > >
> > > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > > index f236db86de8a..f767e3d34e73 100644
> > > --- a/mm/vmscan.c
> > > +++ b/mm/vmscan.c
> > > @@ -2627,11 +2627,17 @@ static bool should_clear_pmd_young(void)
> > > READ_ONCE((lruvec)->lrugen.min_seq[LRU_GEN_FILE]), \
> > > }
> > >
> > > +#define evictable_min_seq(min_seq, swappiness) \
> > > + min((min_seq)[!(swappiness)], (min_seq)[(swappiness) != MAX_SWAPPINESS])
> > > +
> > > #define for_each_gen_type_zone(gen, type, zone) \
> > > for ((gen) = 0; (gen) < MAX_NR_GENS; (gen)++) \
> > > for ((type) = 0; (type) < ANON_AND_FILE; (type)++) \
> > > for ((zone) = 0; (zone) < MAX_NR_ZONES; (zone)++)
> > >
> > > +#define for_each_evictable_type(type, swappiness) \
> > > + for ((type) = !(swappiness); (type) <= ((swappiness) != MAX_SWAPPINESS); (type)++)
> > > +
> > > #define get_memcg_gen(seq) ((seq) % MEMCG_NR_GENS)
> > > #define get_memcg_bin(bin) ((bin) % MEMCG_NR_BINS)
> > >
> > > @@ -2677,10 +2683,16 @@ static int get_nr_gens(struct lruvec *lruvec, int type)
> > >
> > > static bool __maybe_unused seq_is_valid(struct lruvec *lruvec)
> > > {
> > > - /* see the comment on lru_gen_folio */
> > > - return get_nr_gens(lruvec, LRU_GEN_FILE) >= MIN_NR_GENS &&
> > > - get_nr_gens(lruvec, LRU_GEN_FILE) <= get_nr_gens(lruvec, LRU_GEN_ANON) &&
> > > - get_nr_gens(lruvec, LRU_GEN_ANON) <= MAX_NR_GENS;
> > > + int type;
> > > +
> > > + for (type = 0; type < ANON_AND_FILE; type++) {
> > > + int n = get_nr_gens(lruvec, type);
> > > +
> > > + if (n < MIN_NR_GENS || n > MAX_NR_GENS)
> > > + return false;
> > > + }
> > > +
> > > + return true;
> > > }
> > >
> > > /******************************************************************************
> > > @@ -3087,9 +3099,8 @@ static void read_ctrl_pos(struct lruvec *lruvec, int type, int tier, int gain,
> > > pos->refaulted = lrugen->avg_refaulted[type][tier] +
> > > atomic_long_read(&lrugen->refaulted[hist][type][tier]);
> > > pos->total = lrugen->avg_total[type][tier] +
> > > + lrugen->protected[hist][type][tier] +
> > > atomic_long_read(&lrugen->evicted[hist][type][tier]);
> > > - if (tier)
> > > - pos->total += lrugen->protected[hist][type][tier - 1];
> > > pos->gain = gain;
> > > }
> > >
> > > @@ -3116,17 +3127,15 @@ static void reset_ctrl_pos(struct lruvec *lruvec, int type, bool carryover)
> > > WRITE_ONCE(lrugen->avg_refaulted[type][tier], sum / 2);
> > >
> > > sum = lrugen->avg_total[type][tier] +
> > > + lrugen->protected[hist][type][tier] +
> > > atomic_long_read(&lrugen->evicted[hist][type][tier]);
> > > - if (tier)
> > > - sum += lrugen->protected[hist][type][tier - 1];
> > > WRITE_ONCE(lrugen->avg_total[type][tier], sum / 2);
> > > }
> > >
> > > if (clear) {
> > > atomic_long_set(&lrugen->refaulted[hist][type][tier], 0);
> > > atomic_long_set(&lrugen->evicted[hist][type][tier], 0);
> > > - if (tier)
> > > - WRITE_ONCE(lrugen->protected[hist][type][tier - 1], 0);
> > > + WRITE_ONCE(lrugen->protected[hist][type][tier], 0);
> > > }
> > > }
> > > }
> > > @@ -3261,7 +3270,7 @@ static int should_skip_vma(unsigned long start, unsigned long end, struct mm_wal
> > > return true;
> > >
> > > if (vma_is_anonymous(vma))
> > > - return !walk->can_swap;
> > > + return !walk->swappiness;
> > >
> > > if (WARN_ON_ONCE(!vma->vm_file || !vma->vm_file->f_mapping))
> > > return true;
> > > @@ -3271,7 +3280,10 @@ static int should_skip_vma(unsigned long start, unsigned long end, struct mm_wal
> > > return true;
> > >
> > > if (shmem_mapping(mapping))
> > > - return !walk->can_swap;
> > > + return !walk->swappiness;
> > > +
> > > + if (walk->swappiness == MAX_SWAPPINESS)
> > > + return true;
> > >
> > > /* to exclude special mappings like dax, etc. */
> > > return !mapping->a_ops->read_folio;
> > > @@ -3359,7 +3371,7 @@ static unsigned long get_pmd_pfn(pmd_t pmd, struct vm_area_struct *vma, unsigned
> > > }
> > >
> > > static struct folio *get_pfn_folio(unsigned long pfn, struct mem_cgroup *memcg,
> > > - struct pglist_data *pgdat, bool can_swap)
> > > + struct pglist_data *pgdat)
> > > {
> > > struct folio *folio;
> > >
> > > @@ -3370,10 +3382,6 @@ static struct folio *get_pfn_folio(unsigned long pfn, struct mem_cgroup *memcg,
> > > if (folio_memcg(folio) != memcg)
> > > return NULL;
> > >
> > > - /* file VMAs can contain anon pages from COW */
> > > - if (!folio_is_file_lru(folio) && !can_swap)
> > > - return NULL;
> > > -
> > > return folio;
> > > }
> > >
> > > @@ -3429,7 +3437,7 @@ static bool walk_pte_range(pmd_t *pmd, unsigned long start, unsigned long end,
> > > if (pfn == -1)
> > > continue;
> > >
> > > - folio = get_pfn_folio(pfn, memcg, pgdat, walk->can_swap);
> > > + folio = get_pfn_folio(pfn, memcg, pgdat);
> > > if (!folio)
> > > continue;
> > >
> > > @@ -3514,7 +3522,7 @@ static void walk_pmd_range_locked(pud_t *pud, unsigned long addr, struct vm_area
> > > if (pfn == -1)
> > > goto next;
> > >
> > > - folio = get_pfn_folio(pfn, memcg, pgdat, walk->can_swap);
> > > + folio = get_pfn_folio(pfn, memcg, pgdat);
> > > if (!folio)
> > > goto next;
> > >
> > > @@ -3726,22 +3734,26 @@ static void clear_mm_walk(void)
> > > kfree(walk);
> > > }
> > >
> > > -static bool inc_min_seq(struct lruvec *lruvec, int type, bool can_swap)
> > > +static bool inc_min_seq(struct lruvec *lruvec, int type, int swappiness)
> > > {
> > > int zone;
> > > int remaining = MAX_LRU_BATCH;
> > > struct lru_gen_folio *lrugen = &lruvec->lrugen;
> > > + int hist = lru_hist_from_seq(lrugen->min_seq[type]);
> > > int new_gen, old_gen = lru_gen_from_seq(lrugen->min_seq[type]);
> > >
> > > - if (type == LRU_GEN_ANON && !can_swap)
> > > + if (type ? swappiness == MAX_SWAPPINESS : !swappiness)
> > > goto done;
> > >
> > > - /* prevent cold/hot inversion if force_scan is true */
> > > + /* prevent cold/hot inversion if the type is evictable */
> > > for (zone = 0; zone < MAX_NR_ZONES; zone++) {
> > > struct list_head *head = &lrugen->folios[old_gen][type][zone];
> > >
> > > while (!list_empty(head)) {
> > > struct folio *folio = lru_to_folio(head);
> > > + int refs = folio_lru_refs(folio);
> > > + int tier = lru_tier_from_refs(refs);
> > > + int delta = folio_nr_pages(folio);
> > >
> > > VM_WARN_ON_ONCE_FOLIO(folio_test_unevictable(folio), folio);
> > > VM_WARN_ON_ONCE_FOLIO(folio_test_active(folio), folio);
> > > @@ -3751,6 +3763,9 @@ static bool inc_min_seq(struct lruvec *lruvec, int type, bool can_swap)
> > > new_gen = folio_inc_gen(lruvec, folio, false);
> > > list_move_tail(&folio->lru, &lrugen->folios[new_gen][type][zone]);
> > >
> > > + WRITE_ONCE(lrugen->protected[hist][type][tier],
> > > + lrugen->protected[hist][type][tier] + delta);
> > > +
> > > if (!--remaining)
> > > return false;
> > > }
> > > @@ -3762,7 +3777,7 @@ static bool inc_min_seq(struct lruvec *lruvec, int type, bool can_swap)
> > > return true;
> > > }
> > >
> > > -static bool try_to_inc_min_seq(struct lruvec *lruvec, bool can_swap)
> > > +static bool try_to_inc_min_seq(struct lruvec *lruvec, int swappiness)
> > > {
> > > int gen, type, zone;
> > > bool success = false;
> > > @@ -3772,7 +3787,7 @@ static bool try_to_inc_min_seq(struct lruvec *lruvec, bool can_swap)
> > > VM_WARN_ON_ONCE(!seq_is_valid(lruvec));
> > >
> > > /* find the oldest populated generation */
> > > - for (type = !can_swap; type < ANON_AND_FILE; type++) {
> > > + for_each_evictable_type(type, swappiness) {
> > > while (min_seq[type] + MIN_NR_GENS <= lrugen->max_seq) {
> > > gen = lru_gen_from_seq(min_seq[type]);
> > >
> > > @@ -3788,13 +3803,17 @@ static bool try_to_inc_min_seq(struct lruvec *lruvec, bool can_swap)
> > > }
> > >
> > > /* see the comment on lru_gen_folio */
> > > - if (can_swap) {
> > > - min_seq[LRU_GEN_ANON] = min(min_seq[LRU_GEN_ANON], min_seq[LRU_GEN_FILE]);
> > > - min_seq[LRU_GEN_FILE] = max(min_seq[LRU_GEN_ANON], lrugen->min_seq[LRU_GEN_FILE]);
> > > + if (swappiness && swappiness != MAX_SWAPPINESS) {
> > > + unsigned long seq = lrugen->max_seq - MIN_NR_GENS;
> > > +
> > > + if (min_seq[LRU_GEN_ANON] > seq && min_seq[LRU_GEN_FILE] < seq)
> > > + min_seq[LRU_GEN_ANON] = seq;
> > > + else if (min_seq[LRU_GEN_FILE] > seq && min_seq[LRU_GEN_ANON] < seq)
> > > + min_seq[LRU_GEN_FILE] = seq;
> > > }
> > >
> > > - for (type = !can_swap; type < ANON_AND_FILE; type++) {
> > > - if (min_seq[type] == lrugen->min_seq[type])
> > > + for_each_evictable_type(type, swappiness) {
> > > + if (min_seq[type] <= lrugen->min_seq[type])
> > > continue;
> > >
> > > reset_ctrl_pos(lruvec, type, true);
> > > @@ -3805,8 +3824,7 @@ static bool try_to_inc_min_seq(struct lruvec *lruvec, bool can_swap)
> > > return success;
> > > }
> > >
> > > -static bool inc_max_seq(struct lruvec *lruvec, unsigned long seq,
> > > - bool can_swap, bool force_scan)
> > > +static bool inc_max_seq(struct lruvec *lruvec, unsigned long seq, int swappiness)
> > > {
> > > bool success;
> > > int prev, next;
> > > @@ -3824,13 +3842,11 @@ static bool inc_max_seq(struct lruvec *lruvec, unsigned long seq,
> > > if (!success)
> > > goto unlock;
> > >
> > > - for (type = ANON_AND_FILE - 1; type >= 0; type--) {
> > > + for (type = 0; type < ANON_AND_FILE; type++) {
> > > if (get_nr_gens(lruvec, type) != MAX_NR_GENS)
> > > continue;
> > >
> > > - VM_WARN_ON_ONCE(!force_scan && (type == LRU_GEN_FILE || can_swap));
> > > -
> > > - if (inc_min_seq(lruvec, type, can_swap))
> > > + if (inc_min_seq(lruvec, type, swappiness))
> > > continue;
> > >
> > > spin_unlock_irq(&lruvec->lru_lock);
> > > @@ -3874,7 +3890,7 @@ static bool inc_max_seq(struct lruvec *lruvec, unsigned long seq,
> > > }
> > >
> > > static bool try_to_inc_max_seq(struct lruvec *lruvec, unsigned long seq,
> > > - bool can_swap, bool force_scan)
> > > + int swappiness, bool force_scan)
> > > {
> > > bool success;
> > > struct lru_gen_mm_walk *walk;
> > > @@ -3885,7 +3901,7 @@ static bool try_to_inc_max_seq(struct lruvec *lruvec, unsigned long seq,
> > > VM_WARN_ON_ONCE(seq > READ_ONCE(lrugen->max_seq));
> > >
> > > if (!mm_state)
> > > - return inc_max_seq(lruvec, seq, can_swap, force_scan);
> > > + return inc_max_seq(lruvec, seq, swappiness);
> > >
> > > /* see the comment in iterate_mm_list() */
> > > if (seq <= READ_ONCE(mm_state->seq))
> > > @@ -3910,7 +3926,7 @@ static bool try_to_inc_max_seq(struct lruvec *lruvec, unsigned long seq,
> > >
> > > walk->lruvec = lruvec;
> > > walk->seq = seq;
> > > - walk->can_swap = can_swap;
> > > + walk->swappiness = swappiness;
> > > walk->force_scan = force_scan;
> > >
> > > do {
> > > @@ -3920,7 +3936,7 @@ static bool try_to_inc_max_seq(struct lruvec *lruvec, unsigned long seq,
> > > } while (mm);
> > > done:
> > > if (success) {
> > > - success = inc_max_seq(lruvec, seq, can_swap, force_scan);
> > > + success = inc_max_seq(lruvec, seq, swappiness);
> > > WARN_ON_ONCE(!success);
> > > }
> > >
> > > @@ -3961,13 +3977,13 @@ static bool lruvec_is_sizable(struct lruvec *lruvec, struct scan_control *sc)
> > > {
> > > int gen, type, zone;
> > > unsigned long total = 0;
> > > - bool can_swap = get_swappiness(lruvec, sc);
> > > + int swappiness = get_swappiness(lruvec, sc);
> > > struct lru_gen_folio *lrugen = &lruvec->lrugen;
> > > struct mem_cgroup *memcg = lruvec_memcg(lruvec);
> > > DEFINE_MAX_SEQ(lruvec);
> > > DEFINE_MIN_SEQ(lruvec);
> > >
> > > - for (type = !can_swap; type < ANON_AND_FILE; type++) {
> > > + for_each_evictable_type(type, swappiness) {
> > > unsigned long seq;
> > >
> > > for (seq = min_seq[type]; seq <= max_seq; seq++) {
> > > @@ -3987,6 +4003,7 @@ static bool lruvec_is_reclaimable(struct lruvec *lruvec, struct scan_control *sc
> > > {
> > > int gen;
> > > unsigned long birth;
> > > + int swappiness = get_swappiness(lruvec, sc);
> > > struct mem_cgroup *memcg = lruvec_memcg(lruvec);
> > > DEFINE_MIN_SEQ(lruvec);
> > >
> > > @@ -3996,8 +4013,7 @@ static bool lruvec_is_reclaimable(struct lruvec *lruvec, struct scan_control *sc
> > > if (!lruvec_is_sizable(lruvec, sc))
> > > return false;
> > >
> > > - /* see the comment on lru_gen_folio */
> > > - gen = lru_gen_from_seq(min_seq[LRU_GEN_FILE]);
> > > + gen = lru_gen_from_seq(evictable_min_seq(min_seq, swappiness));
> > > birth = READ_ONCE(lruvec->lrugen.timestamps[gen]);
> > >
> > > return time_is_before_jiffies(birth + min_ttl);
> > > @@ -4064,7 +4080,6 @@ bool lru_gen_look_around(struct page_vma_mapped_walk *pvmw)
> > > unsigned long addr = pvmw->address;
> > > struct vm_area_struct *vma = pvmw->vma;
> > > struct folio *folio = pfn_folio(pvmw->pfn);
> > > - bool can_swap = !folio_is_file_lru(folio);
> > > struct mem_cgroup *memcg = folio_memcg(folio);
> > > struct pglist_data *pgdat = folio_pgdat(folio);
> > > struct lruvec *lruvec = mem_cgroup_lruvec(memcg, pgdat);
> > > @@ -4117,7 +4132,7 @@ bool lru_gen_look_around(struct page_vma_mapped_walk *pvmw)
> > > if (pfn == -1)
> > > continue;
> > >
> > > - folio = get_pfn_folio(pfn, memcg, pgdat, can_swap);
> > > + folio = get_pfn_folio(pfn, memcg, pgdat);
> > > if (!folio)
> > > continue;
> > >
> > > @@ -4333,8 +4348,8 @@ static bool sort_folio(struct lruvec *lruvec, struct folio *folio, struct scan_c
> > > gen = folio_inc_gen(lruvec, folio, false);
> > > list_move_tail(&folio->lru, &lrugen->folios[gen][type][zone]);
> > >
> > > - WRITE_ONCE(lrugen->protected[hist][type][tier - 1],
> > > - lrugen->protected[hist][type][tier - 1] + delta);
> > > + WRITE_ONCE(lrugen->protected[hist][type][tier],
> > > + lrugen->protected[hist][type][tier] + delta);
> > > return true;
> > > }
> > >
> > > @@ -4533,7 +4548,6 @@ static int isolate_folios(struct lruvec *lruvec, struct scan_control *sc, int sw
> > > {
> > > int i;
> > > int type;
> > > - int scanned;
> > > int tier = -1;
> > > DEFINE_MIN_SEQ(lruvec);
> > >
> > > @@ -4558,21 +4572,23 @@ static int isolate_folios(struct lruvec *lruvec, struct scan_control *sc, int sw
> > > else
> > > type = get_type_to_scan(lruvec, swappiness, &tier);
> > >
> > > - for (i = !swappiness; i < ANON_AND_FILE; i++) {
> > > + for_each_evictable_type(i, swappiness) {
> >
> > Thanks for working on solving the reported issues, but one concern
> > about this for_each_evictable_type macro and its usage here.
> >
> > It basically forbids eviction of file pages with "swappiness == 200"
> > even for global pressure, this is a quite a change.
> >
> > For both active / inactive or MGLRU, max swappiness used to make
> > kernel try reclaim anon as much as possible, but still fall back to
> > file eviction. Forbidding file eviction may cause unsolvable OOM,
> > unlike anon pages, killing process won't necessarily release file
> > pages, so the system could hung easily.
> >
> > For existing systems with swappiness == 200 which were running fine
> > before, may also hit OOM very quickly.
>
> Do you know anyone actually uses 200? I only use 200 for testing but I
> can use 201 instead, since the debugfs interface isn't limited to [0,
> 200].
>

Thanks for the update patch.

Yes, I've seen some users using 200, especially with ZRAM/ZSWAP, so
the kernel prefers to keep page cache in memory when under pressure.
We also use 200 for some workloads.

We have an internal patch similar to your update, which allows using
201 for proactive reclaim, so proactive reclaim is able to only
compress pages in-RAM, to avoid increased IO due to page cache miss,
and it worked very well.