Re: [PATCH v2] mm/mglru: use folio_mark_accessed to replace folio_set_active

From: Barry Song

Date: Mon May 25 2026 - 18:13:58 EST


On Tue, May 26, 2026 at 1:58 AM Kairui Song <ryncsn@xxxxxxxxx> wrote:
>
> On Mon, May 25, 2026 at 8:32 PM Barry Song (Xiaomi) <baohua@xxxxxxxxxx> wrote:
> >
> > MGLRU gives high priority to folios mapped in page tables. As a result,
> > folio_set_active() is invoked for all folios read during page faults. In
> > practice, however, readahead can bring in many folios that are never
> > accessed via page tables.
> >
> > A previous attempt by Lei Liu proposed introducing a separate LRU for
> > readahead[1] to make readahead pages easier to reclaim, but that approach
> > is likely over-engineered.
> >
> > Before commit 4d5d14a01e2c ("mm/mglru: rework workingset protection"),
> > folios with PG_active were always placed in the youngest generation,
> > leading to over-protection and increased refaults. After that commit,
> > PG_active folios are placed in the second youngest generation, which is
> > still too optimistic given the presence of readahead. In contrast, the
> > classic active/inactive scheme is more conservative.
> >
> > This patch switches to using folio_mark_accessed(). If
> > folio_check_references() later detects referenced PTEs, the folio
> > will be promoted based on the reference flag set by
> > folio_mark_accessed(). We should also adjust
> > WORKINGSET_ACTIVATE and lru_gen_folio_seq(); for example, we should
> > not unconditionally protect anon folios accordingly.
> >
> > The following uses a simple model to demonstrate why the current code is
> > not ideal. It runs fio-3.42 in a memcg, reading a file in a strided
> > pattern—4KB every 64KB—to simulate prefaulted pages that may not be
> > accessed.
>
> Thanks so much for the update! This looks much better than V1 RFC.
>
> > Without the patch, we observed 12883855 file refaults and a very low
> > bandwidth of 58.5 MiB/s, because prefaulted but unused pages occupy hot
> > positions, continuously pushing out the real working set and causing
> > incorrect reclaim. With the patch, we observed 0 refaults and bandwidth
> > increased to 5078 MiB/s.
> >
> > Running the same test on x86:
> > w/o patch:
> > File Refault Delta is 3240029
> > READ: bw=13.2MiB/s io=1183MiB
> >
> > w/ patch:
> > File Refault Delta is 0
> > READ: bw=7708MiB/s io=676GiB
>
> Really nice result here.
>
> > diff --git a/include/linux/mm_inline.h b/include/linux/mm_inline.h
> > index a171070e15f0..c637e679a450 100644
> > --- a/include/linux/mm_inline.h
> > +++ b/include/linux/mm_inline.h
> > @@ -242,12 +242,11 @@ static inline unsigned long lru_gen_folio_seq(const struct lruvec *lruvec,
> > gen = MIN_NR_GENS - folio_test_workingset(folio);
> > else if (reclaiming)
> > gen = MAX_NR_GENS;
> > - else if ((!folio_is_file_lru(folio) && !folio_test_swapcache(folio)) ||
> > - (folio_test_reclaim(folio) &&
> > - (folio_test_dirty(folio) || folio_test_writeback(folio))))
> > + else if (folio_test_reclaim(folio) &&
> > + (folio_test_dirty(folio) || folio_test_writeback(folio)))
>
> This change is more than just "use folio_mark_accessed to replace
> folio_set_active", you are changing the pattern for all anon folios
> here? But I think we are mostly working with file folios here
> especially in the first test you posted?
>
> I do like your new if condition though, but maybe it's not the right
> time, or better add some words about it, how anon are handled now.

Right. For anon folios, the generation is currently hardcoded to
MIN_NR_GENS, so skipping folio_set_active will not help anon pages.
It is hardcoded regardless of the current anon folio state.
This change actually aligns anon behavior with file folios. Otherwise,
skipping folio_set_active() would have no effect on anon folios.

Maybe I can change the subject to " mm/mglru: avoid over-promoting
PF folios" and add some words in v3.

>
> > gen = MIN_NR_GENS;
> > else
> > - gen = MAX_NR_GENS - folio_test_workingset(folio);
> > + gen = MAX_NR_GENS - folio_test_workingset(folio) || folio_test_referenced(folio);
>
> Nice, this aligns exactly with what I want here :), see lru_gen_folio_seq:
> https://lore.kernel.org/linux-mm/20260502-mglru-fg-v1-6-913619b014d9@xxxxxxxxxxx/
>
> But I think you got the precedence wrong? Did you mean:
> (MAX_NR_GENS - folio_test_workingset(folio)) || folio_test_referenced(folio)
>
> Or?
> MAX_NR_GENS - (folio_test_workingset(folio) || folio_test_referenced(folio))

I mean this one. Sorry for the confusion. will fix it in v3.

>
> >
> > return max(READ_ONCE(lrugen->max_seq) - gen + 1, READ_ONCE(lrugen->min_seq[type]));
> > }
> > diff --git a/mm/swap.c b/mm/swap.c
> > index 5cc44f0de987..f320f93d60df 100644
> > --- a/mm/swap.c
> > +++ b/mm/swap.c
> > @@ -511,8 +511,12 @@ void folio_add_lru(struct folio *folio)
> >
> > /* see the comment in lru_gen_folio_seq() */
> > if (lru_gen_enabled() && !folio_test_unevictable(folio) &&
> > - lru_gen_in_fault() && !(current->flags & PF_MEMALLOC))
> > - folio_set_active(folio);
> > + lru_gen_in_fault() && !(current->flags & PF_MEMALLOC)) {
> > + if (folio_test_workingset(folio))
> > + folio_set_active(folio);
>
> What if we also move this set_active to mm/workingset.c? I think that
> the only place where a folio with PG_workingset can get here?

I agree that folio_set_active() is duplicated between
mm/workingset.c and here. Actually, I should have removed
folio_set_active(folio) from mm/workingset.c to centralize the
active/referenced state handling in folio_add_lru(). But somehow I
forgot to do that before sending the patch.

Maybe we should just centralize it for now; that seems better. We
can reconsider it later together with your larger follow-up work.

>
> > + else if (!folio_test_referenced(folio))
> > + folio_mark_accessed(folio);
> > + }
>
> Same here? Also this folio_mark_accessed only means
> folio_set_referenced here, which matches the folio_test_referenced
> check you just added in lru_gen_folio_seq right?

Right. With folio_test_referenced(), we place folios into the second
oldest generation. Otherwise, putting them into the oldest generation
seems too aggressive for scanning. But putting them into the active
generation also seems too aggressive for promotion.

>
> >
> > folio_batch_add_and_move(folio, lru_add);
> > }
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index e452cb043d46..48355f10542b 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -848,7 +848,8 @@ static bool lru_gen_set_refs(struct folio *folio)
> > return false;
> > }
> >
> > - set_mask_bits(&folio->flags.f, LRU_REFS_FLAGS, BIT(PG_workingset));
> > + if (folio_test_active(folio))
> > + set_mask_bits(&folio->flags.f, LRU_REFS_FLAGS, BIT(PG_workingset));
>
> This looks a bit strange, PG_active is always unset for on LRU folios
> for MGLRU right?
>
> There is even a comment for that:
> "PG_active is always cleared while a folio is on one of lrugen->folios[]"

There still seem to be some cases where this flag remains set after
I added the printk.

@@ -848,8 +848,10 @@ static bool lru_gen_set_refs(struct folio *folio)
return false;
}

- if (folio_test_active(folio))
+ if (folio_test_active(folio)) {
+ pr_err("%s folio:%lx\n", __func__, folio);
set_mask_bits(&folio->flags.f, LRU_REFS_FLAGS,
BIT(PG_workingset));
+ }
return true;
}

[ 78.928932] vmscan: lru_gen_set_refs folio:fffff645c58d99c0
[ 81.198056] vmscan: lru_gen_set_refs folio:fffff645c791a200
[ 94.071659] vmscan: lru_gen_set_refs folio:fffff645c773c8c0
[ 94.116162] vmscan: lru_gen_set_refs folio:fffff645c4f02300
[ 110.215317] vmscan: lru_gen_set_refs folio:fffff645c7c8ea40
[ 118.679921] vmscan: lru_gen_set_refs folio:fffff645c5727840
[ 135.709167] vmscan: lru_gen_set_refs folio:fffff645c51d2ac0
[ 139.118206] vmscan: lru_gen_set_refs folio:fffff645c7749fc0

But you are generally right. Those active flags are probably set after folios
are added to the LRU.

>
> And lru_gen_set_refs is mostly for on-LRU folios or isolated ones,
> either would have this flag. So PG_workingset seems broken by this
> point, rmap walk no longer get PG_workingset set, which removes the
> "promote on 2nd access" mechanism entirely?

Yes. I was trying to preserve the “promote on second access”
behavior, so I used the “active” flag. But it is somewhat
misused.

Before this patch:
1st scan sets the referenced flag;
2nd scan promotes based on seeing the referenced flag set by the first access.

Now the flow is:
1st scan already observes the referenced flag;
2nd scan tries to rely on the active flag. However, as you pointed
out, the active flag
is cleared after being set, so it can’t reliably serve this purpose.

Maybe I can add a ref and only set workingset when both PG_referenced and
that ref are present.

@@ -848,8 +848,11 @@ static bool lru_gen_set_refs(struct folio *folio)
return false;
}

- if (folio_test_active(folio))
+ /* Promote on second access */
+ if (folio_lru_refs(folio) > 1)
set_mask_bits(&folio->flags.f, LRU_REFS_FLAGS,
BIT(PG_workingset));
+ else
+ folio_mark_accessed(folio);
return true;
}

It’s a bit ugly, but I can’t find another flag for it. Do you have a
better idea?

>
> > return true;
> > }
> > #else
> > diff --git a/mm/workingset.c b/mm/workingset.c
> > index 07e6836d0502..2f0c08aa8623 100644
> > --- a/mm/workingset.c
> > +++ b/mm/workingset.c
> > @@ -319,11 +319,11 @@ static void lru_gen_refault(struct folio *folio, void *shadow)
> >
> > atomic_long_add(delta, &lrugen->refaulted[hist][type][tier]);
> >
> > - /* see folio_add_lru() where folio_set_active() will be called */
> > - if (lru_gen_in_fault())
> > - mod_lruvec_state(lruvec, WORKINGSET_ACTIVATE_BASE + type, delta);
> > -
> > if (workingset) {
> > + if (lru_gen_in_fault()) {
> > + folio_set_active(folio);
> > + mod_lruvec_state(lruvec, WORKINGSET_ACTIVATE_BASE + type, delta);
> > + }
>
> Overall I agree with this directly that file folios shouldn't get over
> protected. It will be even better if we actually activate the folio
> using things like refault distance... Which can be done later.
>
> BTW, this would break reactivation if we age periodically. That is
> less of a problem compared to overprotection, I think.
>
> And do you think we can keep a folio_set_active if folio_test_anon ==
> true in swap.c? Proactive activation relies on the assumption that
> applications never expect an jitter for ordinary memory access, which
> holds true for anon folios, but file folios don't follow that idea.

I’m not entirely sure, but I see that removing folio_set_active()
could also help reduce anon refaults. Also, having two separate
branches for file and anon seems to make the code a bit ugly.

>
> And I'm not trying to seize the topic but my previous experiment using
> refault distance-based reactivation resulted in a 500% performance
> gain for some databases with no other regressions observed:
> https://lwn.net/Articles/945266/

Yep, that is really great!

I'd be happy to review and work on your refault
distance-based reactivation when I have some time.

>
> So for this patch I think it's on the right track, we can definitely
> merge some follow up of this patch first, and then refine the pattern
> step by step toward refault distance, MGLRU-FG or both plus others.
> That will take a long time, so let's fix the low-hanging fruit first.
>
> Really looking forward to a few more updates from this patch :) !

Thanks very much for your review.

Best Regards
Barry