Re: [PATCH RFC 08/15] mm, swap: store and check memcg info in the swap table

From: Johannes Weiner

Date: Tue Feb 24 2026 - 10:58:57 EST

On Tue, Feb 24, 2026 at 04:34:00PM +0800, Kairui Song wrote:
> On Tue, Feb 24, 2026 at 12:46 AM Johannes Weiner <hannes@xxxxxxxxxxx> wrote:
> >
> > On Fri, Feb 20, 2026 at 07:42:09AM +0800, Kairui Song via B4 Relay wrote:
> > > From: Kairui Song <kasong@xxxxxxxxxxx>
> > >
> > > To prepare for merging the swap_cgroup_ctrl into the swap table, store
> > > the memcg info in the swap table on swapout.
> > >
> > > This is done by using the existing shadow format.
> > >
> > > Note this also changes the refault counting at the nearest online memcg
> > > level:
> > >
> > > Unlike file folios, anon folios are mostly exclusive to one mem cgroup,
> > > and each cgroup is likely to have different characteristics.
> >
> > This is not correct.
> >
> > As much as I like the idea of storing the swap_cgroup association
> > inside the shadow entry, the refault evaluation needs to happen at the
> > level that drove eviction.
> >
> > Consider a workload that is split into cgroups purely for accounting,
> > not for setting different limits:
> >
> > workload (limit domain)
> > `- component A
> > `- component B
> >
> > This means the two components must compete freely, and it must behave
> > as if there is only one LRU. When pages get reclaimed in a round-robin
> > fashion, both A and B get aged at the same pace. Likewise, when pages
> > in A refault, they must challenge the *combined* workingset of both A
> > and B, not just the local pages.
> >
> > Otherwise, you risk retaining stale workingset in one subgroup while
> > the other one is thrashing. This breaks userspace expectations.
> >
>
> Hi Johannes, thanks for pointing this out.
>
> I'm just not sure how much of a real problem this is. The refault
> challenge change was made in commit b910718a948a which was before anon
> shadow was introduced. And shadows could get reclaimed, especially
> when under pressure (and we could be doing that again by reclaiming
> full_clusters with swap tables). And MGLRU simply ignores the
> target_memcg here yet it performs surprisingly well with multiple
> memcg setups. And I did find a comment in workingset.c saying the
> kernel used to activate all pages, which is also fine. And that commit
> also mentioned the active list shrinking, but anon active list gets
> shrinked just fine without refault feedback in shrink_lruvec under
> can_age_anon_pages.

*if inactive anon is empty, as part of the second-chance logic

Please try to understand *why* this code is the way it is before
throwing it all out. It was driven by real production problems. The
fact that some workloads don't care is not proof that many others won't
be hurt if you break this.

Anon refault detection was added for that reason: Once you have swap,
you facilitate anon workingsets that exceed memory capacity. At that
point, cache replacement strategies apply. Scan resistance matters.

With fast modern compression and flash swap, the anon set alone can be
larger than memory capacity. Everything that
6a3ed2123a78de22a9e2b2855068a8d89f8e14f4 says about file cache starts
applying to anonymous pages: you don't want to throw out the hot anon
workingset just because somebody is doing a one-off burst scan through
a larger set of cold, swapped out pages.
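
To make the scan-resistance point concrete, here is a minimal sketch of a
refault-distance check. It is a toy model, not kernel code: the function
names, the counter values and the workingset size are all hypothetical, and
the comparison is a simplification of what a real implementation has to
account for.

```python
# Toy model of a refault-distance check; not kernel code.
# All names and numbers here are hypothetical.

def refault_distance(evicted_at, refaulted_at):
    """Number of evictions that happened while the page was out of memory."""
    return refaulted_at - evicted_at


def should_activate(evicted_at, refaulted_at, workingset_size):
    """Treat the refaulting page as hot if it came back within a window
    no larger than the workingset it competes with."""
    return refault_distance(evicted_at, refaulted_at) <= workingset_size


WORKINGSET_SIZE = 1000  # pages currently protected as hot

# A hot page that was pushed out and touched again soon after.
hot = should_activate(evicted_at=5000, refaulted_at=5400,
                      workingset_size=WORKINGSET_SIZE)

# A cold page touched once by a scan, long after it was swapped out.
cold = should_activate(evicted_at=5000, refaulted_at=90000,
                       workingset_size=WORKINGSET_SIZE)

print(hot, cold)
```

In this model the hot page is activated and challenges the existing
workingset, while the one-off scan page is not and cannot displace it.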

Like I said in the LSFMM thread, there is no difference between anon
and file. Historically there wasn't one, either. The LRU lists were
split mechanically because noswap systems became common (lots of RAM +
rotational drives = sad swap) and there was no point in scanning/aging
anonymous memory if there is no swap space.

But no reasonable argument has been put forth why anon should be aged
completely differently than file when you DO have swap.

There is more explanation of why the cgroup behavior is the way it is
in the cover letter portion of 53138cea7f398d2cdd0fa22adeec7e16093e1ebd.
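
The cgroup case quoted above (workload with components A and B) can be
sketched the same way. This is again a toy model with made-up names and
numbers, not kernel code. It deliberately holds the refault distance fixed
and only varies which workingset the distance is compared against; in a
real system the distance itself would also depend on where evictions are
counted.

```python
# Toy model, not kernel code: all names and numbers are made up.
WORKINGSET = {"A": 100, "B": 900}  # resident pages per component


def size_at(level):
    """Workingset size seen when a refault is evaluated at `level`."""
    if level == "workload":        # the limit domain that drove eviction
        return sum(WORKINGSET.values())
    return WORKINGSET[level]       # a single component, evaluated locally


def activates(refault_distance, level):
    """True if a refault at this distance counts as hot at `level`."""
    return refault_distance <= size_at(level)


distance = 500  # assumed: a page of A refaults after 500 evictions

print(activates(distance, "A"))         # evaluated against A alone
print(activates(distance, "workload"))  # evaluated at the eviction level
```

Under these assumptions, the local evaluation rejects the refault and A
keeps thrashing next to B's possibly stale pages, while the evaluation at
the limit domain lets the page challenge the combined workingset.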