Re: [PATCH RFC 08/15] mm, swap: store and check memcg info in the swap table

From: Kairui Song

Date: Tue Feb 24 2026 - 03:35:16 EST


On Tue, Feb 24, 2026 at 12:46 AM Johannes Weiner <hannes@xxxxxxxxxxx> wrote:
>
> On Fri, Feb 20, 2026 at 07:42:09AM +0800, Kairui Song via B4 Relay wrote:
> > From: Kairui Song <kasong@xxxxxxxxxxx>
> >
> > To prepare for merging the swap_cgroup_ctrl into the swap table, store
> > the memcg info in the swap table on swapout.
> >
> > This is done by using the existing shadow format.
> >
> > Note this also changes the refault counting at the nearest online memcg
> > level:
> >
> > Unlike file folios, anon folios are mostly exclusive to one mem cgroup,
> > and each cgroup is likely to have different characteristics.
>
> This is not correct.
>
> As much as I like the idea of storing the swap_cgroup association
> inside the shadow entry, the refault evaluation needs to happen at the
> level that drove eviction.
>
> Consider a workload that is split into cgroups purely for accounting,
> not for setting different limits:
>
> workload (limit domain)
> `- component A
> `- component B
>
> This means the two components must compete freely, and it must behave
> as if there is only one LRU. When pages get reclaimed in a round-robin
> fashion, both A and B get aged at the same pace. Likewise, when pages
> in A refault, they must challenge the *combined* workingset of both A
> and B, not just the local pages.
>
> Otherwise, you risk retaining stale workingset in one subgroup while
> the other one is thrashing. This breaks userspace expectations.
>

Hi Johannes, thanks for pointing this out.

I'm just not sure how much of a real problem this is. The refault
challenge change was made in commit b910718a948a, which predates anon
shadow entries. Shadows can also get reclaimed, especially under
pressure (and we could be doing that again by reclaiming
full_clusters with swap tables). MGLRU simply ignores the
target_memcg here, yet it performs surprisingly well with multiple
memcg setups. I also found a comment in workingset.c saying the
kernel used to activate all pages, which is also fine. That commit
also mentioned active list shrinking, but the anon active list gets
shrunk just fine without refault feedback in shrink_lruvec under
can_age_anon_pages.
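
To make the difference under discussion concrete, here is a toy
user-space model of the refault challenge. It is not kernel code: the
struct, the function name and the numbers are made up for
illustration, it assumes both levels see the same nonresident age, and
the real logic in mm/workingset.c handles much more than this.

```c
#include <stdbool.h>

/* Hypothetical, simplified stand-in for the LRU state at one cgroup level. */
struct toy_level {
	unsigned long nonresident_age;	/* evictions + activations seen at this level */
	unsigned long workingset_size;	/* resident pages the refault competes with */
};

/*
 * A refault counts as workingset (and gets activated) when the distance
 * between eviction and refault still fits in the resident set it is
 * compared against.
 */
static inline bool toy_refault_activates(const struct toy_level *lvl,
					 unsigned long eviction_age)
{
	unsigned long refault_distance = lvl->nonresident_age - eviction_age;

	return refault_distance <= lvl->workingset_size;
}
```

With nonresident_age = 1500 and a page evicted at age 1000 (refault
distance 500), a level with workingset_size = 200 (component A alone)
rejects the refault, while a level with workingset_size = 1000 (A and
B combined) accepts it. That is the behavior change being questioned.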

So in this RFC I was just a bit aggressive and changed it. I can run
some tests with different memory size setups.

If we are not OK with it, then we can just use a ci->memcg_table and
we are fine: everything is still dynamic, but per-slot usage would be
a bit higher, going from 8 bytes to 10 bytes. Maybe later we can find
a way to make ci->memcg_table NULL and shrink back to 8 bytes with,
e.g., MGLRU, and balance the memcgs with something like aging
feedback (the latter part is just an idea, but seems doable?).
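
A rough sketch of the per-cluster cost of the extra table, just to put
a number on "8 bytes to 10 bytes". The 512 slots per cluster and the
helper name are assumptions for illustration, not taken from the
patch; the 2-byte id is simply the difference between the two sizes.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumption for illustration only: 512 swap slots per cluster. */
#define TOY_SLOTS_PER_CLUSTER 512

/* Bytes of per-slot metadata for one cluster, with or without a memcg table. */
static inline size_t toy_cluster_bytes(int with_memcg_table)
{
	size_t per_slot = sizeof(uint64_t);	/* 8-byte swap table entry */

	if (with_memcg_table)
		per_slot += sizeof(uint16_t);	/* extra 2-byte memcg id */

	return per_slot * TOY_SLOTS_PER_CLUSTER;
}
```

Under those assumptions a cluster goes from 4096 to 5120 bytes of
metadata, i.e. 25% more while the memcg table is allocated.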