Re: [PATCH RFC 06/15] memcg, swap: reparent the swap entry on swapin if swapout cgroup is dead

From: Johannes Weiner

Date: Mon Feb 23 2026 - 11:25:44 EST

On Fri, Feb 20, 2026 at 07:42:07AM +0800, Kairui Song via B4 Relay wrote:
> From: Kairui Song <kasong@xxxxxxxxxxx>
>
> As a result this will always charge the swapin folio into the dead
> cgroup's parent cgroup, and ensure folio->swap belongs to folio_memcg.
> This only affects some uncommon behavior when a process is moved
> between memcgs.
>
> When a process that previously swapped out some memory is moved to
> another cgroup, and the cgroup where the swapout occurred is dead,
> folios swapped in from the old swap entries are charged to the new cgroup.
> Combined with the lazy freeing of swap cache, this leads to a strange
> situation where the folio->swap entry belongs to a cgroup that is not
> folio->memcg.
>
> Swapin from a dead zombie memcg might be rare in practice: cgroups are
> offlined only after the workload in them is gone, which requires zapping
> the page tables first and releases all swap entries. Shmem is
> a bit different, but shmem always has swap count == 1 and force
> releases the swap cache. So for shmem, charging into the new memcg and
> releasing the entry does look more sensible.
>
> However, to make things easier to understand for an RFC, let's just
> always charge to the parent cgroup if the leaf cgroup is dead. This may
> not be the best design, but it makes the following work much easier to
> demonstrate.
>
> For a better solution, we can later:
>
> - Dynamically allocate a swap cluster trampoline cgroup table
> (ci->memcg_table) and use that for zombie swapin only. This is
> actually OK and may not cause a mess at the code level, since the
> incoming swap table compaction will require table expansion on swap-in
> as well.
>
> - Just tolerate a 2-byte per slot overhead all the time, which is also
> acceptable.
>
> - Limit the charge-to-parent behavior to only one situation: entries
> whose swap count > 2 and whose process is migrated to another cgroup
> after swapout. This is even rarer to see in practice, I think.
>
> For reference, the memory ownership model of cgroup v2:
>
> """
> A memory area is charged to the cgroup which instantiated it and stays
> charged to the cgroup until the area is released. Migrating a process
> to a different cgroup doesn't move the memory usages that it
> instantiated while in the previous cgroup to the new cgroup.
>
> A memory area may be used by processes belonging to different cgroups.
> To which cgroup the area will be charged is in-deterministic; however,
> over time, the memory area is likely to end up in a cgroup which has
> enough memory allowance to avoid high reclaim pressure.
>
> If a cgroup sweeps a considerable amount of memory which is expected
> to be accessed repeatedly by other cgroups, it may make sense to use
> POSIX_FADV_DONTNEED to relinquish the ownership of memory areas
> belonging to the affected files to ensure correct memory ownership.
> """
>
> So I think all of the solutions mentioned above, including this commit,
> are not wrong.
>
> Signed-off-by: Kairui Song <kasong@xxxxxxxxxxx>

Those semantics look good to me. I think it's better than the status
quo, actually.
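
For illustration only, the charge-to-parent rule from the quoted commit
message can be modeled in user space roughly as below. This is a sketch,
not the patch: struct cg_model and swapin_charge_target() are
hypothetical names, and continuing the walk past a dead parent to the
nearest online ancestor is an assumption on my part.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical user-space stand-in for a memory cgroup.
 * This is NOT the kernel's struct mem_cgroup.
 */
struct cg_model {
	const char *name;
	bool online;             /* false once the cgroup is offlined ("dead") */
	struct cg_model *parent; /* NULL for the root */
};

/*
 * Pick the cgroup a swapin folio would be charged to: if the cgroup that
 * did the swapout is dead, walk up until an online ancestor is found.
 * Assumption: the root is always online, so the walk terminates there.
 */
struct cg_model *swapin_charge_target(struct cg_model *swapout_cg)
{
	struct cg_model *cg = swapout_cg;

	assert(cg != NULL);
	while (!cg->online && cg->parent)
		cg = cg->parent;
	return cg;
}
```

In this model, a swap entry recorded against a dead leaf resolves to its
parent, so the folio's charge and the swap entry's owner end up in the
same cgroup after swapin.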