Re: [PATCH v8 0/4] mm/swap, memcg: Introduce swap tiers for cgroup based swap control

From: Nhat Pham

Date: Thu Jun 18 2026 - 08:38:04 EST

On Wed, Jun 17, 2026 at 9:47 PM YoungJun Park <youngjun.park@xxxxxxx> wrote:
>
> On Wed, Jun 17, 2026 at 01:50:49PM -0400, Nhat Pham wrote:
>
> > On Wed, Jun 17, 2026 at 1:34 AM Youngjun Park <youngjun.park@xxxxxxx> wrote:
> > >
> > > This is the v8 series of the swap tier patchset.
> > >
> > > Great thanks to Shakeel Butt and Yosry for the reviews and discussions [1].
> > > The main change in this version is the interface change to use
> > > memory.swap.tiers.max with '0' (disable) and 'max' (enable) values.
> > > This mechanism was suggested by Shakeel and Yosry.
> >
> > I like this interface too :)
>
> > I think Yosry wants zswap as a tier, right?
> >
> > Just that without vswap, maybe don't allow it to be a tier by itself?
>
> With the current architecture, users cannot dynamically specify zswap as
> a tier, and zswap is a separate layer, so it is not tiered by itself.
>
> Once your vswap work lands, I think we can make zswap
> the default, top-level tier.
>
> After that, we can also look into cleaning up the zswap.writeback
> interface together.

SGTM if Yosry is happy with it :) FWIW, zswap is a conceptual tier,
whether or not we want to express it with your interface. This is just
interface clean-up work.

>
> > #2: Inter-tier promotion and demotion:
> > Promotion and demotion apply between tiers, not within a single
> > tier. The current interface defines only tier assignment; it does
> > not yet define when or how pages move between tiers. Two triggering
> > models are possible:
> >
> > > (a) User-triggered: userspace explicitly initiates migration between
> > > tiers (e.g. via a new interface or existing move_pages semantics).
> > > (b) Kernel-triggered: the kernel moves pages between tiers at
> > > appropriate points such as reclaim or refault.
> >
> > We'll likely need some kernel-triggered mechanism, or we'd have LRU inversion :)
> >
> > Cold pages will fill up fast tiers first, and more recent/warm pages
> > will land on slow tiers...
>
> Yeah, good point!
>
> > We'll also need to enforce isolation/fairness to make sure no workload
> > hoards the fast tiers too (but that probably requires demotion
> > support).
>
> Right, that makes sense.
>
> BTW, one thing I am curious about, though, is whether there are strong
> real-world use cases that require demotion/promotion.
> Theoretically, this looks useful but it would be helpful to better understand
> the requirements from such deployments.

I think so, yeah. The LRU inversion problem above is one :) Hard to
make proper tiering without demotion.

Say I have a workload that has an SLO - for example a PSI target - but
doesn't particularly care about exact memory placement. To optimize
resource usage, we want to place the warmer stuff in the fast tier, and
the coldest stuff in the slow tier, etc. Having the ability to do
demotion derisks the initial placement - we can place things in the fast
tier initially (and rather aggressively), then as pages age and prove
their coldness, we can move them to slower and slower tiers, etc.
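
To make the inversion concrete, here is a toy userspace simulation (not
kernel code; swap_out() and the two-tier model are made up purely for
illustration). It assumes reclaim swaps out the coldest pages first, and
that demotion simply pushes the oldest fast-tier entry down when the fast
tier is full:

```python
from collections import deque


def swap_out(pages, fast_capacity, demote):
    """Place swapped-out pages into a fast tier and a slow tier.

    `pages` is in swap-out order: reclaim evicts the coldest pages
    first, so later entries are warmer.
    """
    fast, slow = deque(), []
    for page in pages:
        if len(fast) < fast_capacity:
            fast.append(page)
        elif demote and fast:
            # Fast tier is full: the oldest (coldest) entry moves down.
            slow.append(fast.popleft())
            fast.append(page)
        else:
            # No demotion: everything after the fill lands on the slow tier.
            slow.append(page)
    return list(fast), slow


pages = ["p0", "p1", "p2", "p3", "p4", "p5"]  # p0 coldest, p5 warmest
no_demotion = swap_out(pages, 2, demote=False)
with_demotion = swap_out(pages, 2, demote=True)
print(no_demotion)    # (['p0', 'p1'], ['p2', 'p3', 'p4', 'p5'])
print(with_demotion)  # (['p4', 'p5'], ['p0', 'p1', 'p2', 'p3'])
```

Without demotion the fast tier ends up holding the two coldest pages;
with it, the two warmest.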

Otherwise, what we end up with is really a placement preference
interface more than true tiering. Which is still useful especially
when co-tenant workloads have strict latency requirements, but perhaps
we don't need a full hierarchy-style interface for it? :)

The other use case is for fairness enforcement. We can (and probably
should) start with strict limits, but setting memory.swap.tiers.max for
each cgroup is a bit of a drag, and it might leave capacity stranded
in cgroups that are allocated fast swap tier capacity but do not
utilize it. If demotion is possible, we can let workloads use more than
their fair share, then demote swap pages out of the fast tier to
enforce fairness when necessary...
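
Roughly what I have in mind, as a toy sketch (again not kernel code;
demotion_targets() and the equal-share policy are assumptions made up
for illustration, a real policy could be weighted or limit-based):

```python
def demotion_targets(usage, capacity):
    """Return {cgroup: pages to demote} for a shared fast tier.

    Cgroups may exceed their equal share while the tier still has free
    space. Once the tier is full, anything above the share becomes a
    demotion candidate.
    """
    if not usage or sum(usage.values()) < capacity:
        return {}  # no pressure on the fast tier, nothing to enforce
    share = capacity // len(usage)
    return {cg: used - share for cg, used in usage.items() if used > share}


print(demotion_targets({"a": 70, "b": 10}, 100))  # {}
print(demotion_targets({"a": 70, "b": 30}, 100))  # {'a': 20}
```

Cgroup "a" can borrow the unused capacity while "b" is idle, and only
gets pushed back to its share once the tier is actually contended.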

Obviously, it's a moot point if there is no good mechanism to transfer
data from one tier to another. The data might also be so cold that all
of this has diminishing returns, and moving things around costs more
than it's worth :) So I'm happy to start with something simple, then we
can figure out the next steps.

>
> > >
> > > #3: Per-VMA, per-process swap and BPF:
> > > Not just for memcg-based swap; it is possible to extend this to
> > > per-VMA or per-process swap. Or we could use it from a BPF program.
> > >
> > > #4: Zswap and vswap tiering:
> > > Tiering applies to the vswap + zswap combination.
> > >
> > > #5: Vswap on/off control:
> > > Currently not supported. If a strong use case arises where vswap needs
> > > to be controlled by memcg, the tier interface could be used for it.
> >
> > +1.
> >
> > Also, per-si/per-tier per-CPU allocation caching? :) Kairui already
> > has a patch for it, IIUC, but if not it's pretty critical I'd say.
>
> Yes, I missed it. Thank you for addressing it.
> We need an implementation that integrates this with the per-CPU
> allocation currently implemented on the vswap side.
>
> If Kairui's patch lands, my patch #4 also can be optimized based on that.

Yup!!

>
> > BTW, can we add some selftests, to make sure the new interface works
> > as expected, and to have example programs for new users to model their
> > scripts after? :)
>
> Yes, I agree. I think selftests are necessary.
>
> Do you want them to be introduced in this patchset, or would it be okay
> to add them separately as follow-up work?

If you have to send another version, might as well include them :)

Otherwise a follow-up is good. Thanks in advance for keeping our
codebase tested!

I'll take a look at the exact implementation on the swap side later,
but I suspect nothing much will have changed :)