Re: [PATCH v9 3/6] mm: memcontrol: add interface for swap tier selection

From: Joshua Hahn

Date: Mon Jun 22 2026 - 18:10:52 EST


On Mon, 22 Jun 2026 14:21:30 -0700 Yosry Ahmed <yosry@xxxxxxxxxx> wrote:

> On Sat, Jun 20, 2026 at 11:17 AM Youngjun Park <her0gyugyu@xxxxxxxxx> wrote:
> >
> > Introduce memory.swap.tiers.max, a flat-keyed file listing each
> > tier defined in /sys/kernel/mm/swap/tiers with its state, "max"
> > (allowed, the default) or "0" (disabled). A tier is one bit in the
> > cgroup's tier mask, so writing "<tier> max" or "<tier> 0" sets or
> > clears that bit.
> >
> > Since the current use case lacks amount control, it only supports
> > "max" (on) and "0" (off). Therefore, it does not track per-tier swap
> > usage, relying instead on a fast runtime bitmask check.
> >
> > We maintain both `mask` and `effective_mask`. The `effective_mask` is
> > strictly bounded by the parent (e.g., if a parent is "0", the child's
> > effective state is "0" even if its `mask` is "max"). Maintaining this
> > separately avoids costly cgroup tree traversals to check ancestors at
> > runtime.
> >
> > Suggested-by: Shakeel Butt <shakeel.butt@xxxxxxxxx>
> > Suggested-by: Yosry Ahmed <yosry@xxxxxxxxxx>
> > Signed-off-by: Youngjun Park <youngjun.park@xxxxxxx>
> > ---
> > Documentation/admin-guide/cgroup-v2.rst | 20 +++++
> > Documentation/mm/swap-tier.rst | 9 +++
> > include/linux/memcontrol.h | 5 ++
> > mm/memcontrol.c | 67 ++++++++++++++++
> > mm/swap_state.c | 5 +-
> > mm/swap_tier.c | 102 +++++++++++++++++++++++-
> > mm/swap_tier.h | 57 +++++++++++--
> > 7 files changed, 255 insertions(+), 10 deletions(-)
> >
> > diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
> > index 6efd0095ed99..4843ffcfd110 100644
> > --- a/Documentation/admin-guide/cgroup-v2.rst
> > +++ b/Documentation/admin-guide/cgroup-v2.rst
> > @@ -1850,6 +1850,26 @@ The following nested keys are defined.
> > Swap usage hard limit. If a cgroup's swap usage reaches this
> > limit, anonymous memory of the cgroup will not be swapped out.
> >
> > + memory.swap.tiers.max
> > + A read-write flat-keyed file which exists on non-root
> > + cgroups. The default is "max" for every tier.

Hi Yosry,

Sorry, I feel like I'm joining the party late. Apologies if I'm missing
some context or repeating a discussion that's already been had.
Please let me know if that is the case.

One quick tangent:
I was chatting with Nhat last week about swap tiers and their relation
to memory tiering. Nhat brought up a good point, which is that while
both swap tiers and memory tiers provide a clear hierarchy of
performance, only memory tiering allows for movement between the tiers.
AFAICT, swap tiering does not allow for direct migration from a higher
tier swap backend to a lower tier swap backend if the higher tier
backend runs out of space.

In that sense, I'm not entirely sure we need to enforce similar
semantics across swap tiering and memory tiering; there seem to be
some fundamental differences in how we treat these tiers anyway.

> I wonder what should the default behavior be if memory.swap.max is set
> to a value other than "max". Should the limits in
> memory.swap.tiers.max auto-scale or remain as "max"? We probably want
> to keep the behavior consistent with memory tiering.
>
> Shakeel/Joshua, WDYT?

I think that the motivation behind these tiers is different for swap
and memory. Tiered memory limits are motivated by preventing one
workload from consuming all of a valuable resource, while swap tiers
seem to be more about excluding certain workloads from the performant
tiers and ensuring other workloads stay on those performant tiers.

IOW memory tiers exist for fairness, but it seems like swap tiers exist
for workload performance tiering. But maybe there's a use case out there
that would want fairness to apply in the swap tiers as well that I am
not seeing.

If that is the case, I think auto-scaling makes sense but can be a bit
tricky, since there is no universal tiered ratio; each workload will
have different tiers it can swap to, so each will have to calculate
its own ratio. Tiered memory limits escape this difficulty since we
assume all memory can be placed on all tiers, so we have a system-wide
ratio :-)
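
To make the ratio problem concrete, here is a rough userspace sketch of
what per-cgroup auto-scaling might look like. Nothing here is from the
patch: scale_tier_limits(), the capacity array, and the proportional
split are all hypothetical, and the only thing borrowed from the series
is the idea of one bit per tier in a mask.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical sketch: split swap_max across the tiers enabled in
 * @mask, proportionally to each enabled tier's capacity. Disabled
 * tiers get a limit of 0. Units are arbitrary (e.g. pages).
 *
 * Note: swap_max * cap[i] can overflow for large inputs; a real
 * implementation would need a wider multiply or a different scheme.
 */
static void scale_tier_limits(const uint64_t *cap, unsigned int nr_tiers,
			      unsigned int mask, uint64_t swap_max,
			      uint64_t *out)
{
	uint64_t total = 0;
	unsigned int i;

	assert(nr_tiers <= 32);

	/* The denominator only counts tiers this cgroup may use. */
	for (i = 0; i < nr_tiers; i++)
		if (mask & (1u << i))
			total += cap[i];

	for (i = 0; i < nr_tiers; i++) {
		if (total && (mask & (1u << i)))
			out[i] = swap_max * cap[i] / total;
		else
			out[i] = 0;
	}
}
```

With capacities {100, 300, 600} and swap_max = 500, a cgroup allowed on
all three tiers (mask 0x7) gets {50, 150, 300}, while a cgroup excluded
from tier 0 (mask 0x6) gets {0, 166, 333}. Same system, same swap_max,
different ratios, which is the part that has no memory tiering
equivalent.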

Let me know what you think! Have a great day :D
Joshua