Re: [PATCH v9 3/6] mm: memcontrol: add interface for swap tier selection

From: Joshua Hahn

Date: Mon Jun 22 2026 - 19:20:03 EST

On Mon, 22 Jun 2026 15:26:17 -0700 Yosry Ahmed <yosry@xxxxxxxxxx> wrote:

> On Mon, Jun 22, 2026 at 3:10 PM Joshua Hahn <joshua.hahnjy@xxxxxxxxx> wrote:
> >
> > On Mon, 22 Jun 2026 14:21:30 -0700 Yosry Ahmed <yosry@xxxxxxxxxx> wrote:
> >
> > > On Sat, Jun 20, 2026 at 11:17 AM Youngjun Park <her0gyugyu@xxxxxxxxx> wrote:
> > > >
> > > > Introduce memory.swap.tiers.max, a flat-keyed file listing each
> > > > tier defined in /sys/kernel/mm/swap/tiers with its state, "max"
> > > > (allowed, the default) or "0" (disabled). A tier is one bit in the
> > > > cgroup's tier mask, so writing "<tier> max" or "<tier> 0" sets or
> > > > clears that bit.
> > > >
> > > > Since the current use case lacks amount control, it only supports
> > > > "max" (on) and "0" (off). Therefore, it does not track per-tier swap
> > > > usage, relying instead on a fast runtime bitmask check.
> > > >
> > > > We maintain both `mask` and `effective_mask`. The `effective_mask` is
> > > > strictly bounded by the parent (e.g., if a parent is "0", the child's
> > > > effective state is "0" even if its `mask` is "max"). Maintaining this
> > > > separately avoids costly cgroup tree traversals to check ancestors at
> > > > runtime.
> > > >
> > > > Suggested-by: Shakeel Butt <shakeel.butt@xxxxxxxxx>
> > > > Suggested-by: Yosry Ahmed <yosry@xxxxxxxxxx>
> > > > Signed-off-by: Youngjun Park <youngjun.park@xxxxxxx>
> > > > ---
> > > > Documentation/admin-guide/cgroup-v2.rst | 20 +++++
> > > > Documentation/mm/swap-tier.rst | 9 +++
> > > > include/linux/memcontrol.h | 5 ++
> > > > mm/memcontrol.c | 67 ++++++++++++++++
> > > > mm/swap_state.c | 5 +-
> > > > mm/swap_tier.c | 102 +++++++++++++++++++++++-
> > > > mm/swap_tier.h | 57 +++++++++++--
> > > > 7 files changed, 255 insertions(+), 10 deletions(-)
> > > >
> > > > diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
> > > > index 6efd0095ed99..4843ffcfd110 100644
> > > > --- a/Documentation/admin-guide/cgroup-v2.rst
> > > > +++ b/Documentation/admin-guide/cgroup-v2.rst
> > > > @@ -1850,6 +1850,26 @@ The following nested keys are defined.
> > > > Swap usage hard limit. If a cgroup's swap usage reaches this
> > > > limit, anonymous memory of the cgroup will not be swapped out.
> > > >
> > > > + memory.swap.tiers.max
> > > > + A read-write flat-keyed file which exists on non-root
> > > > + cgroups. The default is "max" for every tier.
> >
> > Hi Yosry,
> >
> > Sorry, I feel like I'm joining the party late. Apologies if I'm missing
> > some context or repeating a discussion that's already been had.
> > Please let me know if that is the case.
> >
> > One quick tangent:
> > I was chatting with Nhat last week about swap tiers and their relation to
> > memory tiering. Nhat brought up a good point, which is that while both
> > swap tiers and memory tiers provide a clear hierarchy of performance,
> > only memory tiering allows for movement between the tiers.
> > AFAICT, swap tiering does not allow for direct migration from a higher
> > tier swap backend to a lower tier swap backend if the higher tier
> > backend runs out of memory.
> >
> > In that sense, I'm not entirely sure if we need to enforce similar
> > semantics across swap tiering and memory tiering; it seems like there
> > are some fundamental differences anyways to how we treat these tiers.
> >
> > > I wonder what the default behavior should be if memory.swap.max is set
> > > to a value other than "max". Should the limits in
> > > memory.swap.tiers.max auto-scale or remain as "max"? We probably want
> > > to keep the behavior consistent with memory tiering.
> > >
> > > Shakeel/Joshua, WDYT?
> >
> > I think that the motivation behind these tiers is different for swap
> > and memory. Tiered memory limits are motivated by preventing one
> > workload from consuming all of a valuable resource, while swap tiers
> > seem to be more about excluding certain workloads from performant
> > tiers and ensuring other workloads stay on those performant tiers.
> >
> > IOW memory tiers exist for fairness, but it seems like swap tiers exist
> > for workload performance tiering. But maybe there's a usecase out there
> > that would want fairness to apply in the swap tiers as well that I am
> > not seeing.
>
> I am not sure what use cases exist, but I think it's possible we end
> up wanting to enforce fairness for swap tiers as well. Maybe not as
> aggressively as memory (e.g. to avoid wearing out SSDs), but maybe at
> least proactively through userspace?
>
> At the end of the day, faster swap tiers are also valuable resources
> that we probably don't want a few workloads to hog. I also think the
> interfaces being consistent makes everyone's lives easier, even if
> it's a bit of an overkill for swap tiers.

I see, thank you for the explanation. That makes sense to me.

> > If that is the case, I think auto-scaling makes sense but can be a bit
> > tricky, since there is no universal tiered ratio; each workload will
> > have different tiers it can swap to, so they will all have to calculate
> > their own ratios. Tiered memory limits escape this difficulty since we
> > assume all memory can be placed on all tiers, so we have a system-wide
> > ratio :-)
>
> Hmm I don't follow. It's also possible (maybe not initially) that a
> memcg cannot use specific memory tiers, right? I am not sure what the
> difference is.

You're right, I was speaking more to the current state of memory tiers.
The majority of the feedback I received was that we already have too
many memcg knobs, so I just opted to make tiered memcg limits a
cgroup mount, with no ability for individual memcgs to tune their
limits or opt-in/out.

What do you think, Yosry? Would it make sense for us to be able to
tune these values? Personally I think it does, but I wanted to get
the basic features merged before pushing to make those knobs
tunable.

If we want to make the tuning the same across swap & memory we should
probably align on the file names and how we interact with them.

Thanks,
Joshua