Re: [PATCH v9 3/6] mm: memcontrol: add interface for swap tier selection
From: Yosry Ahmed
Date: Mon Jun 22 2026 - 18:26:46 EST
On Mon, Jun 22, 2026 at 3:10 PM Joshua Hahn <joshua.hahnjy@xxxxxxxxx> wrote:
>
> On Mon, 22 Jun 2026 14:21:30 -0700 Yosry Ahmed <yosry@xxxxxxxxxx> wrote:
>
> > On Sat, Jun 20, 2026 at 11:17 AM Youngjun Park <her0gyugyu@xxxxxxxxx> wrote:
> > >
> > > Introduce memory.swap.tiers.max, a flat-keyed file listing each
> > > tier defined in /sys/kernel/mm/swap/tiers with its state, "max"
> > > (allowed, the default) or "0" (disabled). A tier is one bit in the
> > > cgroup's tier mask, so writing "<tier> max" or "<tier> 0" sets or
> > > clears that bit.
> > >
> > > Since the current use case lacks amount control, it only supports
> > > "max" (on) and "0" (off). Therefore, it does not track per-tier swap
> > > usage, relying instead on a fast runtime bitmask check.
> > >
> > > We maintain both `mask` and `effective_mask`. The `effective_mask` is
> > > strictly bounded by the parent (e.g., if a parent is "0", the child's
> > > effective state is "0" even if its `mask` is "max"). Maintaining this
> > > separately avoids costly cgroup tree traversals to check ancestors at
> > > runtime.
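A minimal userspace sketch of the mask / effective_mask relationship described above. The struct and function names here are hypothetical (they are not the ones used in the patch), and the sketch assumes at most 32 tiers and that the effective mask is recomputed whenever a mask is written:

```c
#include <stdint.h>

/* Hypothetical stand-in for the per-cgroup tier state; not the patch's struct. */
struct tier_state {
	uint32_t mask;           /* what this cgroup asked for, one bit per tier */
	uint32_t effective_mask; /* mask bounded by every ancestor */
};

/* "<tier> max" sets the tier's bit, "<tier> 0" clears it. tier must be < 32. */
static void tier_set(struct tier_state *ts, unsigned int tier, int allowed)
{
	if (allowed)
		ts->mask |= UINT32_C(1) << tier;
	else
		ts->mask &= ~(UINT32_C(1) << tier);
}

/*
 * Assumed to run at write time (parent first, then children), so the
 * swap-out path only tests one bit and never walks ancestors.
 */
static void tier_update_effective(struct tier_state *child,
				  const struct tier_state *parent)
{
	child->effective_mask = child->mask & parent->effective_mask;
}

/* Fast runtime check: is this tier usable by the cgroup? */
static int tier_allowed(const struct tier_state *ts, unsigned int tier)
{
	return (ts->effective_mask >> tier) & 1u;
}
```

With this layout a child whose own mask still says "max" for a tier reads as disabled once the parent clears that tier, which matches the "parent is 0" example in the commit message.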
> > >
> > > Suggested-by: Shakeel Butt <shakeel.butt@xxxxxxxxx>
> > > Suggested-by: Yosry Ahmed <yosry@xxxxxxxxxx>
> > > Signed-off-by: Youngjun Park <youngjun.park@xxxxxxx>
> > > ---
> > > Documentation/admin-guide/cgroup-v2.rst | 20 +++++
> > > Documentation/mm/swap-tier.rst | 9 +++
> > > include/linux/memcontrol.h | 5 ++
> > > mm/memcontrol.c | 67 ++++++++++++++++
> > > mm/swap_state.c | 5 +-
> > > mm/swap_tier.c | 102 +++++++++++++++++++++++-
> > > mm/swap_tier.h | 57 +++++++++++--
> > > 7 files changed, 255 insertions(+), 10 deletions(-)
> > >
> > > diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
> > > index 6efd0095ed99..4843ffcfd110 100644
> > > --- a/Documentation/admin-guide/cgroup-v2.rst
> > > +++ b/Documentation/admin-guide/cgroup-v2.rst
> > > @@ -1850,6 +1850,26 @@ The following nested keys are defined.
> > > Swap usage hard limit. If a cgroup's swap usage reaches this
> > > limit, anonymous memory of the cgroup will not be swapped out.
> > >
> > > + memory.swap.tiers.max
> > > + A read-write flat-keyed file which exists on non-root
> > > + cgroups. The default is "max" for every tier.
>
> Hi Yosry,
>
> Sorry, I feel like I'm joining the party late. Apologies if I'm missing
> some context or repeating a discussion that's already been had.
> Please let me know if that is the case.
>
> One quick tangent:
> I was chatting with Nhat last week about swap tiers and its relation to
> memory tiering. Nhat brought up a good point, which is that while both
> swap tiers and memory tiers provide a clear hierarchy of performance,
> only memory tiering allows for movement between the tiers.
> AFAICT, swap tiering does not allow for direct migration from a higher
> tier swap backend to a lower tier swap backend if the higher tier
> backend runs out of memory.
>
> In that sense, I'm not entirely sure if we need to enforce similar
> semantics across swap tiering and memory tiering; it seems like there
> are some fundamental differences anyways to how we treat these tiers.
>
> > I wonder what should the default behavior be if memory.swap.max is set
> > to a value other than "max". Should the limits in
> > memory.swap.tiers.max auto-scale or remain as "max"? We probably want
> > to keep the behavior consistent with memory tiering.
> >
> > Shakeel/Joshua, WDYT?
>
> I think that the motivation behind these tiers is different for swap
> and memory. Tiered memory limits are motivated by preventing one
> workload from consuming all of a valuable resource, while swap tiers
> seem to be more about excluding certain workloads from using performant
> tiers and ensuring other workloads stay on those performant tiers.
>
> IOW memory tiers exist for fairness, but it seems like swap tiers exist
> for workload performance tiering. But maybe there's a use case out there
> that would want fairness to apply in the swap tiers as well that I am
> not seeing.
I am not sure what use cases exist, but I think it's possible we end
up wanting to enforce fairness for swap tiers as well. Maybe not as
aggressively as memory (e.g. to avoid wearing out SSDs), but maybe at
least proactively through userspace?

At the end of the day, faster swap tiers are also valuable resources
that we probably don't want a few workloads to hog. I also think the
interfaces being consistent makes everyone's lives easier, even if
it's a bit of overkill for swap tiers.
>
> If that is the case, I think auto-scaling makes sense but can be a bit
> tricky, since there is no universal tiered ratio; each workload will
> have different tiers it can swap to, so they will all have to calculate
> their own ratios. Tiered memory limits escapes this difficulty since we
> assume all memory can be placed on all tiers, so we have a system-wide
> ratio :-)
Hmm, I don't follow. It's also possible (maybe not initially) that a
memcg cannot use specific memory tiers, right? I am not sure what the
difference is.
>
> Let me know what you think! Have a great day :D
> Joshua