Re: [PATCH v9 3/6] mm: memcontrol: add interface for swap tier selection

From: Joshua Hahn

Date: Tue Jun 23 2026 - 16:20:22 EST


On Tue, 23 Jun 2026 13:06:10 -0700 Yosry Ahmed <yosry@xxxxxxxxxx> wrote:

> On Tue, Jun 23, 2026 at 11:56 AM Joshua Hahn <joshua.hahnjy@xxxxxxxxx> wrote:
> >
> > On Tue, 23 Jun 2026 11:10:32 -0700 Yosry Ahmed <yosry@xxxxxxxxxx> wrote:
> >
> > > > To get back to the question of how the auto-tuning should work, the
> > > > main question is to which ratio we scale the swap limits to.
> > > > Do we set the swap limits proportional to how much swap is present
> > > > in the system, or how much swap is available to the cgroup?
> > > >
> > > > So if we have 3 swap tiers A, B, C, with 50G, 30G, and 20G capacity
> > > > respectively, how much should a cgroup with swap.max = 10G have if
> > > > it is limited to tiers A and B?
> > > >
> > > > This is what I was getting at earlier when I said we have to calculate
> > > > different ratios for different cgroups, based on what tiers they have
> > > > access to.
> > >
> > > That's a good question. I think the case that is particularly
> > > interesting is whether or not the limits of other tiers should change
> > > when another tier is disabled/enabled.
> > >
> > > So basically in your example, assuming everything starts as "max",
> > > when swap.max is set to 10G, the autoscaled limits would be: (tier A,
> > > 5G), (tier B, 3G), (tier C, 2G). Now the question becomes, if
> > > userspace sets the limit of tier C to 0, should the limits for tiers A
> > > and B change?
> > >
> > > On one hand, it's simpler to just keep the autoscaled limits unchanged
> > > in this case. However, this means that the effective swap limit is now
> > > 8G, which is not great :/
> > >
> > > The alternative is to recalculate all the limits when one of them
> > > changes, in which case the limits of A and B would change to 6.25G and
> > > 3.75G. But I don't know if this will work well if we allow custom
> > > limits. What happens if the limit of tier C is written as 1 (or 4096)
> > > instead of 0? It's effectively the same scenario, but the tier is
> > > technically allowed.
> >
> > I think the one problem with this is that it becomes quite easy to
> > accidentally overcommit. As a toy example, if you have 10 workloads and
> > 100G swap (as in the example I gave above), intuitively setting
> > swap.max = 10G for all 10 workloads shouldn't ever cause any contention
> > on capacity. But if you start excluding some tiers from some workloads,
> > you actually get overcommitting on the tiers that can service the
> > most workloads.
> >
> > I am not sure how concerning swap overcommit was, but at least in the
> > memory tiering scenario accidental overcommitting of toptier memory
> > seemed bad enough that I wanted to avoid the problem entirely.
> >
> > > The more I think about it, the more I realize it may be best to drop
> > > the autoscaling thing. I imagine memory tiering might run into similar
> > > issues too :/
> >
> > And that's why I didn't include opt-in/opt-out for any of the tiers;
> > if you have system-wide ratios, there's no need to change the ratios
> > at all, and as long as the sum of your memory.limit for each workload
> > is under the total capacity, all tiers will also not be overcommitted.
>
> I think eventually there may be use cases to opt some memcgs out for
> some memory tiers. For example, limit sensitive workloads to the top
> tier (or vice versa).

Yup, that makes sense to me too.

One of the things that did concern me a bit with my model for tiered
memcg limit was that system-critical processes would also be susceptible
to being demoted and churned, when we would much rather make sure
those are kept protected at the toptier.

> > Now, all of these complications aside, I think we might be overthinking
> > a bit here : -) The auto-scaling should just provide some sort of
> > "reasonable" default, the users can always override the per-tier
> > limits if they are unhappy with the autoscaled values.
>
> I agree, but it seems like both options are not ideal here. I think it
> might make more sense to not present a default value at all, have
> "max" be the default for all the tiers, even if memory.max or swap.max
> isn't. Userspace can set the limits if they need to. Autoscaling the
> limits in userspace should be easy.

I like this idea a lot. That would basically make swap tiers a no-op
unless you opt-into setting the limits yourself, so we don't run the
risk of accidentally enabling tiers.

On that note, maybe it makes sense for me to change my
memory tiering series to also just not present a default setting for
tiered limits, and instead just set them as max until the user comes
and configures them?

I think this is a better question for the memcg maintainers, who might
have more to say on this. Johannes, Michal, Roman, and Shakeel,
what do you guys think? Could an approach to just make the memory
tier limits writable from the get-go and not expose any defaults
make sense to you?

I think that would simplify the code quite a bit and also help mitigate
the possible side effects on system-critical workloads.

Thanks!
Joshua