Re: [PATCH v9 3/6] mm: memcontrol: add interface for swap tier selection

From: Yosry Ahmed

Date: Tue Jun 23 2026 - 16:06:40 EST


On Tue, Jun 23, 2026 at 11:56 AM Joshua Hahn <joshua.hahnjy@xxxxxxxxx> wrote:
>
> On Tue, 23 Jun 2026 11:10:32 -0700 Yosry Ahmed <yosry@xxxxxxxxxx> wrote:
>
> > > To get back to the question of how the auto-tuning should work, the
> > > main question is which ratio we scale the swap limits to.
> > > Do we set the swap limits proportional to how much swap is present
> > > in the system, or how much swap is available to the cgroup?
> > >
> > > So if we have 3 swap tiers A, B, C, with 50G, 30G, and 20G capacity
> > > respectively, how much should a cgroup with swap.max = 10G have if
> > > it is limited to tiers A and B?
> > >
> > > This is what I was getting at earlier when I said we have to calculate
> > > different ratios for different cgroups, based on what tiers they have
> > > access to.
> >
> > That's a good question. I think the case that is particularly
> > interesting is whether or not the limits of other tiers should change
> > when another tier is disabled/enabled.
> >
> > So basically in your example, assuming everything starts as "max",
> > when swap.max is set to 10G, the autoscaled limits would be: (tier A,
> > 5G), (tier B, 3G), (tier C, 2G). Now the question becomes, if
> > userspace sets the limit of tier C to 0, should the limits for tiers A
> > and B change?
> >
> > On one hand, it's simpler to just keep the autoscaled limits unchanged
> > in this case. However, this means that the effective swap limit is now
> > 8G, which is not great :/
> >
> > The alternative is to recalculate all the limits when one of them
> > changes, in which case the limits of A and B would change to 6.25G and
> > 3.75G. But I don't know if this will work well if we allow custom
> > limits. What happens if the limit of tier C is written as 1 (or 4096)
> > instead of 0? It's effectively the same scenario, but the tier is
> > technically allowed.
>
> I think the one problem with this is that it becomes quite easy to
> accidentally overcommit. As a toy example, if you have 10 workloads and
> 100G swap (as in the example I gave above), intuitively setting
> swap.max = 10G for all 10 workloads shouldn't ever cause any contention
> on capacity. But if you start excluding some tiers from some workloads,
> you actually get overcommitting on the tiers that can service the
> most workloads.
>
> I am not sure how concerning swap overcommit was, but at least in the
> memory tiering scenario accidental overcommitting of toptier memory
> seemed bad enough that I wanted to avoid the problem entirely.
>
> > The more I think about it, the more I realize it may be best to drop
> > the autoscaling thing. I imagine memory tiering might run into similar
> > issues too :/
>
> And that's why I didn't include opt-in/opt-out for any of the tiers;
> if you have system-wide ratios, there's no need to change the ratios
> at all, and as long as the sum of your memory.limit for each workload
> is under the total capacity, all tiers will also not be overcommitted.

I think eventually there may be use cases to opt some memcgs out for
some memory tiers. For example, limit sensitive workloads to the top
tier (or vice versa).

>
> Now, all of these complications aside, I think we might be overthinking
> a bit here :-) The auto-scaling should just provide some sort of
> "reasonable" default, the users can always override the per-tier
> limits if they are unhappy with the autoscaled values.

I agree, but it seems like neither option is ideal here. I think it
might make more sense to not present a default value at all, and have
"max" be the default for all the tiers, even if memory.max or swap.max
isn't. Userspace can set the limits if it needs to. Autoscaling the
limits in userspace should be easy.
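
As a rough illustration of the proportional split discussed above, here
is a minimal userspace sketch. It is not part of the patch series: the
function name, the dict-based tier description, and the "allowed"
parameter are all made up for the example, and it only computes the
numbers. How the limits would actually be written to the cgroup is
left out, since that interface is still under discussion.

```python
from typing import Dict, Iterable, Optional

GiB = 1 << 30


def autoscale_tier_limits(
    swap_max: int,
    tier_capacity: Dict[str, int],
    allowed: Optional[Iterable[str]] = None,
) -> Dict[str, int]:
    """Split swap_max across tiers in proportion to their capacity.

    Hypothetical helper: tier_capacity maps a tier name to its size in
    bytes, and allowed is the set of tiers the cgroup may use (all tiers
    when None). Tiers that are not allowed get a limit of 0.
    """
    allowed_set = set(tier_capacity) if allowed is None else set(allowed)

    # Only the capacity of the tiers the cgroup can use counts toward
    # the ratio, so excluding a tier rescales the remaining ones.
    total = sum(cap for name, cap in tier_capacity.items() if name in allowed_set)

    limits = {}
    for name, cap in tier_capacity.items():
        if name in allowed_set and total > 0:
            # Integer math in bytes; rounds down.
            limits[name] = swap_max * cap // total
        else:
            limits[name] = 0
    return limits


tiers = {"A": 50 * GiB, "B": 30 * GiB, "C": 20 * GiB}

# System-wide ratios: every tier is available to the cgroup.
print(autoscale_tier_limits(10 * GiB, tiers))

# Ratios recomputed for a cgroup restricted to tiers A and B.
print(autoscale_tier_limits(10 * GiB, tiers, allowed=["A", "B"]))
```

With the 50G/30G/20G example from earlier in the thread and swap.max =
10G, the first call yields 5G/3G/2G, and the second yields 6.25G/3.75G
for A and B with 0 for C. Whether to recompute when a tier is excluded
is a policy decision that stays in userspace in this sketch.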

>
> In fact, maybe it even makes sense to have sum of swap tier limits >
> swap.max.
>
> (I actually recall having a really similar discussion when I was working
> on weighted interleave auto-tuning a year ago, on how weights should be
> set when switching between manually-set limits and relying on
> auto-scaled defaults [1]. I don't think there's a need to follow this
> convention, but we should think about what the expected behavior should
> be if a user manually sets a limit, but later wants to go back to
> auto-scaling limits).
>
> Anyways, I think these are important questions. Youngjun, Nhat, Shakeel,
> any thoughts from you all? :-)
>
> [1] https://lore.kernel.org/all/8734hbiq7j.fsf@DESKTOP-5N7EMDA/