Re: [PATCH v9 3/6] mm: memcontrol: add interface for swap tier selection

From: Joshua Hahn

Date: Tue Jun 23 2026 - 14:56:36 EST

On Tue, 23 Jun 2026 11:10:32 -0700 Yosry Ahmed <yosry@xxxxxxxxxx> wrote:

> > To get back to the question of how the auto-tuning should work, the
> > main question is to which ratio we scale the swap limits to.
> > Do we set the swap limits proportional to how much swap is present
> > in the system, or how much swap is available to the cgroup?
> >
> > So if we have 3 swap tiers A, B, C, with 50G, 30G, and 20G capacity
> > respectively, how much should a cgroup with swap.max = 10G have if
> > it is limited to tiers A and B?
> >
> > This is what I was getting at earlier when I said we have to calculate
> > different ratios for different cgroups, based on what tiers they have
> > access to.
>
> That's a good question. I think the case that is particularly
> interesting is whether or not the limits of other tiers should change
> when another tier is disabled/enabled.
>
> So basically in your example, assuming everything starts as "max",
> when swap.max is set to 10G, the autoscaled limits would be: (tier A,
> 5G), (tier B, 3G), (tier C, 2G). Now the question becomes, if
> userspace sets the limit of tier C to 0, should the limits for tiers A
> and B change?
>
> On one hand, it's simpler to just keep the autoscaled limits unchanged
> in this case. However, this means that the effective swap limit is now
> 8G, which is not great :/
>
> The alternative is to recalculate all the limits when one of them
> changes, in which case the limits of A and B would change to 6.25G and
> 3.75G. But I don't know if this will work well if we allow custom
> limits. What happens if the limit of tier C is written as 1 (or 4096)
> instead of 0? It's effectively the same scenario, but the tier is
> technically allowed.
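
To make the numbers in the quoted example concrete, here is a rough
sketch of the two options. The helper name autoscale() and its
"allowed" argument are made up for illustration; this is not code from
the series, just the proportional arithmetic under the assumption that
limits scale with tier capacity.

```python
def autoscale(swap_max_gib, tier_capacity_gib, allowed=None):
    """Split swap_max across tiers in proportion to tier capacity.

    Hypothetical helper: tiers outside `allowed` get a limit of 0 and
    are left out of the ratio.
    """
    if allowed is None:
        allowed = set(tier_capacity_gib)
    total = sum(cap for t, cap in tier_capacity_gib.items() if t in allowed)
    if total == 0:
        return {t: 0.0 for t in tier_capacity_gib}
    return {
        t: (swap_max_gib * cap / total if t in allowed else 0.0)
        for t, cap in tier_capacity_gib.items()
    }


tiers = {"A": 50, "B": 30, "C": 20}

# Option 1: system-wide ratios, limits left alone after tier C is set to 0.
system_wide = autoscale(10, tiers)            # A=5.0, B=3.0, C=2.0
effective = system_wide["A"] + system_wide["B"]  # 8.0, below swap.max

# Option 2: recalculate over the tiers the cgroup can still use.
rescaled = autoscale(10, tiers, allowed={"A", "B"})  # A=6.25, B=3.75, C=0.0
```

Option 1 leaves 2G of swap.max unusable; option 2 keeps the full 10G
but changes A and B behind the user's back.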

I think the one problem with this is that it becomes quite easy to
accidentally overcommit. As a toy example, if you have 10 workloads and
100G swap (as in the example I gave above), intuitively setting
swap.max = 10G for all 10 workloads shouldn't ever cause any contention
on capacity. But if you start excluding some tiers from some workloads,
you actually get overcommitting on the tiers that can service the
most workloads.
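
A quick sketch of that toy example, assuming the same 50G/30G/20G
tiers and assuming (my choice, purely for illustration) that 5 of the
10 workloads exclude tier C and get their limits rescaled over A and B.
The function names are hypothetical.

```python
TIERS = {"A": 50, "B": 30, "C": 20}  # capacity in GiB, 100G total


def scaled_limits(swap_max, allowed):
    """Per-tier limits proportional to capacity of the allowed tiers."""
    total = sum(TIERS[t] for t in allowed)
    return {t: (swap_max * TIERS[t] / total if t in allowed else 0.0)
            for t in TIERS}


def tier_commit(workloads):
    """Sum the per-tier limits over all (swap_max, allowed) workloads."""
    commit = dict.fromkeys(TIERS, 0.0)
    for swap_max, allowed in workloads:
        for t, limit in scaled_limits(swap_max, allowed).items():
            commit[t] += limit
    return commit


def overcommitted(commit):
    """Tiers whose summed limits exceed their capacity."""
    return {t for t in TIERS if commit[t] > TIERS[t]}


everything = {"A", "B", "C"}
uniform = tier_commit([(10, everything)] * 10)
mixed = tier_commit([(10, everything)] * 5 + [(10, {"A", "B"})] * 5)

print(uniform, overcommitted(uniform))
print(mixed, overcommitted(mixed))
```

With uniform system-wide ratios the per-tier sums land exactly on
capacity (50/30/20). In the mixed case A is committed to 56.25G and B
to 33.75G, while C sits at 10G, even though the swap.max values still
add up to 100G.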

I am not sure how concerning swap overcommit is, but at least in the
memory tiering scenario, accidental overcommitting of toptier memory
seemed bad enough that I wanted to avoid the problem entirely.

> The more I think about it, the more I realize it may be best to drop
> the autoscaling thing. I imagine memory tiering might run into similar
> issues too :/

And that's why I didn't include opt-in/opt-out for any of the tiers;
if you have system-wide ratios, there's no need to change the ratios
at all, and as long as the sum of memory.limit across all workloads
is under the total capacity, none of the tiers will be overcommitted
either.

Now, all of these complications aside, I think we might be overthinking
this a bit :-) The auto-scaling should just provide some sort of
"reasonable" default; users can always override the per-tier limits if
they are unhappy with the autoscaled values.

In fact, maybe it even makes sense to allow the sum of the swap tier
limits to exceed swap.max.
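
To illustrate what that could mean, here is a minimal sketch in which a
charge has to fit under both the per-tier limit and swap.max. This is
only an assumption about how the two limits might interact, not what
the patch does; can_charge() and the numbers are hypothetical.

```python
def can_charge(usage, tier, size, tier_limit, swap_max):
    """Allow a charge only if it fits the tier limit and the total limit."""
    within_tier = usage.get(tier, 0) + size <= tier_limit[tier]
    within_total = sum(usage.values()) + size <= swap_max
    return within_tier and within_total


# Per-tier limits sum to 14G, which is more than swap.max = 10G.
tier_limit = {"A": 8, "B": 6}
swap_max = 10
usage = {"A": 8, "B": 0}  # tier A is already at its limit

print(can_charge(usage, "A", 1, tier_limit, swap_max))  # tier A is full
print(can_charge(usage, "B", 2, tier_limit, swap_max))  # fits both limits
print(can_charge(usage, "B", 3, tier_limit, swap_max))  # would exceed swap.max
```

Here the per-tier limits only cap how much of each tier the cgroup can
take, and swap.max stays the hard cap on the total, so the cgroup is
free to pick any mix of A and B up to 10G.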

(I actually recall having a really similar discussion when I was working
on weighted interleave auto-tuning a year ago, on how weights should be
set when switching between manually-set limits and relying on
auto-scaled defaults [1]. I don't think there's a need to follow this
convention, but we should think about what the expected behavior should
be when a user manually sets a limit but later wants to go back to the
auto-scaled limits.)

Anyways, I think these are important questions. Youngjun, Nhat, Shakeel,
any thoughts from you all? :-)

[1] https://lore.kernel.org/all/8734hbiq7j.fsf@DESKTOP-5N7EMDA/