Re: [RFC PATCH 6/6] mm/memcontrol: Make memory.high tier-aware

From: Joshua Hahn

Date: Tue Mar 24 2026 - 11:48:41 EST


On Tue, 24 Mar 2026 16:21:06 +0530 Donet Tom <donettom@xxxxxxxxxxxxx> wrote:

>
> On 2/24/26 4:08 AM, Joshua Hahn wrote:
> > On machines serving multiple workloads whose memory is isolated via the
> > memory cgroup controller, it is currently impossible to enforce a fair
> > distribution of toptier memory among the workloads, as the only
> > enforceable limits have to do with total memory footprint, but not where
> > that memory resides.
> >
> > This makes ensuring a consistent and baseline performance difficult, as
> > each workload's performance is heavily impacted by workload-external
> > factors such as which other workloads are co-located in the same host,
> > and the order in which different workloads are started.
> >
> > Extend the existing memory.high protection to be tier-aware in the
> > charging and enforcement to limit toptier-hogging for workloads.
> >
> > Also, add a new nodemask parameter to try_to_free_mem_cgroup_pages,
> > which can be used to selectively reclaim from memory at the
> > memcg-tier intersection of a cgroup.
> >
> > Signed-off-by: Joshua Hahn <joshua.hahnjy@xxxxxxxxx>
> > ---
> > include/linux/swap.h | 3 +-
> > mm/memcontrol-v1.c | 6 ++--
> > mm/memcontrol.c | 85 +++++++++++++++++++++++++++++++++++++-------
> > mm/vmscan.c | 11 +++---
> > 4 files changed, 84 insertions(+), 21 deletions(-)
> >
> > diff --git a/include/linux/swap.h b/include/linux/swap.h
> > index 0effe3cc50f5..c6037ac7bf6e 100644
> > --- a/include/linux/swap.h
> > +++ b/include/linux/swap.h
> > @@ -368,7 +368,8 @@ extern unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
> > unsigned long nr_pages,
> > gfp_t gfp_mask,
> > unsigned int reclaim_options,
> > - int *swappiness);
> > + int *swappiness,
> > + nodemask_t *allowed);
> > extern unsigned long mem_cgroup_shrink_node(struct mem_cgroup *mem,
> > gfp_t gfp_mask, bool noswap,
> > pg_data_t *pgdat,
> > diff --git a/mm/memcontrol-v1.c b/mm/memcontrol-v1.c
> > index 0b39ba608109..29630c7f3567 100644
> > --- a/mm/memcontrol-v1.c
> > +++ b/mm/memcontrol-v1.c
> > @@ -1497,7 +1497,8 @@ static int mem_cgroup_resize_max(struct mem_cgroup *memcg,
> > }
> >
> > if (!try_to_free_mem_cgroup_pages(memcg, 1, GFP_KERNEL,
> > - memsw ? 0 : MEMCG_RECLAIM_MAY_SWAP, NULL)) {
> > + memsw ? 0 : MEMCG_RECLAIM_MAY_SWAP,
> > + NULL, NULL)) {
> > ret = -EBUSY;
> > break;
> > }
> > @@ -1529,7 +1530,8 @@ static int mem_cgroup_force_empty(struct mem_cgroup *memcg)
> > return -EINTR;
> >
> > if (!try_to_free_mem_cgroup_pages(memcg, 1, GFP_KERNEL,
> > - MEMCG_RECLAIM_MAY_SWAP, NULL))
> > + MEMCG_RECLAIM_MAY_SWAP,
> > + NULL, NULL))
> > nr_retries--;
> > }
> >
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > index 8aa7ae361a73..ebd4a1b73c51 100644
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -2184,18 +2184,30 @@ static unsigned long reclaim_high(struct mem_cgroup *memcg,
> >
> > do {
> > unsigned long pflags;
> > -
> > - if (page_counter_read(&memcg->memory) <=
> > - READ_ONCE(memcg->memory.high))
> > + nodemask_t toptier_nodes, *reclaim_nodes;
> > + bool mem_high_ok, toptier_high_ok;
> > +
> > + mt_get_toptier_nodemask(&toptier_nodes, NULL);
> > + mem_high_ok = page_counter_read(&memcg->memory) <=
> > + READ_ONCE(memcg->memory.high);
> > + toptier_high_ok = !(tier_aware_memcg_limits &&
> > + mem_cgroup_toptier_usage(memcg) >
> > + page_counter_toptier_high(&memcg->memory));
> > + if (mem_high_ok && toptier_high_ok)
> > continue;
> >
> > + if (mem_high_ok && !toptier_high_ok)
> > + reclaim_nodes = &toptier_nodes;
> > + else
> > + reclaim_nodes = NULL;
>
>
> IIUC The intent of this patch is to partition cgroup memory such that
> 0 → toptier_high is backed by higher-tier memory, and
> toptier_high → max is backed by lower-tier memory.
>
> Based on this:
>
> 1. If top-tier usage exceeds toptier_high, pages should be
>   demoted to the lower tier.
>
> 2. If lower-tier usage exceeds (max - toptier_high), pages
>   should be swapped out.
>
> 3. If total memory usage exceeds max, demotion should be
>   avoided and reclaim should directly swap out pages.
>
> I think we are only handling case (1) in this patch. When
> mem_high_ok && !toptier_high_ok, we are reclaiming pages (demotion first).
>
> However, if !mem_high_ok, the memcg reclaim path works as if
> there is no memory tiering in the cgroup. This can lead to more demotion
> and may eventually result in OOM.
>
> Should we also handle cases (2) and (3) in this patch?

Hello Donet! I hope you are doing well.

For the second condition, should pages be swapped out? If a workload
is using 0 toptier memory (extreme case, let's say they haven't set
memory.low) then lower-tier should be able to use all the way up to
max memory.

Maybe you mean that if lowtier_usage exceeds (max - toptier_usage), pages
should be swapped out? But if we rearrange this:

lowtier_usage >= max - toptier_usage
lowtier_usage + toptier_usage >= max
total_usage >= max

This is just the memory.max check, which is already handled by
existing reclaim semantics :-)
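
To make the rearrangement concrete, here is a minimal userspace sketch
(not kernel code). The function names and the page counts are made up
for illustration, and it assumes toptier usage never exceeds max so the
unsigned subtraction cannot wrap.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical userspace model, not kernel code. All values are in pages. */

/* Case (2) as literally written: lowtier usage vs. (max - toptier_high). */
static bool lowtier_over_static_share(unsigned long lowtier,
				      unsigned long max,
				      unsigned long toptier_high)
{
	return lowtier > max - toptier_high;
}

/* The variant using toptier *usage*: lowtier_usage >= max - toptier_usage. */
static bool lowtier_over_remaining(unsigned long toptier,
				   unsigned long lowtier,
				   unsigned long max)
{
	/* Assumes toptier <= max, so the subtraction does not wrap. */
	return lowtier >= max - toptier;
}

/* The plain memory.max-style condition on total usage. */
static bool total_over_max(unsigned long toptier, unsigned long lowtier,
			   unsigned long max)
{
	return toptier + lowtier >= max;
}

/* Exhaustively confirm the two conditions agree for small values. */
static void check_equivalence(unsigned long max)
{
	for (unsigned long top = 0; top <= max; top++)
		for (unsigned long low = 0; low <= max; low++)
			assert(lowtier_over_remaining(top, low, max) ==
			       total_over_max(top, low, max));
}
```

With max = 100 and toptier_high = 40, a workload using 0 toptier and 80
lowtier pages trips lowtier_over_static_share() but not total_over_max(),
which is the case where I don't think swapping is warranted.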

I think case 3 is a bit more nuanced. If we swap out directly from
toptier and skip demotions, we introduce a priority inversion: memory
in toptier should be hotter than memory in lowtier, so we should prefer
to swap out the colder memory in lowtier before swapping out memory in
toptier.
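
As a toy illustration of the inversion (again a userspace sketch, not
how vmscan actually selects pages): the struct, the "heat" score, and
the helper names below are all hypothetical, and the sample data simply
encodes the assumption that toptier pages are hotter than lowtier ones.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model: each page has a tier and a heat score (higher = hotter). */
enum tier { TOPTIER, LOWTIER };

struct page_model {
	enum tier tier;
	unsigned int heat;
};

/* Sample data assuming toptier holds the hotter pages. */
static const struct page_model sample[] = {
	{ TOPTIER, 90 }, { TOPTIER, 70 }, { LOWTIER, 30 }, { LOWTIER, 10 },
};

/* Index of the coldest page in the given tier, or -1 if the tier is empty. */
static int coldest_in_tier(const struct page_model *pages, size_t n,
			   enum tier t)
{
	int best = -1;

	for (size_t i = 0; i < n; i++) {
		if (pages[i].tier != t)
			continue;
		if (best < 0 || pages[i].heat < pages[best].heat)
			best = (int)i;
	}
	return best;
}

/* Prefer the coldest lowtier page; fall back to toptier only if lowtier is empty. */
static int pick_swap_victim(const struct page_model *pages, size_t n)
{
	int victim = coldest_in_tier(pages, n, LOWTIER);

	return victim >= 0 ? victim : coldest_in_tier(pages, n, TOPTIER);
}

/* Swapping straight from toptier would evict a hotter page than this victim. */
static void sanity_check(void)
{
	int top = coldest_in_tier(sample, 4, TOPTIER);
	int victim = pick_swap_victim(sample, 4);

	assert(sample[top].heat > sample[victim].heat);
}
```

Even the coldest toptier page in the sample (heat 70) is hotter than the
lowtier victim (heat 10), so swapping it out first would push out memory
the workload is more likely to touch again.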

The idea was discussed at length in [1]. It also feels like an orthogonal
discussion, since the behavior isn't related to the toptier high or low
behaviors.

Please let me know what you think. Thank you, I hope you have a great day!
Joshua

[1] https://lore.kernel.org/linux-mm/20260317230720.990329-3-bingjiao@xxxxxxxxxx/