Re: [RFC PATCH 6/6] mm/memcontrol: Make memory.high tier-aware

From: Joshua Hahn

Date: Thu Mar 12 2026 - 15:45:52 EST

On Wed, 11 Mar 2026 22:05:16 +0000 Bing Jiao <bingjiao@xxxxxxxxxx> wrote:

> On Mon, Feb 23, 2026 at 02:38:29PM -0800, Joshua Hahn wrote:
> > @@ -4485,15 +4527,22 @@ static ssize_t memory_high_write(struct kernfs_open_file *of,
> >  		return err;
> >
> >  	page_counter_set_high(&memcg->memory, high);
> > +	toptier_high = page_counter_toptier_high(&memcg->memory);
> >
> >  	if (of->file->f_flags & O_NONBLOCK)
> >  		goto out;
> >
> >  	for (;;) {
> >  		unsigned long nr_pages = page_counter_read(&memcg->memory);
> > +		unsigned long toptier_pages = mem_cgroup_toptier_usage(memcg);
> >  		unsigned long reclaimed;
> > +		unsigned long to_free;
> > +		nodemask_t toptier_nodes, *reclaim_nodes;
> > +		bool mem_high_ok = nr_pages <= high;
> > +		bool toptier_high_ok = !(tier_aware_memcg_limits &&
> > +					 toptier_pages > toptier_high);
> >
> > -		if (nr_pages <= high)
> > +		if (mem_high_ok && toptier_high_ok)
> >  			break;
> >
> >  		if (signal_pending(current))
> > @@ -4505,8 +4554,17 @@ static ssize_t memory_high_write(struct kernfs_open_file *of,
> >  			continue;
> >  		}
> >
> > -		reclaimed = try_to_free_mem_cgroup_pages(memcg, nr_pages - high,
> > -					GFP_KERNEL, MEMCG_RECLAIM_MAY_SWAP, NULL);
> > +		mt_get_toptier_nodemask(&toptier_nodes, NULL);
> > +		if (mem_high_ok && !toptier_high_ok) {
> > +			reclaim_nodes = &toptier_nodes;
> > +			to_free = toptier_pages - toptier_high;
> > +		} else {
> > +			reclaim_nodes = NULL;
> > +			to_free = nr_pages - high;
> > +		}
> > +		reclaimed = try_to_free_mem_cgroup_pages(memcg, to_free,
> > +					GFP_KERNEL, MEMCG_RECLAIM_MAY_SWAP,
> > +					NULL, reclaim_nodes);
> >
> >  		if (!reclaimed && !nr_retries--)
> >  			break;
>
> Hi Joshua, thanks for the patch.

Hello Bing!

I hope you are doing well, thank you for reviewing my patch :-)

> I have a concern regarding the system behavior when both the total
> memory.high limit and the new toptier_high limit are breached.
>
> If both mem_high_ok and toptier_high_ok are false, memory_high_write()
> invokes try_to_free_mem_cgroup_pages() with reclaim_nodes set to NULL
> to target all nodes. Under these conditions, the reclaimer might attempt
> to satisfy the target bytes by demoting pages from the top-tier to lower
> tiers. While this fulfills the toptier_high requirement, it fails to
> reduce the total memory charge for the cgroup because the counter tracks
> the sum across all tiers. Consequently, since the total memory usage
> remains unchanged, the reclaimer will likely become trapped in the loop
> until it reaches MAX_RECLAIM_RETRIES and other situations (e.g.,
> both !reclaimed && !nr_retries--), leading to excessive CPU consumption
> without successfully bringing the cgroup below its total memory limit,
> or causing all top-tier pages demoted to far-tier, or causing premature
> OOM kills.

I agree with everything you mentioned above. However, I would like to note
that my series preserves the existing behavior when memory.high is breached
(toptier_high is always <= memory.high). Even without this series,
memory_high_write() can hit the same problem: shrink_folio_list() prefers
demotion over swapping, so the loop can keep retrying without reducing the
total charge.

In that sense, I think it makes sense to introduce a fix for this that is
orthogonal to this series. AFAICT this series does not introduce any new
harmful behaviors.
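
To make the "no new behavior" point concrete, here is a small userspace model
of the target selection in the hunk above. It is only a sketch: pick_target(),
enum reclaim_scope and struct reclaim_target are names made up for
illustration (the real code uses a nodemask pointer), and tier_aware stands in
for tier_aware_memcg_limits.

```c
#include <assert.h>
#include <stdbool.h>

/* Stand-in for the nodemask argument: which nodes reclaim may touch. */
enum reclaim_scope { SCOPE_NONE, SCOPE_TOPTIER_ONLY, SCOPE_ALL_NODES };

struct reclaim_target {
	enum reclaim_scope scope;
	unsigned long to_free;	/* in pages */
};

/* Hypothetical helper mirroring the if/else in the quoted hunk. */
static struct reclaim_target pick_target(unsigned long nr_pages,
					 unsigned long high,
					 unsigned long toptier_pages,
					 unsigned long toptier_high,
					 bool tier_aware)
{
	bool mem_high_ok = nr_pages <= high;
	bool toptier_high_ok = !(tier_aware && toptier_pages > toptier_high);
	struct reclaim_target t = { SCOPE_NONE, 0 };

	/* Assumption taken from this thread: toptier_high <= memory.high. */
	assert(toptier_high <= high);

	if (mem_high_ok && toptier_high_ok)
		return t;	/* both limits met, the loop would break */

	if (mem_high_ok && !toptier_high_ok) {
		/* Only the toptier limit is breached: restrict to toptier. */
		t.scope = SCOPE_TOPTIER_ONLY;
		t.to_free = toptier_pages - toptier_high;
	} else {
		/* memory.high is breached: same request as before the series. */
		t.scope = SCOPE_ALL_NODES;
		t.to_free = nr_pages - high;
	}
	return t;
}
```

Whenever memory.high is breached, the model returns the all-nodes scope with
nr_pages - high, regardless of the toptier state, which is the request the
old code made.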

> Given your tier-aware memcg limits, I think it is better to reclaim from
> lower tiers to swap to satisfy mem_high_ok by setting the allowed nodemask
> to far-tier nodes. Then demote pages from top tiers to ensure
> toptier_high is okay. This also prevents reclaiming pages directly from
> top tiers to swap and ensures that demotion actually contributes to
> reaching the targeted memory state without unnecessary performance
> penalties.

If I understand this correctly, this would mean that each loop iteration
would:
1. swap out pages from the lower tier
2. demote pages from the top tier

And repeat this cycle until we meet the memory.high limit?
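
As a sanity check of that cycle, here is a toy userspace model. It is a
simplification with made-up names (struct memcg_model, demote_top(),
swap_out_low(), reclaim_pass()), and it assumes that swapping a page out
removes it from the total charge while demotion only moves it between tiers.

```c
#include <assert.h>

struct memcg_model {
	unsigned long top;	/* pages charged on top-tier nodes */
	unsigned long low;	/* pages charged on lower-tier nodes */
};

static unsigned long min_ul(unsigned long a, unsigned long b)
{
	return a < b ? a : b;
}

static unsigned long total(const struct memcg_model *m)
{
	return m->top + m->low;
}

/* Demotion moves pages to the lower tier; the total charge is unchanged. */
static unsigned long demote_top(struct memcg_model *m, unsigned long n)
{
	n = min_ul(n, m->top);
	m->top -= n;
	m->low += n;
	return n;
}

/* Swapping out lower-tier pages removes them from the total charge. */
static unsigned long swap_out_low(struct memcg_model *m, unsigned long n)
{
	n = min_ul(n, m->low);
	m->low -= n;
	return n;
}

/* One iteration of the cycle; returns pages removed from the total. */
static unsigned long reclaim_pass(struct memcg_model *m, unsigned long high,
				  unsigned long toptier_high)
{
	unsigned long freed = 0;

	if (total(m) > high)
		freed = swap_out_low(m, total(m) - high);	/* step 1 */
	if (m->top > toptier_high)
		demote_top(m, m->top - toptier_high);		/* step 2 */
	return freed;
}
```

With top = 150, low = 20, high = 100 and toptier_high = 60, the first pass can
only swap out the 20 lower-tier pages and then demotes 90; the second pass
swaps out the remaining 50, leaving top = 60 and low = 40. Calling only
demote_top() on the same starting state leaves total() at 170.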

I think this makes sense. I will note once again that I think this change
is orthogonal to this series, as it deals with the memory.high violation
case and not the toptier violation case. Note that if only the toptier
limit is violated, demotion from the toptier does make sense, since in this
case it shrinks the metric we care about.

> To address the issue where a memcg exceeds its total limit and demotion
> cannot help relieve the memcg memory pressure, I am considering
> introducing a reclaim_options setting that prevents page demotion by
> setting sc.no_demote = 1. I have a local patch for this and am preparing
> it for submission.

I think this makes sense. Please do CC me in the patch if/when you do
send it upstream!
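
Just to check that I am picturing the same thing, below is a rough userspace
sketch of how I imagine such an option being plumbed. Everything here is
hypothetical: the MODEL_* flag names and values, struct scan_control_model,
make_sc() and options_for() are invented for illustration and are not taken
from your patch or from the kernel.

```c
#include <stdbool.h>

/* Hypothetical option bits; the names and values are illustrative only. */
#define MODEL_RECLAIM_MAY_SWAP	(1u << 0)
#define MODEL_RECLAIM_NO_DEMOTE	(1u << 1)

/* Minimal stand-in for the scan_control fields discussed in this thread. */
struct scan_control_model {
	bool may_swap;
	bool no_demote;
};

/* Translate caller-provided reclaim options into scan control state. */
static struct scan_control_model make_sc(unsigned int reclaim_options)
{
	struct scan_control_model sc = {
		.may_swap = !!(reclaim_options & MODEL_RECLAIM_MAY_SWAP),
		.no_demote = !!(reclaim_options & MODEL_RECLAIM_NO_DEMOTE),
	};
	return sc;
}

/*
 * Caller side: when the total limit is breached, demotion cannot lower
 * the total charge, so ask reclaim to skip it.
 */
static unsigned int options_for(bool mem_high_ok)
{
	unsigned int opts = MODEL_RECLAIM_MAY_SWAP;

	if (!mem_high_ok)
		opts |= MODEL_RECLAIM_NO_DEMOTE;
	return opts;
}
```

In this model the toptier-only violation case keeps demotion enabled, which
matches the case where demotion does shrink the metric we care about.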

> Please let me know if I have misunderstood any part of your
> implementation or if you see any issues with this proposed adjustment.

I think you understood my patch completely as I intended :-)
From my POV though, I just felt that the issues you mentioned actually have
to do with the standard memory reclaim infrastructure, and not necessarily
with the toptier high semantics.

And please let me know if you feel that I have not represented your
perspective well! I hope you have a great day!!
Joshua