Re: [RFC PATCH 6/6] mm/memcontrol: Make memory.high tier-aware

From: Bing Jiao

Date: Wed Mar 11 2026 - 18:05:31 EST


On Mon, Feb 23, 2026 at 02:38:29PM -0800, Joshua Hahn wrote:
> @@ -4485,15 +4527,22 @@ static ssize_t memory_high_write(struct kernfs_open_file *of,
> return err;
>
> page_counter_set_high(&memcg->memory, high);
> + toptier_high = page_counter_toptier_high(&memcg->memory);
>
> if (of->file->f_flags & O_NONBLOCK)
> goto out;
>
> for (;;) {
> unsigned long nr_pages = page_counter_read(&memcg->memory);
> + unsigned long toptier_pages = mem_cgroup_toptier_usage(memcg);
> unsigned long reclaimed;
> + unsigned long to_free;
> + nodemask_t toptier_nodes, *reclaim_nodes;
> + bool mem_high_ok = nr_pages <= high;
> + bool toptier_high_ok = !(tier_aware_memcg_limits &&
> + toptier_pages > toptier_high);
>
> - if (nr_pages <= high)
> + if (mem_high_ok && toptier_high_ok)
> break;
>
> if (signal_pending(current))
> @@ -4505,8 +4554,17 @@ static ssize_t memory_high_write(struct kernfs_open_file *of,
> continue;
> }
>
> - reclaimed = try_to_free_mem_cgroup_pages(memcg, nr_pages - high,
> - GFP_KERNEL, MEMCG_RECLAIM_MAY_SWAP, NULL);
> + mt_get_toptier_nodemask(&toptier_nodes, NULL);
> + if (mem_high_ok && !toptier_high_ok) {
> + reclaim_nodes = &toptier_nodes;
> + to_free = toptier_pages - toptier_high;
> + } else {
> + reclaim_nodes = NULL;
> + to_free = nr_pages - high;
> + }
> + reclaimed = try_to_free_mem_cgroup_pages(memcg, to_free,
> + GFP_KERNEL, MEMCG_RECLAIM_MAY_SWAP,
> + NULL, reclaim_nodes);
>
> if (!reclaimed && !nr_retries--)
> break;

Hi Joshua, thanks for the patch.

I have a concern regarding the system behavior when both the total
memory.high limit and the new toptier_high limit are breached.

If both mem_high_ok and toptier_high_ok are false, memory_high_write()
invokes try_to_free_mem_cgroup_pages() with reclaim_nodes set to NULL,
so all nodes are targeted. In that case the reclaimer may try to meet
the reclaim target by demoting pages from the top tier to lower tiers.
That satisfies toptier_high, but it does not reduce the cgroup's total
memory charge, because the counter tracks the sum across all tiers.
Since total usage is unchanged, the loop will likely keep iterating
until the MAX_RECLAIM_RETRIES budget runs out or another exit condition
is hit (e.g., !reclaimed && !nr_retries--). The result can be excessive
CPU consumption without bringing the cgroup below its total memory
limit, all top-tier pages being demoted to the far tier, or premature
OOM kills.
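
To make the concern concrete, here is a toy userspace model (not kernel
code). All toy_* names are hypothetical, and the model assumes that
every pass makes progress only by demotion and that demoted pages are
reported back as reclaimed:

```c
#include <assert.h>

/* Toy userspace model, NOT kernel code. All toy_* names are hypothetical. */
struct toy_memcg {
	unsigned long toptier_pages;	/* pages charged on top-tier nodes */
	unsigned long lowtier_pages;	/* pages charged on lower-tier nodes */
};

/* Total charge, i.e. the value that memory.high is compared against. */
static unsigned long toy_total(const struct toy_memcg *cg)
{
	return cg->toptier_pages + cg->lowtier_pages;
}

/* Demotion moves pages to the lower tier; the total charge is unchanged. */
static unsigned long toy_demote(struct toy_memcg *cg, unsigned long nr)
{
	if (nr > cg->toptier_pages)
		nr = cg->toptier_pages;
	cg->toptier_pages -= nr;
	cg->lowtier_pages += nr;
	return nr;
}

/*
 * Model of the retry loop when each pass only demotes (assumption).
 * Returns the number of passes made before the loop gives up.
 */
static int toy_high_loop_demote_only(struct toy_memcg *cg, unsigned long high,
				     int nr_retries)
{
	int passes = 0;

	while (toy_total(cg) > high) {
		unsigned long moved = toy_demote(cg, toy_total(cg) - high);

		passes++;
		/* Same shape as the exit check in memory_high_write(). */
		if (!moved && !nr_retries--)
			break;
	}
	return passes;
}
```

In this model the loop exits only once the retry budget is used up: the
top tier is emptied along the way while toy_total() never changes.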

Given your tier-aware memcg limits, I think it would be better to first
reclaim from the lower tiers to swap, by restricting the allowed
nodemask to far-tier nodes, until mem_high_ok is satisfied, and then
demote pages from the top tiers until toptier_high is satisfied. This
also avoids reclaiming pages directly from the top tiers to swap, and it
ensures that demotion actually contributes to reaching the targeted
memory state without unnecessary performance penalties.
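
A rough sketch of that ordering as plain arithmetic (userspace C, not a
patch). The toy_* names are hypothetical, and the handling of a lower
tier that is too small to cover the whole excess is an assumption that
the proposal above does not spell out:

```c
#include <assert.h>

/* Hypothetical plan type for illustration; not part of any patch. */
struct toy_reclaim_plan {
	unsigned long swap_lowtier;	/* reclaim to swap, far-tier nodes only */
	unsigned long swap_toptier;	/* shortfall if lower tiers are too small */
	unsigned long demote;		/* top-tier to lower-tier demotion */
};

static unsigned long toy_excess(unsigned long usage, unsigned long limit)
{
	return usage > limit ? usage - limit : 0;
}

/* Assumes toptier_pages <= nr_pages; all values are page counts. */
static struct toy_reclaim_plan toy_plan(unsigned long nr_pages,
					unsigned long high,
					unsigned long toptier_pages,
					unsigned long toptier_high)
{
	struct toy_reclaim_plan plan = { 0, 0, 0 };
	unsigned long lowtier_pages = nr_pages - toptier_pages;
	unsigned long to_free = toy_excess(nr_pages, high);

	/* Step 1: lower the total charge from the lower tiers first. */
	plan.swap_lowtier = to_free < lowtier_pages ? to_free : lowtier_pages;
	/* Assumption: any remainder has to come from the top tier. */
	plan.swap_toptier = to_free - plan.swap_lowtier;

	/* Step 2: demote whatever still exceeds toptier_high. */
	plan.demote = toy_excess(toptier_pages - plan.swap_toptier,
				 toptier_high);
	return plan;
}
```

For example, with nr_pages = 1000, high = 800, toptier_pages = 600 and
toptier_high = 400, the plan swaps 200 pages out of the lower tier and
then demotes 200 pages, which leaves both limits satisfied.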

To address the case where a memcg exceeds its total limit and demotion
cannot help relieve the memcg's memory pressure, I am considering
introducing a reclaim_options setting that prevents page demotion by
setting sc.no_demote = 1. I have a local patch for this and am preparing
it for submission.
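
Purely as an illustration of the idea, and not the local patch itself:
the flag name TOY_RECLAIM_NO_DEMOTE, the bit values and the helper names
below are all hypothetical stand-ins, written as userspace C:

```c
#include <assert.h>

/* Toy stand-ins; the flag names and bit values here are hypothetical. */
#define TOY_RECLAIM_MAY_SWAP	(1U << 0)
#define TOY_RECLAIM_NO_DEMOTE	(1U << 1)

/* Minimal stand-in for the two scan_control bits that matter here. */
struct toy_scan_control {
	unsigned int may_swap:1;
	unsigned int no_demote:1;
};

/* Translate reclaim option flags into scan_control-style bits. */
static struct toy_scan_control toy_sc_from_options(unsigned int reclaim_options)
{
	struct toy_scan_control sc = {
		.may_swap = !!(reclaim_options & TOY_RECLAIM_MAY_SWAP),
		.no_demote = !!(reclaim_options & TOY_RECLAIM_NO_DEMOTE),
	};
	return sc;
}

/* Pick options for one pass of the memory.high loop. */
static unsigned int toy_pick_options(int mem_high_ok)
{
	unsigned int opts = TOY_RECLAIM_MAY_SWAP;

	/* Over the total limit: demotion cannot lower the charge, so forbid it. */
	if (!mem_high_ok)
		opts |= TOY_RECLAIM_NO_DEMOTE;
	return opts;
}
```

The caller would request the no-demote behavior only while the total
limit is exceeded, so demotion stays available when toptier_high is the
only limit being breached.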

Please let me know if I have misunderstood any part of your
implementation or if you see any issues with this proposed adjustment.

Best,
Bing