Re: [PATCH] mm: add memory.compact_unevictable_allowed cgroup attribute

From: Daniil Tatianin

Date: Wed Mar 18 2026 - 05:12:41 EST

On 3/18/26 11:25 AM, Michal Hocko wrote:

On Tue 17-03-26 23:17:28, Daniil Tatianin wrote:

On 3/17/26 10:17 PM, Andrew Morton wrote:

On Tue, 17 Mar 2026 13:00:58 +0300 Daniil Tatianin <d-tatianin@xxxxxxxxxxxxxx> wrote:

The current global sysctl compact_unevictable_allowed is too coarse.
In environments with mixed workloads, we may want to protect specific
important cgroups from compaction to ensure their stability and
responsiveness, while allowing compaction for others.

This patch introduces a per-memcg compact_unevictable_allowed attribute.
This allows granular control over whether unevictable pages in a specific
cgroup can be compacted. The global sysctl still takes precedence if set
to disallow compaction, but this new setting allows opting out specific
cgroups.
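
For readers following the thread, the precedence described above can be modeled in a few lines of userspace C. This is only a sketch of the rule as stated in the description, not the patch itself: struct memcg_model and may_compact_unevictable() are hypothetical names, and treating a page with no memcg as following the global sysctl is an assumption.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the memcg; only the new attribute is modeled. */
struct memcg_model {
	bool compact_unevictable_allowed;
};

/*
 * Assumed rule: a global "disallow" always wins. Otherwise the owning
 * cgroup may opt out. A page without a memcg (assumption) follows the
 * global sysctl.
 */
bool may_compact_unevictable(bool global_allowed,
			     const struct memcg_model *memcg)
{
	if (!global_allowed)
		return false;
	if (memcg == NULL)
		return true;
	return memcg->compact_unevictable_allowed;
}

/* The four combinations of the global and per-cgroup settings. */
void check_policy_examples(void)
{
	const struct memcg_model allow = { .compact_unevictable_allowed = true };
	const struct memcg_model deny = { .compact_unevictable_allowed = false };

	assert(may_compact_unevictable(true, &allow));
	assert(!may_compact_unevictable(true, &deny));   /* cgroup opts out */
	assert(!may_compact_unevictable(false, &allow)); /* global wins */
	assert(!may_compact_unevictable(false, &deny));
}
```

Under this model the per-cgroup attribute can only narrow what the global sysctl permits; it can never re-enable compaction that the sysctl forbids.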

This also adds a new ISOLATE_UNEVICTABLE_CHECK_MEMCG flag to
isolate_migratepages_block to preserve the old behavior for the
ISOLATE_UNEVICTABLE flag unconditionally used by
isolate_migratepages_range.

AI review asked questions:
https://sashiko.dev/#/patchset/20260317100058.2316997-1-d-tatianin@xxxxxxxxxxxxxx
Should this dynamically walk up the ancestor chain during evaluation to
ensure it returns false if any ancestor has disallowed compaction?

I think ultimately it's up to cgroup maintainers whether the code should do
that, but as far as I understand the whole point of cgroups is that a child
can override the settings of its parent. Moreover, this property doesn't
have CFTYPE_NS_DELEGATABLE set, so a child cgroup cannot just toggle it at
will.

In general any attribute should have proper hierarchical semantics. I
am not sure what they should be in this case. What is desirable in a
child cgroup can become fragmentation pressure on others.
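
The two semantics being weighed here can be contrasted with a small userspace model. It is a sketch only: struct memcg_node and both helpers are hypothetical, and neither is claimed to match what the patch does or what the cgroup maintainers would prefer.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical cgroup node: a parent link plus the per-cgroup attribute. */
struct memcg_node {
	const struct memcg_node *parent; /* NULL for the root */
	bool compact_unevictable_allowed;
};

/* Local semantics: only the cgroup's own value matters. */
bool allowed_local(const struct memcg_node *cg)
{
	return cg->compact_unevictable_allowed;
}

/* Hierarchical semantics: any ancestor that disallows wins. */
bool allowed_hierarchical(const struct memcg_node *cg)
{
	for (; cg != NULL; cg = cg->parent)
		if (!cg->compact_unevictable_allowed)
			return false;
	return true;
}

/* A child that allows compaction underneath a parent that disallows it. */
void check_hierarchy_example(void)
{
	const struct memcg_node root = { NULL, true };
	const struct memcg_node parent = { &root, false };
	const struct memcg_node child = { &parent, true };

	assert(allowed_local(&child));         /* child overrides parent */
	assert(!allowed_hierarchical(&child)); /* parent's "no" propagates */
}
```

The walk costs one step per ancestor on every evaluation, so a real implementation might prefer to cache an effective value when the attribute is written; that is a design guess, not something stated in the thread.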

>
> I think it would be really important to explain more thoroughly about
> those usecases of mixed workloads.

I think there are many examples of systems where one process is more important than
the others. For example, any sort of healthcheck or even the ssh daemon: these may become
unresponsive during heavy compaction due to thousands of TLB invalidation IPIs or page faults
on pages that are being compacted. Another example is a VM that is responsible for routing
the traffic of all other VMs or even the entire cluster. You really want to prioritize its
responsiveness while still allowing memory compaction for the rest of the system, i.e. for
less important VMs, services, etc.

> Is the memcg even a suitable level of
> abstraction for this tunable?

In my opinion it is, since it is relatively common to put all related tasks into one cgroup with preset memory limits etc.

> Doesn't this belong to tasks if anything?

I think it would be very difficult to implement this properly as a per-task attribute, since compaction works at the folio
level. While folios have a pointer to the memcg that owns them, they may be mapped by multiple processes in the case
of shared memory. We would have to find all the address spaces mapping a folio and then check the property on
every one of them, and they may be set to different values. Doing this for every physical page may be
problematic performance-wise, and it also introduces unclear semantics if different address spaces mapping the same
page have different opinions.
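
To make the difference concrete, here is a rough userspace model of the two lookups. Everything in it is hypothetical (struct task_model, struct folio_model, the verdict enum); in the kernel the per-task variant would need something like a reverse-mapping walk to find the mappers, which is only represented here by a plain array.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-task setting. */
struct task_model {
	bool compact_unevictable_allowed;
};

/* Hypothetical folio: one owning memcg value, any number of mappers. */
struct folio_model {
	bool memcg_allowed;                      /* read via the single owner */
	const struct task_model *const *mappers; /* stand-in for an rmap walk */
	size_t nr_mappers;
};

enum verdict { VERDICT_ALLOW, VERDICT_DENY, VERDICT_CONFLICT };

/* Per-memcg: a single lookup, always a definite answer. */
enum verdict memcg_verdict(const struct folio_model *f)
{
	return f->memcg_allowed ? VERDICT_ALLOW : VERDICT_DENY;
}

/* Per-task: visits every mapper and may find that they disagree. */
enum verdict per_task_verdict(const struct folio_model *f)
{
	bool any_allow = false, any_deny = false;

	for (size_t i = 0; i < f->nr_mappers; i++) {
		if (f->mappers[i]->compact_unevictable_allowed)
			any_allow = true;
		else
			any_deny = true;
	}
	if (any_allow && any_deny)
		return VERDICT_CONFLICT;
	return any_deny ? VERDICT_DENY : VERDICT_ALLOW;
}

/* A shared folio mapped by two tasks with opposite settings. */
void check_shared_folio_example(void)
{
	const struct task_model a = { true }, b = { false };
	const struct task_model *const both[] = { &a, &b };
	const struct folio_model f = { false, both, 2 };

	assert(memcg_verdict(&f) == VERDICT_DENY);
	assert(per_task_verdict(&f) == VERDICT_CONFLICT);
}
```

The conflict case is the unclear part: some policy would have to be chosen for it, and the loop itself runs once per mapper for every candidate page.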

(resend because of html formatting in the previous email)