Re: [PATCH v3] mm/vmscan: fix demotion targets checks in reclaim/demotion

From: Waiman Long

Date: Fri Dec 26 2025 - 15:24:42 EST


On 12/23/25 4:19 PM, Bing Jiao wrote:
Fix two bugs in demote_folio_list() and can_demote() due to incorrect
demotion target checks in reclaim/demotion.

Commit 7d709f49babc ("vmscan,cgroup: apply mems_effective to reclaim")
introduces the cpuset.mems_effective check and applies it to
can_demote(). However:

1. It does not apply this check in demote_folio_list(), which leads
to situations where pages are demoted to nodes that are
explicitly excluded from the task's cpuset.mems.

2. It checks only the nodes in the immediate next demotion hierarchy
and does not check all allowed demotion targets in can_demote().
This can cause pages to never be demoted if the nodes in the next
demotion hierarchy are not set in mems_effective.

These bugs break resource isolation provided by cpuset.mems.
This is visible from userspace because pages can either fail to be
demoted entirely or are demoted to nodes that are not allowed
in multi-tier memory systems.

To address these bugs, update cpuset_node_allowed() and
mem_cgroup_node_allowed() to return the effective_mems nodemask, so
that it can be ANDed directly with the demotion targets. Also update
can_demote() and demote_folio_list() accordingly.
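
The intersection described above can be illustrated with a small userspace sketch. It is not the patch's code: the 64-bit mask type and the `sketch_*` names are hypothetical stand-ins for nodemask_t and the kernel helpers. The point it shows is that the set of usable demotion targets is the candidate set ANDed with the allowed set, and that the candidate set must cover every lower tier rather than only the next one.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for nodemask_t: one bit per node, nodes 0..63. */
typedef uint64_t sketch_mask_t;

/* Build a mask containing only node nid. */
static inline sketch_mask_t sketch_node_bit(int nid)
{
	assert(nid >= 0 && nid < 64);
	return UINT64_C(1) << nid;
}

/* Usable demotion targets: candidate nodes that are also allowed. */
static inline sketch_mask_t sketch_demotion_targets(sketch_mask_t candidates,
						    sketch_mask_t allowed)
{
	return candidates & allowed;
}

/* Demotion is possible only if at least one usable target remains. */
static inline bool sketch_can_demote(sketch_mask_t candidates,
				     sketch_mask_t allowed)
{
	return sketch_demotion_targets(candidates, allowed) != 0;
}
```

With the six-node layout of the Bug 2 reproducer (allowed nodes 0-2 and 4-5), passing only the next tier (node 3) as candidates leaves no usable target, while passing all lower tiers (nodes 3-5) leaves nodes 4-5. With the Bug 1 layout (allowed nodes 0-2, candidates 2-3), the intersection keeps node 2 and drops node 3.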

Reproduce Bug 1:
Assume a system with 4 nodes, where nodes 0-1 are top-tier and
nodes 2-3 are far-tier memory. All nodes have equal capacity.

Test script:
echo 1 > /sys/kernel/mm/numa/demotion_enabled
mkdir /sys/fs/cgroup/test
echo +cpuset > /sys/fs/cgroup/cgroup.subtree_control
echo "0-2" > /sys/fs/cgroup/test/cpuset.mems
echo $$ > /sys/fs/cgroup/test/cgroup.procs
swapoff -a
# Expectation: Should respect node 0-2 limit.
# Observation: Node 3 shows significant allocation (MemFree drops)
stress-ng --oomable --vm 1 --vm-bytes 150% --mbind 0,1

Reproduce Bug 2:
Assume a system with 6 nodes, where nodes 0-2 are top-tier,
node 3 is a far-tier node, and nodes 4-5 are the farthest-tier nodes.
All nodes have equal capacity.

Test script:
echo 1 > /sys/kernel/mm/numa/demotion_enabled
mkdir /sys/fs/cgroup/test
echo +cpuset > /sys/fs/cgroup/cgroup.subtree_control
echo "0-2,4-5" > /sys/fs/cgroup/test/cpuset.mems
echo $$ > /sys/fs/cgroup/test/cgroup.procs
swapoff -a
# Expectation: Pages are demoted to Nodes 4-5
# Observation: No pages are demoted before oom.
stress-ng --oomable --vm 1 --vm-bytes 150% --mbind 0,1,2

Fixes: 7d709f49babc ("vmscan,cgroup: apply mems_effective to reclaim")
Cc: <stable@xxxxxxxxxxxxxxx>
Signed-off-by: Bing Jiao <bingjiao@xxxxxxxxxx>
---
include/linux/cpuset.h | 6 +++---
include/linux/memcontrol.h | 6 +++---
kernel/cgroup/cpuset.c | 16 ++++++++--------
mm/memcontrol.c | 6 ++++--
mm/vmscan.c | 35 +++++++++++++++++++++++------------
5 files changed, 41 insertions(+), 28 deletions(-)

diff --git a/include/linux/cpuset.h b/include/linux/cpuset.h
index a98d3330385c..eb358c3aa9c0 100644
--- a/include/linux/cpuset.h
+++ b/include/linux/cpuset.h
@@ -174,7 +174,7 @@ static inline void set_mems_allowed(nodemask_t nodemask)
 	task_unlock(current);
 }

-extern bool cpuset_node_allowed(struct cgroup *cgroup, int nid);
+extern nodemask_t cpuset_node_get_allowed(struct cgroup *cgroup);
 #else /* !CONFIG_CPUSETS */

 static inline bool cpusets_enabled(void) { return false; }
@@ -301,9 +301,9 @@ static inline bool read_mems_allowed_retry(unsigned int seq)
 	return false;
 }

-static inline bool cpuset_node_allowed(struct cgroup *cgroup, int nid)
+static inline nodemask_t cpuset_node_get_allowed(struct cgroup *cgroup)
 {
-	return true;
+	return node_possible_map;
 }

The nodemask_t type can be large depending on the setting of CONFIG_NODES_SHIFT. Passing a large data structure by value on the stack may not be a good idea. You can return a pointer to nodemask_t instead. In that case, you will have to add a "const" qualifier to the return type to make sure that the node mask won't get accidentally modified. Alternatively, you can pass a nodemask_t pointer as an output parameter and copy out the nodemask_t data.

The name "cpuset_node_get_allowed" doesn't fit the cpuset naming convention. There is a "cpuset_mems_allowed(struct task_struct *)" to return "mems_allowed" of a task. This new helper is for returning the mems_allowed defined in the cpuset. Perhaps we could just use "cpuset_nodes_allowed(struct cgroup *)".
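
To make the two alternatives to a by-value return concrete, here is a minimal userspace sketch. Everything in it is a hypothetical stand-in: `sketch_nodemask_t` mimics a large nodemask_t (the 1024-node size is an assumption), `struct sketch_cpuset` is not the real struct cpuset, and the `sketch_*` helpers are not existing kernel functions.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Userspace stand-in for nodemask_t; the node count is an assumption. */
#define SKETCH_MAX_NODES 1024
#define SKETCH_WORD_BITS (8 * sizeof(unsigned long))
#define SKETCH_WORDS ((SKETCH_MAX_NODES + SKETCH_WORD_BITS - 1) / SKETCH_WORD_BITS)

typedef struct {
	unsigned long bits[SKETCH_WORDS];
} sketch_nodemask_t;

/* Hypothetical cpuset-like object owning an effective memory node mask. */
struct sketch_cpuset {
	sketch_nodemask_t effective_mems;
};

static inline void sketch_node_set(int nid, sketch_nodemask_t *mask)
{
	assert(nid >= 0 && nid < SKETCH_MAX_NODES);
	mask->bits[nid / SKETCH_WORD_BITS] |= 1UL << (nid % SKETCH_WORD_BITS);
}

static inline bool sketch_node_isset(int nid, const sketch_nodemask_t *mask)
{
	assert(nid >= 0 && nid < SKETCH_MAX_NODES);
	return (mask->bits[nid / SKETCH_WORD_BITS] >> (nid % SKETCH_WORD_BITS)) & 1UL;
}

/* Option 1: return a pointer to const; nothing is copied and callers cannot modify the mask. */
static inline const sketch_nodemask_t *
sketch_nodes_allowed_ptr(const struct sketch_cpuset *cs)
{
	return &cs->effective_mems;
}

/* Option 2: the caller supplies storage and the mask is copied out. */
static inline void sketch_nodes_allowed_copy(const struct sketch_cpuset *cs,
					     sketch_nodemask_t *out)
{
	memcpy(out, &cs->effective_mems, sizeof(*out));
}
```

The pointer variant avoids the copy but presumes the mask stays valid and unchanged while the caller reads it; how the kernel would guarantee that is outside this sketch. The copy-out variant gives the caller a private snapshot at the cost of one memcpy into caller-provided storage.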

Cheers,
Longman