It is possible for a reclaimer to cause demotions of an lruvec belonging
to a cgroup with cpuset.mems set to exclude some nodes. Attempt to apply
this limitation based on the lruvec's memcg and prevent demotion.
Notably, this may still allow demotion of shared libraries or any memory
first instantiated in another cgroup. This means cpusets still
cannot guarantee complete isolation when demotion is enabled, and the
docs have been updated to reflect this.
Note: This is a fairly hacked up method that probably overlooks some
cgroup/cpuset controls or designs. RFCing now for some discussion
at LSFMM '25.
Signed-off-by: Gregory Price <gourry@xxxxxxxxxx>
---
 .../ABI/testing/sysfs-kernel-mm-numa | 14 +++++---
 include/linux/cpuset.h               |  2 ++
 kernel/cgroup/cpuset.c               | 10 ++++++
 mm/vmscan.c                          | 32 ++++++++++++-------
 4 files changed, 41 insertions(+), 17 deletions(-)
diff --git a/Documentation/ABI/testing/sysfs-kernel-mm-numa b/Documentation/ABI/testing/sysfs-kernel-mm-numa
index 77e559d4ed80..27cdcab901f7 100644
--- a/Documentation/ABI/testing/sysfs-kernel-mm-numa
+++ b/Documentation/ABI/testing/sysfs-kernel-mm-numa
@@ -16,9 +16,13 @@ Description: Enable/disable demoting pages during reclaim
Allowing page migration during reclaim enables these
systems to migrate pages from fast tiers to slow tiers
when the fast tier is under pressure. This migration
- is performed before swap. It may move data to a NUMA
- node that does not fall into the cpuset of the
- allocating process which might be construed to violate
- the guarantees of cpusets. This should not be enabled
- on systems which need strict cpuset location
+ is performed before swap if an eligible NUMA node is
+ present in cpuset.mems for the cgroup. If cpuset.mems
+ changes at runtime, it may move data to a NUMA node that
+ does not fall into the new cpuset.mems, which might be
+ construed to violate the guarantees of cpusets. Shared
+ memory, such as libraries, owned by another cgroup may
+ still be demoted and result in memory use on a node not
+ present in cpuset.mems. This should not be enabled on
+ systems which need strict cpuset location
guarantees.
diff --git a/include/linux/cpuset.h b/include/linux/cpuset.h
index 835e7b793f6a..d4169f1b1719 100644
--- a/include/linux/cpuset.h
+++ b/include/linux/cpuset.h
@@ -171,6 +171,8 @@ static inline void set_mems_allowed(nodemask_t nodemask)
task_unlock(current);
}
+bool memcg_mems_allowed(struct mem_cgroup *memcg, int nid);
+
#else /* !CONFIG_CPUSETS */
static inline bool cpusets_enabled(void) { return false; }
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index 0f910c828973..bb9669cc105d 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -4296,3 +4296,13 @@ void cpuset_task_status_allowed(struct seq_file *m, struct task_struct *task)
seq_printf(m, "Mems_allowed_list:\t%*pbl\n",
nodemask_pr_args(&task->mems_allowed));
}
+
+bool memcg_mems_allowed(struct mem_cgroup *memcg, int nid)
+{
+ struct cgroup_subsys_state *css;
+ struct cpuset *cs;
+
+ css = cgroup_get_e_css(memcg->css.cgroup, &cpuset_cgrp_subsys);
+ cs = css ? container_of(css, struct cpuset, css) : NULL;
+ return cs ? node_isset(nid, cs->effective_mems) : true;