[PATCH v5 20/24] virt/steal_monitor: Provide default method to inc/dec preferred CPUs
From: Shrikanth Hegde
Date: Thu Jun 25 2026 - 08:51:21 EST
These methods will be used by the steal_monitor core in subsequent
patches. The default implementations are likely good enough for most archs.
decrease_preferred_cpus() - Called when there is high steal time. It needs
to decide which CPUs to mark as non-preferred and set that state.
increase_preferred_cpus() - Called when there is low steal time. It needs
to decide which CPUs to mark as preferred and set that state.
Default Implementations:
decrease_preferred_cpus()
- Get the last CPU in cpu_preferred_mask.
- Check if that last CPU belongs to the first housekeeping core. If so,
there is nothing to do. This keeps at least one core preferred, which
is the safe choice in abnormal cases.
- If it is not the first housekeeping core, get its siblings and mark them
as non-preferred. If they are nohz_full, enable the tick, since the push
mechanism relies on sched_tick.
increase_preferred_cpus()
- Get the first active non-preferred CPU. It likely belongs to the last
set of CPUs that was marked as non-preferred.
- If there is no such CPU, i.e. preferred is the same as active, there is
nothing further to do.
- If not, get the siblings of that core and mark them as preferred.
Note that clearing the tick isn't needed as that would be handled via
sched_can_stop_tick.
Using cores instead of individual CPUs gives better numbers, as SMT is
quite common and some hypervisors, such as PowerVM, do core scheduling.
Signed-off-by: Shrikanth Hegde <sshegde@xxxxxxxxxxxxx>
---
v4->v5:
- Modified for steal_monitor
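
For illustration only, the default policy described in the commit message
can be modelled in userspace with plain bitmasks. This is a minimal sketch,
not the kernel code: the sketch_* names, the 8-CPU limit and the fixed SMT2
topology are assumptions made for the example, and the empty-mask checks are
an addition of the sketch that the patch itself does not have.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_NR_CPUS		8	/* assumption: 8 CPUs */
#define SKETCH_THREADS_PER_CORE	2	/* assumption: SMT2 */

/* Hypothetical stand-ins for the active/preferred/housekeeping cpumasks */
struct sketch_masks {
	uint32_t active;
	uint32_t preferred;
	uint32_t housekeeping;
};

/* Sibling mask of @cpu, playing the role of get_core_mask() */
static uint32_t sketch_core_mask(int cpu)
{
	int first;

	assert(cpu >= 0 && cpu < SKETCH_NR_CPUS);
	first = cpu - (cpu % SKETCH_THREADS_PER_CORE);
	return ((1u << SKETCH_THREADS_PER_CORE) - 1u) << first;
}

/* Lowest set bit, or SKETCH_NR_CPUS if the mask is empty */
static int sketch_first(uint32_t mask)
{
	int cpu;

	for (cpu = 0; cpu < SKETCH_NR_CPUS; cpu++)
		if (mask & (1u << cpu))
			return cpu;
	return SKETCH_NR_CPUS;
}

/* Highest set bit, or SKETCH_NR_CPUS if the mask is empty */
static int sketch_last(uint32_t mask)
{
	int cpu;

	for (cpu = SKETCH_NR_CPUS - 1; cpu >= 0; cpu--)
		if (mask & (1u << cpu))
			return cpu;
	return SKETCH_NR_CPUS;
}

/* Drop the last preferred core unless it is the first housekeeping core */
void sketch_decrease_preferred(struct sketch_masks *m)
{
	int last_cpu = sketch_last(m->preferred);
	int first_hk_cpu = sketch_first(m->housekeeping & m->active);

	/* Empty-mask guard: an addition of this sketch */
	if (last_cpu >= SKETCH_NR_CPUS || first_hk_cpu >= SKETCH_NR_CPUS)
		return;

	if (sketch_core_mask(last_cpu) == sketch_core_mask(first_hk_cpu))
		return;

	m->preferred &= ~(sketch_core_mask(last_cpu) & m->active);
}

/* Add back the first core that is active but not preferred */
void sketch_increase_preferred(struct sketch_masks *m)
{
	int first_cpu = sketch_first(m->active & ~m->preferred);

	/* All active CPUs are already preferred */
	if (first_cpu >= SKETCH_NR_CPUS)
		return;

	m->preferred |= sketch_core_mask(first_cpu) & m->active;
}
```

With all eight CPUs active, preferred and housekeeping, repeated decreases
shrink preferred from 0xff down to 0x03 and then stop at the first
housekeeping core. Each increase then restores the most recently removed
core, since the first non-preferred core is the last one that was dropped.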
drivers/virt/steal_monitor/defaults.c | 68 +++++++++++++++++++++++++++
drivers/virt/steal_monitor/sm_core.h | 4 ++
2 files changed, 72 insertions(+)
diff --git a/drivers/virt/steal_monitor/defaults.c b/drivers/virt/steal_monitor/defaults.c
index 17f57afacbe6..90ede838491f 100644
--- a/drivers/virt/steal_monitor/defaults.c
+++ b/drivers/virt/steal_monitor/defaults.c
@@ -25,3 +25,71 @@ u64 __weak get_system_steal_time(void)
return total_steal;
}
+
+/*
+ * Default implementation of decrementing the preferred CPUs based on steal
+ * time. This is simple logic that decreases the preferred CPUs by 1 core.
+ * It takes out the last core in active & preferred.
+ *
+ * Ensure at least one housekeeping core is always kept as preferred.
+ *
+ * Could be overridden by arch-specific handling. Arch must ensure
+ * preferred is always a subset of active.
+ */
+
+#define get_core_mask(cpu) topology_sibling_cpumask(cpu)
+
+void __weak decrease_preferred_cpus(struct steal_monitor *ctx)
+{
+ int last_cpu, tmp_cpu;
+ int first_hk_cpu;
+
+ guard(cpus_read_lock)();
+
+ last_cpu = cpumask_last(cpu_preferred_mask);
+ first_hk_cpu = cpumask_first_and(housekeeping_cpumask(HK_TYPE_KERNEL_NOISE),
+ cpu_active_mask);
+ /*
+ * If the last preferred core is the first housekeeping core, no action
+ * is taken. This always leaves at least one core preferred, so some
+ * CPUs remain available to run tasks.
+ */
+ if (cpumask_equal(get_core_mask(last_cpu), get_core_mask(first_hk_cpu)))
+ return;
+
+ /*
+ * Set the tick bit on nohz_full CPUs to push tasks out. Once the tasks
+ * are pushed out, the bit will be cleared if there are no tasks left.
+ */
+
+ for_each_cpu_and(tmp_cpu, get_core_mask(last_cpu), cpu_active_mask) {
+ set_cpu_preferred(tmp_cpu, false);
+ if (tick_nohz_full_cpu(tmp_cpu))
+ tick_nohz_dep_set_cpu(tmp_cpu, TICK_DEP_BIT_SCHED);
+ }
+}
+
+/*
+ * Default implementation of incrementing preferred CPUs based on steal
+ * time. This is simple logic that increases the preferred CPUs by 1 core.
+ * It adds the first core in active & !preferred.
+ *
+ * Nothing to do if active == preferred.
+ *
+ * Could be overridden by arch-specific handling. Arch must ensure
+ * preferred is always a subset of active.
+ */
+void __weak increase_preferred_cpus(struct steal_monitor *ctx)
+{
+ int first_cpu, tmp_cpu;
+
+ guard(cpus_read_lock)();
+
+ first_cpu = cpumask_first_andnot(cpu_active_mask, cpu_preferred_mask);
+ /* All CPUs are preferred. Nothing to increase further */
+ if (first_cpu >= nr_cpu_ids)
+ return;
+
+ for_each_cpu_and(tmp_cpu, get_core_mask(first_cpu), cpu_active_mask)
+ set_cpu_preferred(tmp_cpu, true);
+}
diff --git a/drivers/virt/steal_monitor/sm_core.h b/drivers/virt/steal_monitor/sm_core.h
index e09745a2b813..1857d6a9a295 100644
--- a/drivers/virt/steal_monitor/sm_core.h
+++ b/drivers/virt/steal_monitor/sm_core.h
@@ -10,6 +10,8 @@
#include <linux/cpumask.h>
#include <linux/workqueue.h>
#include <linux/kernel_stat.h>
+#include <linux/tick.h>
+#include <linux/sched/isolation.h>
struct steal_monitor {
struct delayed_work work;
@@ -24,4 +26,6 @@ struct steal_monitor {
extern struct steal_monitor sm_core_ctx;
u64 get_system_steal_time(void);
+void increase_preferred_cpus(struct steal_monitor *ctx);
+void decrease_preferred_cpus(struct steal_monitor *ctx);
#endif /* __VIRT_STEAL_CORE_H */
--
2.47.3