Re: [PATCH v5 06/24] sched/core: allow only preferred CPUs in is_cpu_allowed

From: Shrikanth Hegde

Date: Tue Jun 30 2026 - 02:26:57 EST

Hi Yury, Prateek,

On 6/29/26 9:44 AM, Shrikanth Hegde wrote:

Hi Yury.

Just as said on previous round. Please order your series such that the
core logic goes first, and all sorts of complications, like this
optimization, are appended at the end.

Ok. I will split it up into two patches.

The first one without any optimization, but with a comment explaining the rare N**2 case.
The second one at the end of the series, adding the optimization.

---

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 9e16946c9d62..fafedd52611f 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2498,8 +2498,10 @@ static inline bool rq_has_pinned_tasks(struct rq *rq)
   * Per-CPU kthreads are allowed to run on !active && online CPUs, see
   * __set_cpus_allowed_ptr() and select_fallback_rq().
   */
-static inline bool is_cpu_allowed(struct task_struct *p, int cpu)
+static inline bool is_cpu_allowed(struct task_struct *p, int cpu, int cached)
{
+       bool task_check_preferred_cpu;
+
         /* When not in the task's cpumask, no point in looking further. */
         if (!task_allowed_on_cpu(p, cpu))
                 return false;
@@ -2508,9 +2510,24 @@ static inline bool is_cpu_allowed(struct task_struct *p, int cpu)
         if (is_migration_disabled(p))
                 return cpu_online(cpu);
+       /*
+        * This is essential to maintain user affinities when preferred
+        * CPUs change. A task pinned on non-preferred CPU should continue
+        * to run there, since this is non-user triggered.
+        *
+        * If CPU is non-preferred and task can run on other CPUs which are
+        * currently preferred, then choose those other CPUs instead.
+        * Overhead is minimal when CPU is preferred.
+        */
+       task_check_preferred_cpu = !cpu_preferred(cpu) &&
+                                  task_has_preferred_cpus(p, cached);
+
         /* Non kernel threads are not allowed during either online or offline. */
-       if (!(p->flags & PF_KTHREAD))
+       if (!(p->flags & PF_KTHREAD)) {
+               if (task_check_preferred_cpu)
+                       return false;
                 return cpu_active(cpu);
+       }
         /* KTHREAD_IS_PER_CPU is always allowed. */
         if (kthread_is_per_cpu(p))
@@ -2520,6 +2537,10 @@ static inline bool is_cpu_allowed(struct task_struct *p, int cpu)
         if (cpu_dying(cpu))
                 return false;
+       /* Try on preferred CPU first if possible*/
+       if (task_check_preferred_cpu)
+               return false;
+
         /* But are allowed during online. */
         return cpu_online(cpu);
}
@@ -2595,7 +2616,7 @@ static struct rq *__migrate_task(struct rq *rq, struct rq_flags *rf,
         __must_hold(__rq_lockp(rq))
{
         /* Affinity changed (again). */
-       if (!is_cpu_allowed(p, dest_cpu))
+       if (!is_cpu_allowed(p, dest_cpu, NO_CACHED_VAL))
                 return rq;

This thing I really dislike. The unrelated code should not be
affected. You can make it less visually invasive with:
         #define is_cpu_allowed(p, cpu) __is_cpu_allowed(p, cpu, NO_CACHED_VAL)

Please reconsider your code to have the changes better localized.

Thanks,
Yury

That was typed out too fast. I did refactor it to something like that later.
But I will split this into two patches, as said above.

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 9e16946c9d62..a1b21c21aa9c 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2498,8 +2498,11 @@ static inline bool rq_has_pinned_tasks(struct rq *rq)
* Per-CPU kthreads are allowed to run on !active && online CPUs, see
* __set_cpus_allowed_ptr() and select_fallback_rq().
*/
-static inline bool is_cpu_allowed(struct task_struct *p, int cpu)
+static inline bool __is_cpu_allowed(struct task_struct *p, int cpu,
+                                   int pref_state)
{
+       bool task_check_preferred_cpu;
+
        /* When not in the task's cpumask, no point in looking further. */
        if (!task_allowed_on_cpu(p, cpu))
                return false;
@@ -2508,9 +2511,24 @@ static inline bool is_cpu_allowed(struct task_struct *p, int cpu)
        if (is_migration_disabled(p))
                return cpu_online(cpu);

+       /*
+        * This is essential to maintain user affinities when preferred
+        * CPUs change. A task pinned on non-preferred CPU should continue
+        * to run there, since this is non-user triggered.
+        *
+        * If CPU is non-preferred and task can run on other CPUs which are
+        * currently preferred, then choose those other CPUs instead.
+        * Overhead is minimal when CPU is preferred.
+        */
+       task_check_preferred_cpu = !cpu_preferred(cpu) &&
+                                  task_has_preferred_cpus(p, pref_state);
+
        /* Non kernel threads are not allowed during either online or offline. */
-       if (!(p->flags & PF_KTHREAD))
+       if (!(p->flags & PF_KTHREAD)) {
+               if (task_check_preferred_cpu)
+                       return false;
                return cpu_active(cpu);
+       }

        /* KTHREAD_IS_PER_CPU is always allowed. */
        if (kthread_is_per_cpu(p))
@@ -2520,10 +2538,19 @@ static inline bool is_cpu_allowed(struct task_struct *p, int cpu)
        if (cpu_dying(cpu))
                return false;

+       /* Try on preferred CPU first if possible*/
+       if (task_check_preferred_cpu)
+               return false;
+
        /* But are allowed during online. */
        return cpu_online(cpu);
}

+static inline bool is_cpu_allowed(struct task_struct *p, int cpu)
+{
+       return __is_cpu_allowed(p, cpu, PREFERRED_CPU_UNKNOWN);
+}
+
/*
* This is how migration works:
*
@@ -3547,7 +3574,15 @@ static int select_fallback_rq(int cpu, struct task_struct *p)
        int nid = cpu_to_node(cpu);
        const struct cpumask *nodemask = NULL;
        enum { cpuset, possible, fail } state = cpuset;
-       int dest_cpu;
+       int dest_cpu, pref_state;
+
+       /*
+        * Cache the value whether task's affinity spans preferred CPUs.
+        * This helps to avoid repeating the same for each CPU
+        * later in the loop.
+        */
+       pref_state = task_has_preferred_cpus(p, PREFERRED_CPU_UNKNOWN) ?
+                       PREFERRED_CPU_EXISTS : PREFERRED_CPU_NONE;

        /*
         * If the node that the CPU is on has been offlined, cpu_to_node()
@@ -3559,7 +3594,7 @@ static int select_fallback_rq(int cpu, struct task_struct *p)

                /* Look for allowed, online CPU in same node. */
                for_each_cpu(dest_cpu, nodemask) {
-                       if (is_cpu_allowed(p, dest_cpu))
+                       if (__is_cpu_allowed(p, dest_cpu, pref_state))
                                return dest_cpu;
                }
        }
@@ -3567,7 +3602,7 @@ static int select_fallback_rq(int cpu, struct task_struct *p)
        for (;;) {
                /* Any allowed, online CPU? */
                for_each_cpu(dest_cpu, p->cpus_ptr) {
-                       if (!is_cpu_allowed(p, dest_cpu))
+                       if (!__is_cpu_allowed(p, dest_cpu, pref_state))
                                continue;

                        goto out;
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index c7c2dea65edd..6a352d235503 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -4213,4 +4213,33 @@ DEFINE_CLASS_IS_UNCONDITIONAL(sched_change)

#include "ext.h"

+/*
+ * PREFERRED_CPU_UNKNOWN: Unknown and need to evaluate.
+ * PREFERRED_CPU_NONE : Known and No preferred CPUs exists in task's affinity.
+ * PREFERRED_CPU_EXISTS: Known and preferred CPU exists in task's affinity.
+ */
+
+enum task_preferred_cached {
+       PREFERRED_CPU_UNKNOWN,
+       PREFERRED_CPU_NONE,
+       PREFERRED_CPU_EXISTS,
+};
+
+/*
+ * Value is known when called via select_fallback_rq(). This helps to
+ * avoid calling cpumask_intersects repeatedly in the loop.
+ *
+ * Only affects FAIR task.
+ */
+static inline bool task_has_preferred_cpus(struct task_struct *p, int pref_state)
+{
+       /* Only FAIR tasks honor preferred CPU state */
+       if (unlikely(p->sched_class != &fair_sched_class))
+               return false;
+
+       if (pref_state != PREFERRED_CPU_UNKNOWN)
+               return pref_state == PREFERRED_CPU_EXISTS;
+
+       return cpumask_intersects(p->cpus_ptr, cpu_preferred_mask);
+}
#endif /* _KERNEL_SCHED_SCHED_H */

I was thinking that caching the value could cause the task's affinity to be
reset, because nothing protects against the mask changing within the loop of
select_fallback_rq(). So I did some testing, and I hit this case in practice,
using a 10ms interval, explicit affinities overlapping preferred and
non-preferred CPUs, and hotplugging specific CPUs. Caching exposes this race.

Let's say the task is affined to CPUs 464-479 and the preferred mask is 0-471 (472-479 are non-preferred).

CPU468                                                  CPU0
select_fallback_rq
 - pref_state = PREFERRED_CPU_EXISTS
   (Now preferred_mask becomes 0-463, but               (Changes preferred to 0-463)
    before any further call to is_cpu_allowed
    is made)
 - is_cpu_allowed tries to find a preferred
   CPU since cached state says one exists.
 - is_cpu_allowed is called twice (once on nodemask,
   and once on p->cpus_ptr) but cached state remains the
   same.
 - no CPU found, fallback to reset to possible CPUs.

Without the cached state, the check is re-evaluated on each !preferred CPU,
so such a race isn't possible between two calls. So the chance of a race
is extremely small, if not non-existent.
I couldn't hit the same race in any of the permutations I tried.

Even if we take the case where the task is affined to only one CPU and the mask changes between the two operands of &&:
!cpu_preferred(cpu) && task_has_preferred_cpus(p);

Two cases.
Case 1: The CPU was marked as preferred, and after the && it got removed from preferred_mask.
In that case the task may end up on a non-preferred CPU, and it gets pushed out if possible.
No reset of its affinity.

Case 2: The CPU was non-preferred and became preferred after the &&. Now cpumask_intersects
will be true and task_has_preferred_cpus is true as well, so this CPU will be skipped.
But the second call in select_fallback_rq will ensure it returns this CPU, since the cpu_preferred
check will succeed. Since select_fallback_rq is called on task_cpu(p), where the task previously ran,
the first check may fail due to the race, but the second one can't, since the time between the two
evaluations can't be more than 1ms.

* So I will drop this optimization of caching the state for now *