Re: Re: [RFC PATCH 2/2] sched/fair: Do not specialcase SCHED_IDLE cpus in select slowpath
From: Abel Wu
Date: Tue Mar 11 2025 - 12:43:55 EST
Hi Josh,
On 3/11/25 6:38 AM, Josh Don wrote:
> Thanks Abel,
>
> > @@ -7481,12 +7481,13 @@ sched_balance_find_dst_group_cpu(struct sched_group *group, struct task_struct *
> >  				latest_idle_timestamp = rq->idle_stamp;
> >  				shallowest_idle_cpu = i;
> >  			}
> > -		} else if (shallowest_idle_cpu == -1 && si_cpu == -1) {
> > -			if (sched_idle_cpu(i)) {
> > -				si_cpu = i;
> > -				continue;
> > -			}
> > -
> > +		} else if (shallowest_idle_cpu == -1) {
> > +			/*
> > +			 * The SCHED_IDLE cpus do not necessarily mean anything
> > +			 * to @p due to the cgroup hierarchical behavior. But it
> > +			 * is almost certain that the wakee will be better served
> > +			 * if the cpu is less loaded.
> > +			 */
> >  			load = cpu_load(cpu_rq(i));
> >  			if (load < min_load) {
> >  				min_load = load;
> This seems reasonable due to the case you describe. However, I'm
> wondering if you considered any heuristics here to help identify when
> a target cpu should really be considered sched_idle from the
> perspective of the incoming task. For example, in your cgroup
> hierarchy, if you have a cpu currently only running tasks in your
> besteffort container (and all cpus in the system are busy running
> something), then that cpu should be considered as a good target for a
> waking task in the "guaranteed" container, and not a good target for
> a waking task in the "containerd" container.

Yes, it actually cost me several days of trying hard to figure out
whether there is a proper way to achieve what you're suggesting.

Solution A
----------
Mark all the hierarchically idle task_groups by assigning each a
unique prime, and define a tree walk that starts from a given node @n
and goes up to the root, collecting all the preemptable nodes.

struct task_group {
	/*
	 * Set to a unique prime if at least 1 ancestor is idle,
	 * otherwise set to 1.
	 */
	u64	prime;
	u64	mul;
};

/* Called by sched_group_set_idle() */
static void calculate_mul(struct task_group *tg)
{
	struct task_group *curr, *iter;

	lockdep_assert_held(&shares_mutex);

	for (curr = tg; curr->parent; curr = curr->parent) {
		list_for_each_entry(iter, &curr->parent->children, siblings) {
			/* Can't preempt non-idle siblings */
			if (!iter->idle)
				continue;

			/*
			 * For each node in the subtree rooted at @iter do:
			 *   tg->mul *= node->prime;
			 */
			traverse_subtree(tg, iter);
		}
	}
}

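(traverse_subtree() above is only pseudocode; for Solution A it could
be as simple as the recursion sketched below, and Solution B would set
filter bits instead. In real kernel code the iterative
walk_tg_tree_from() would probably be preferred over recursion, this
is just to show the intent.)

/* Hypothetical helper: fold every prime under @root into @tg->mul */
static void traverse_subtree(struct task_group *tg, struct task_group *root)
{
	struct task_group *child;

	tg->mul *= root->prime;

	list_for_each_entry(child, &root->children, siblings)
		traverse_subtree(tg, child);
}
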
int sched_idle_cpu(int cpu, struct task_struct *p)
{
	/* rq::cached_prime caches rq->donor's tg::prime */
	u64 prime = cpu_rq(cpu)->cached_prime;

	return prime > 1 && p->sched_task_group->mul % prime == 0;
}

With this we could even drop find_matching_se(), since it becomes
quite easy to tell whether or not one sched_entity can preempt
another. But sadly task_group::mul will quickly overflow.

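To put a number on "quickly": even if the smallest possible primes
are handed out, a u64 product overflows once roughly 16 of them are
multiplied together, so tg::mul can encode at most ~15 preemptable
groups, and fewer in practice since the primes grow as more groups
are marked idle. A quick userspace sanity check (nothing
kernel-specific here):

#include <stdint.h>
#include <stdio.h>

int main(void)
{
	/* The first 18 primes, more than enough to overflow a u64 product */
	static const uint64_t primes[] = {
		2, 3, 5, 7, 11, 13, 17, 19, 23,
		29, 31, 37, 41, 43, 47, 53, 59, 61
	};
	uint64_t mul = 1;
	unsigned int i;

	for (i = 0; i < sizeof(primes) / sizeof(primes[0]); i++) {
		if (mul > UINT64_MAX / primes[i]) {
			/* Prints "u64 overflows after 15 primes" */
			printf("u64 overflows after %u primes\n", i);
			return 0;
		}
		mul *= primes[i];
	}

	return 0;
}
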
Solution B
----------
Since we do not require 100% accuracy, solution A can be shifted to
using Bloom filters.

struct task_group {
	u64	key;	/* A random number used for hashing */
	u64	filter;
};

/* Called by sched_group_set_idle() */
static void calculate_filter(struct task_group *tg)
{
	struct task_group *curr, *iter;

	lockdep_assert_held(&shares_mutex);

	for (curr = tg; curr->parent; curr = curr->parent) {
		list_for_each_entry(iter, &curr->parent->children, siblings) {
			/* Can't preempt non-idle siblings */
			if (!iter->idle)
				continue;

			/*
			 * For each node in the subtree rooted at @iter do:
			 *   set_bit(bloom_hash(node->key), &tg->filter);
			 */
			traverse_subtree(tg, iter);
		}
	}
}

int sched_idle_cpu(int cpu, struct task_struct *p)
{
	/* rq::cached_filter caches the hashed bits of rq->donor's tg::key */
	u64 filter = p->sched_task_group->filter;
	u64 cached = cpu_rq(cpu)->cached_filter;

	return cached && (filter & cached) == cached;
}

False positives are possible, but the probability can be reduced by
tuning the Bloom filter setup.

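For completeness, here is a minimal sketch of what the blooming setup
could look like (BLOOM_HASHES, bloom_bits_of() and the constants are
all made up): each task_group hashes its random key into a couple of
bit positions, calculate_filter() ORs those bits into tg::filter for
every preemptable node, and rq::cached_filter would hold the donor
tg's own bits so that the subset test above works. With k hash bits
per group and n preemptable groups folded into a 64-bit word, the
false-positive rate is roughly (1 - e^(-k*n/64))^k, which is the knob
the setup would need to tune.

#include <linux/bits.h>
#include <linux/hash.h>

#define BLOOM_HASHES	2

/* Hypothetical: hash a tg's random key into BLOOM_HASHES bits of a u64 */
static u64 bloom_bits_of(u64 key)
{
	u64 bits = 0;
	int i;

	/* hash_64(..., 6) yields a 6-bit value, i.e. a bit position 0..63 */
	for (i = 0; i < BLOOM_HASHES; i++)
		bits |= BIT_ULL(hash_64(key + i, 6));

	return bits;
}

/* calculate_filter() would then do: tg->filter |= bloom_bits_of(node->key) */
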
I chose the simplest way for now to work around the issue we
encountered, while I am still trying to find a way to get rid of
sched_idle_cpu().
Thoughts?

> A simple way to do this would be to do a find_matching_se on the
> incoming task and the current on_cpu task. That does have a drawback
> of cgroup pointer chasing in a hot wakeup path, so I don't love it.

Neither do I. But if we go through find_matching_se(), I think we
would need that rq locked.
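
To make the locking point concrete, below is roughly the shape it
would take (everything here is hypothetical). Note that in this
slowpath @p has not been enqueued on the target rq yet, so its se
hierarchy still points at another CPU's cfs_rqs and cannot be fed to
find_matching_se() as-is; the sketch walks the task_groups instead.
Either way rq->curr has to be pinned, e.g. by holding that rq's lock,
which is the cost I would rather not pay on wakeup.

static int tg_depth(struct task_group *tg)
{
	int depth = 0;

	while (tg->parent) {
		depth++;
		tg = tg->parent;
	}

	return depth;
}

/*
 * Is @curr hierarchically idle from @p's point of view?  One-sided:
 * unlike check_preempt_wakeup_fair() it ignores whether @p's own
 * branch is idle as well.  Caller must pin @curr (rq lock held).
 */
static bool task_idle_wrt(struct task_struct *p, struct task_struct *curr)
{
	struct task_group *a = task_group(p);
	struct task_group *b = task_group(curr);
	int da = tg_depth(a), db = tg_depth(b);

	/* Bring both groups to the same depth first */
	while (da > db) {
		a = a->parent;
		da--;
	}
	while (db > da) {
		b = b->parent;
		db--;
	}

	/* Walk up in lockstep until both branches share a parent */
	while (a != b && a->parent != b->parent) {
		a = a->parent;
		b = b->parent;
	}

	/* Only @curr's branch at the divergence point matters */
	return a != b && b->idle;
}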

> In any case, I'm fine with the change as-is, mostly curious if you
> gave any additional thought to the case mentioned above.

Thanks!
Abel