Re: [PATCH v4 6/7] sched/fair: skip busy cores in SIS search

From: Abel Wu
Date: Mon Jun 27 2022 - 06:13:33 EST



On 6/24/22 11:30 AM, Chen Yu Wrote:
...
@@ -9273,8 +9319,40 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, int this_cpu)
static void sd_update_state(struct lb_env *env, struct sd_lb_stats *sds)
{
- if (sds->sd_state == sd_has_icpus && !test_idle_cpus(env->dst_cpu))
- set_idle_cpus(env->dst_cpu, true);
+ struct sched_domain_shared *sd_smt_shared = env->sd->shared;
+ enum sd_state new = sds->sd_state;
+ int this = env->dst_cpu;
+
+ /*
+ * Parallel updating can hardly contribute accuracy to
+ * the filter, besides it can be one of the burdens on
+ * cache traffic.
+ */
+ if (cmpxchg(&sd_smt_shared->updating, 0, 1))
+ return;
+
+ /*
+ * There is at least one unoccupied cpu available, so
+ * propagate it to the filter to avoid false negative
+ * issue which could result in lost tracking of some
+ * idle cpus thus throughupt downgraded.
+ */
+ if (new != sd_is_busy) {
+ if (!test_idle_cpus(this))
+ set_idle_cpus(this, true);
+ } else {
+ /*
+ * Nothing changes so nothing to update or
+ * propagate.
+ */
+ if (sd_smt_shared->state == sd_is_busy)
+ goto out;
+ }
+
+ sd_update_icpus(this, sds->idle_cpu);
I wonder if we could further enhance it to facilitate idle CPU scan.
For example, can we propagate the idle CPUs in smt domain, to its parent
domain in a hierarchic sequence, and finally to the LLC domain. If there is

In fact, it was my first try to cache the unoccupied cpus in SMT
shared domain, but the overhead of cpumask ops seems like a major
stumbling block.

a cluster domain between SMT and LLC domain, the cluster domain idle CPU filter
could benefit from this mechanism.
https://lore.kernel.org/lkml/20220609120622.47724-3-yangyicong@xxxxxxxxxxxxx/

Putting SIS into a hierarchical pattern is good for cache locality.
But I don't think multi-level filter is appropriate since it could
bring too much cache traffic in SIS,
Could you please elaborate a little more about the cache traffic? I thought we
don't save the unoccupied cpus in SMT shared domain, but to store it in middle
layer shared domain, say, cluster->idle_cpus, this would reduce cache write
contention compared to writing to llc->idle_cpus directly, because a smaller
set of CPUs share the idle_cpus filter. Similarly, SIS can only scan the cluster->idle_cpus
first, without having to query the llc->idle_cpus. This looks like splitting
a big lock into fine grain small lock.

I'm afraid I didn't quite follow.. Did you mean replace the LLC filter
with multiple cluster filters? Then I agree with what you suggested
that the contention would be reduced. But there are other concerns:

a. Is it appropriate to fake an intermediate sched_domain if
cluster level doesn't available? How to identify the proper
size of the faked sched_domain?

b. The SIS path might touch more cachelines (multiple cluster
filters). I'm not sure how much is the impact.

Whatever, this seems worth a try. :)

and it could be expected to be
a disaster for netperf/tbench or the workloads suffering frequent
context switches.

So this overhead comes from the NEWLY_IDLE case?


Yes, I think it's the main cause to rise the contention to new heights.
But it's also important to keep the filter fresh.

Thanks & BR,
Abel