Re: sched: tweak select_idle_sibling to look for idle threads

From: Peter Zijlstra
Date: Wed May 04 2016 - 06:37:14 EST

Next message: Vineet Gupta: "kmap_atomic and preemption"
Previous message: Borislav Petkov: "Re: [BUG] x86/efi: MMRs no longer properly mapped after switch to isolated page table"
In reply to: Chris Mason: "Re: sched: tweak select_idle_sibling to look for idle threads"
Next in thread: Peter Zijlstra: "Re: sched: tweak select_idle_sibling to look for idle threads"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

On Tue, May 03, 2016 at 11:11:53AM -0400, Chris Mason wrote:

> > + if (cpu_rq(cpu)->cfs.nr_running > 1)
> > + return 1;
>
> The nr_running check is interesting. It is supposed to give the same
> benefit as your "do we have anything idle?" variable, but without having
> to constantly update a variable somewhere. I'll have to do a few runs
> to verify (maybe a idle_scan_failed counter).

Right; I got that.. I tried it and it doesn't seem to work as well. But
yeah, its better than nothing.

The reason I'm not too worried about the update_idle_core() thing is
that SMT threads share L1 so touching their sibling state isn't
typically expensive.

But yes, I'd need to find someone with SMT8 or something daft like that
to try this.

> The task_hot checks don't do much for the sleeping schbench runs, but
> they help a lot for this:
>
> # pick a single core, in my case cpus 0,20 are the same core
> # cpu_hog is any program that spins
> #
> taskset -c 20 cpu_hog &
>
> # schbench -p 4 means message passing mode with 4 byte messages (like
> # pipe test), no sleeps, just bouncing as fast as it can.
> #
> # make the scheduler choose between the sibling of the hog and cpu 1
> #
> taskset -c 0,1 schbench -p 4 -m 1 -t 1
>
> Current mainline will stuff both schbench threads onto CPU 1, leaving
> CPU 0 100% idle. My first patch with the minimal task_hot() checks
> would sometimes pick CPU 0. My second patch that just directly calls
> task_hot sticks to cpu1, which is ~3x faster than spreading it.

Urgh, another benchmark to play with ;-)

> The full task_hot() checks also really help tbench.

tbench wants select_idle_siblings() to just not exist; it goes happy
when you just return target.

tbench:

old (mainline like):

Throughput 875.822 MB/sec 2 clients 2 procs max_latency=0.117 ms
Throughput 2017.57 MB/sec 5 clients 5 procs max_latency=0.057 ms
Throughput 3954.66 MB/sec 10 clients 10 procs max_latency=0.094 ms
Throughput 5886.11 MB/sec 20 clients 20 procs max_latency=0.088 ms
Throughput 9095.57 MB/sec 40 clients 40 procs max_latency=0.864 ms

new:

Throughput 876.794 MB/sec 2 clients 2 procs max_latency=0.102 ms
Throughput 2048.73 MB/sec 5 clients 5 procs max_latency=0.095 ms
Throughput 3802.69 MB/sec 10 clients 10 procs max_latency=0.113 ms
Throughput 5521.81 MB/sec 20 clients 20 procs max_latency=0.091 ms
Throughput 10331.8 MB/sec 40 clients 40 procs max_latency=0.444 ms

nothing:

Throughput 759.532 MB/sec 2 clients 2 procs max_latency=0.210 ms
Throughput 1884.01 MB/sec 5 clients 5 procs max_latency=0.094 ms
Throughput 3931.31 MB/sec 10 clients 10 procs max_latency=0.091 ms
Throughput 6478.81 MB/sec 20 clients 20 procs max_latency=0.110 ms
Throughput 10001 MB/sec 40 clients 40 procs max_latency=0.148 ms

See the 20 client have a happy moment ;-) [ivb-ep: 2*10*2]

I've not quite figured out how to make the new bits switch off aggressive
enough to make tbench happy without hurting the others. More numbers:

sysbench-oltp-psql:

old (mainline like):

2: [30 secs] transactions: 53556 (1785.19 per sec.)
5: [30 secs] transactions: 118957 (3965.08 per sec.)
10: [30 secs] transactions: 241126 (8037.22 per sec.)
20: [30 secs] transactions: 383256 (12774.63 per sec.)
40: [30 secs] transactions: 539705 (17989.05 per sec.)
80: [30 secs] transactions: 541833 (18059.16 per sec.)

new:

2: [30 secs] transactions: 53012 (1767.03 per sec.)
5: [30 secs] transactions: 122057 (4068.49 per sec.)
10: [30 secs] transactions: 235781 (7859.09 per sec.)
20: [30 secs] transactions: 355967 (11864.99 per sec.)
40: [30 secs] transactions: 537327 (17909.80 per sec.)
80: [30 secs] transactions: 546017 (18198.82 per sec.)

schbench -m2 -t 20 -c 30000 -s 30000 -r 30:

old (mainline like):

Latency percentiles (usec)
50.0000th: 102
75.0000th: 109
90.0000th: 115
95.0000th: 118
*99.0000th: 5352
99.5000th: 12112
99.9000th: 13008
Over=0, min=0, max=27238

new:

Latency percentiles (usec)
50.0000th: 103
75.0000th: 109
90.0000th: 114
95.0000th: 116
*99.0000th: 120
99.5000th: 121
99.9000th: 124
Over=0, min=0, max=12939

---
include/linux/sched.h | 2 +
kernel/sched/core.c | 3 +
kernel/sched/fair.c | 267 +++++++++++++++++++++++++++++++++++++++--------
kernel/sched/features.h | 9 ++
kernel/sched/idle_task.c | 4 +-
kernel/sched/sched.h | 1 +
kernel/time/tick-sched.c | 10 +-
7 files changed, 246 insertions(+), 50 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index ad9454d..e7ce1a0 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1068,6 +1068,8 @@ struct sched_domain {
u64 max_newidle_lb_cost;
unsigned long next_decay_max_lb_cost;

+ u64 avg_scan_cost; /* select_idle_sibling */
+
#ifdef CONFIG_SCHEDSTATS
/* load_balance() stats */
unsigned int lb_count[CPU_MAX_IDLE_TYPES];
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index c82ca6e..280e73e 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -7278,6 +7278,7 @@ static struct kmem_cache *task_group_cache __read_mostly;
#endif

DECLARE_PER_CPU(cpumask_var_t, load_balance_mask);
+DECLARE_PER_CPU(cpumask_var_t, select_idle_mask);

void __init sched_init(void)
{
@@ -7314,6 +7315,8 @@ void __init sched_init(void)
for_each_possible_cpu(i) {
per_cpu(load_balance_mask, i) = (cpumask_var_t)kzalloc_node(
cpumask_size(), GFP_KERNEL, cpu_to_node(i));
+ per_cpu(select_idle_mask, i) = (cpumask_var_t)kzalloc_node(
+ cpumask_size(), GFP_KERNEL, cpu_to_node(i));
}
#endif /* CONFIG_CPUMASK_OFFSTACK */

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index b8a33ab..9290fc8 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1501,8 +1501,10 @@ balance:
* One idle CPU per node is evaluated for a task numa move.
* Call select_idle_sibling to maybe find a better one.
*/
- if (!cur)
+ if (!cur) {
+ // XXX borken
env->dst_cpu = select_idle_sibling(env->p, env->dst_cpu);
+ }

assign:
assigned = true;
@@ -4491,6 +4493,11 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
}

#ifdef CONFIG_SMP
+
+/* Working cpumask for: load_balance, load_balance_newidle. */
+DEFINE_PER_CPU(cpumask_var_t, load_balance_mask);
+DEFINE_PER_CPU(cpumask_var_t, select_idle_mask);
+
#ifdef CONFIG_NO_HZ_COMMON
/*
* per rq 'load' arrray crap; XXX kill this.
@@ -5162,65 +5169,240 @@ find_idlest_cpu(struct sched_group *group, struct task_struct *p, int this_cpu)
return shallowest_idle_cpu != -1 ? shallowest_idle_cpu : least_loaded_cpu;
}

+static int cpumask_next_wrap(int n, const struct cpumask *mask, int start, int *wrapped)
+{
+ int next;
+
+again:
+ next = find_next_bit(cpumask_bits(mask), nr_cpumask_bits, n+1);
+
+ if (*wrapped) {
+ if (next >= start)
+ return nr_cpumask_bits;
+ } else {
+ if (next >= nr_cpumask_bits) {
+ *wrapped = 1;
+ n = -1;
+ goto again;
+ }
+ }
+
+ return next;
+}
+
+#define for_each_cpu_wrap(cpu, mask, start, wrap) \
+ for ((wrap) = 0, (cpu) = (start)-1; \
+ (cpu) = cpumask_next_wrap((cpu), (mask), (start), &(wrap)), \
+ (cpu) < nr_cpumask_bits; )
+
+#ifdef CONFIG_SCHED_SMT
+
+static inline void clear_idle_cores(int cpu)
+{
+ struct sched_domain *sd = rcu_dereference(per_cpu(sd_busy, cpu));
+ if (!sd)
+ return;
+
+ WRITE_ONCE(sd->groups->sgc->has_idle_cores, 0);
+}
+
+static inline void set_idle_cores(int cpu)
+{
+ struct sched_domain *sd = rcu_dereference(per_cpu(sd_busy, cpu));
+ if (!sd)
+ return;
+
+ WRITE_ONCE(sd->groups->sgc->has_idle_cores, 1);
+}
+
+static inline bool test_idle_cores(int cpu)
+{
+ struct sched_domain *sd = rcu_dereference(per_cpu(sd_busy, cpu));
+ if (!sd)
+ return false;
+
+ // XXX static key for !SMT topologies
+
+ return READ_ONCE(sd->groups->sgc->has_idle_cores);
+}
+
+void update_idle_core(struct rq *rq)
+{
+ int core = cpu_of(rq);
+ int cpu;
+
+ rcu_read_lock();
+ if (test_idle_cores(core))
+ goto unlock;
+
+ for_each_cpu(cpu, cpu_smt_mask(core)) {
+ if (cpu == core)
+ continue;
+
+ if (!idle_cpu(cpu))
+ goto unlock;
+ }
+
+ set_idle_cores(core);
+unlock:
+ rcu_read_unlock();
+}
+
+static int select_idle_core(struct task_struct *p, struct sched_domain *sd, int target)
+{
+ struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_idle_mask);
+ int core, cpu, wrap;
+
+ if (!test_idle_cores(target))
+ return -1;
+
+ cpumask_and(cpus, sched_domain_span(sd), tsk_cpus_allowed(p));
+
+ for_each_cpu_wrap(core, cpus, target, wrap) {
+ bool idle = true;
+
+ for_each_cpu(cpu, cpu_smt_mask(core)) {
+ cpumask_clear_cpu(cpu, cpus);
+ if (!idle_cpu(cpu))
+ idle = false;
+ }
+
+ if (idle)
+ return core;
+ }
+
+ /*
+ * Failed to find an idle core; stop looking for one.
+ */
+ clear_idle_cores(target);
+
+ return -1;
+}
+
+#else /* CONFIG_SCHED_SMT */
+
+void update_idle_core(struct rq *rq) { }
+
+static inline int select_idle_core(struct task_struct *p, struct sched_domain *sd, int target)
+{
+ return -1;
+}
+
+#endif /* CONFIG_SCHED_SMT */
+
+static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, int target)
+{
+ struct sched_domain *this_sd = rcu_dereference(*this_cpu_ptr(&sd_llc));
+ u64 time, cost;
+ s64 delta;
+ int cpu, wrap;
+
+ if (sched_feat(AVG_CPU)) {
+ u64 avg_idle = this_rq()->avg_idle;
+ u64 avg_cost = this_sd->avg_scan_cost;
+
+ if (sched_feat(PRINT_AVG))
+ trace_printk("idle: %Ld cost: %Ld\n", avg_idle, avg_cost);
+
+ if (avg_idle / 32 < avg_cost)
+ return -1;
+ }
+
+ time = local_clock();
+
+ for_each_cpu_wrap(cpu, sched_domain_span(sd), target, wrap) {
+ if (!cpumask_test_cpu(cpu, tsk_cpus_allowed(p)))
+ continue;
+ if (idle_cpu(cpu))
+ break;
+ }
+
+ time = local_clock() - time;
+ cost = this_sd->avg_scan_cost;
+ delta = (s64)(time - cost) / 8;
+ /* trace_printk("time: %Ld cost: %Ld delta: %Ld\n", time, cost, delta); */
+ this_sd->avg_scan_cost += delta;
+
+ return cpu;
+}
+
/*
- * Try and locate an idle CPU in the sched_domain.
+ * Try and locate an idle core/thread in the LLC cache domain.
*/
static int select_idle_sibling(struct task_struct *p, int target)
{
struct sched_domain *sd;
- struct sched_group *sg;
- int i = task_cpu(p);
+ int start, i = task_cpu(p);

if (idle_cpu(target))
return target;

/*
- * If the prevous cpu is cache affine and idle, don't be stupid.
+ * If the previous cpu is cache affine and idle, don't be stupid.
*/
if (i != target && cpus_share_cache(i, target) && idle_cpu(i))
return i;

- /*
- * Otherwise, iterate the domains and find an eligible idle cpu.
- *
- * A completely idle sched group at higher domains is more
- * desirable than an idle group at a lower level, because lower
- * domains have smaller groups and usually share hardware
- * resources which causes tasks to contend on them, e.g. x86
- * hyperthread siblings in the lowest domain (SMT) can contend
- * on the shared cpu pipeline.
- *
- * However, while we prefer idle groups at higher domains
- * finding an idle cpu at the lowest domain is still better than
- * returning 'target', which we've already established, isn't
- * idle.
- */
- sd = rcu_dereference(per_cpu(sd_llc, target));
- for_each_lower_domain(sd) {
- sg = sd->groups;
- do {
- if (!cpumask_intersects(sched_group_cpus(sg),
- tsk_cpus_allowed(p)))
- goto next;
+ start = target;
+ if (sched_feat(ORDER_IDLE))
+ start = per_cpu(sd_llc_id, target); /* first cpu in llc domain */

- /* Ensure the entire group is idle */
- for_each_cpu(i, sched_group_cpus(sg)) {
- if (i == target || !idle_cpu(i))
+ sd = rcu_dereference(per_cpu(sd_llc, start));
+ if (!sd)
+ return target;
+
+ if (sched_feat(OLD_IDLE)) {
+ struct sched_group *sg;
+
+ for_each_lower_domain(sd) {
+ sg = sd->groups;
+ do {
+ if (!cpumask_intersects(sched_group_cpus(sg),
+ tsk_cpus_allowed(p)))
goto next;
- }

- /*
- * It doesn't matter which cpu we pick, the
- * whole group is idle.
- */
- target = cpumask_first_and(sched_group_cpus(sg),
- tsk_cpus_allowed(p));
- goto done;
+ /* Ensure the entire group is idle */
+ for_each_cpu(i, sched_group_cpus(sg)) {
+ if (i == target || !idle_cpu(i))
+ goto next;
+ }
+
+ /*
+ * It doesn't matter which cpu we pick, the
+ * whole group is idle.
+ */
+ target = cpumask_first_and(sched_group_cpus(sg),
+ tsk_cpus_allowed(p));
+ goto done;
next:
- sg = sg->next;
- } while (sg != sd->groups);
- }
+ sg = sg->next;
+ } while (sg != sd->groups);
+ }
done:
+ return target;
+ }
+
+ if (sched_feat(IDLE_CORE)) {
+ i = select_idle_core(p, sd, start);
+ if ((unsigned)i < nr_cpumask_bits)
+ return i;
+ }
+
+ if (sched_feat(IDLE_CPU)) {
+ i = select_idle_cpu(p, sd, start);
+ if ((unsigned)i < nr_cpumask_bits)
+ return i;
+ }
+
+ if (sched_feat(IDLE_SMT)) {
+ for_each_cpu(i, cpu_smt_mask(target)) {
+ if (!cpumask_test_cpu(i, tsk_cpus_allowed(p)))
+ continue;
+ if (idle_cpu(i))
+ return i;
+ }
+ }
+
return target;
}

@@ -7229,9 +7411,6 @@ static struct rq *find_busiest_queue(struct lb_env *env,
*/
#define MAX_PINNED_INTERVAL 512

-/* Working cpumask for load_balance and load_balance_newidle. */
-DEFINE_PER_CPU(cpumask_var_t, load_balance_mask);
-
static int need_active_balance(struct lb_env *env)
{
struct sched_domain *sd = env->sd;
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index 69631fa..5dd10ec 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -69,3 +69,12 @@ SCHED_FEAT(RT_RUNTIME_SHARE, true)
SCHED_FEAT(LB_MIN, false)
SCHED_FEAT(ATTACH_AGE_LOAD, true)

+SCHED_FEAT(OLD_IDLE, false)
+SCHED_FEAT(ORDER_IDLE, false)
+
+SCHED_FEAT(IDLE_CORE, true)
+SCHED_FEAT(IDLE_CPU, true)
+SCHED_FEAT(AVG_CPU, true)
+SCHED_FEAT(PRINT_AVG, false)
+
+SCHED_FEAT(IDLE_SMT, false)
diff --git a/kernel/sched/idle_task.c b/kernel/sched/idle_task.c
index 47ce949..cb394db 100644
--- a/kernel/sched/idle_task.c
+++ b/kernel/sched/idle_task.c
@@ -23,11 +23,13 @@ static void check_preempt_curr_idle(struct rq *rq, struct task_struct *p, int fl
resched_curr(rq);
}

+extern void update_idle_core(struct rq *rq);
+
static struct task_struct *
pick_next_task_idle(struct rq *rq, struct task_struct *prev)
{
put_prev_task(rq, prev);
-
+ update_idle_core(rq);
schedstat_inc(rq, sched_goidle);
return rq->idle;
}
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 69da6fc..5994794 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -866,6 +866,7 @@ struct sched_group_capacity {
* Number of busy cpus in this group.
*/
atomic_t nr_busy_cpus;
+ int has_idle_cores;

unsigned long cpumask[0]; /* iteration mask */
};
diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index 31872bc..6e42cd2 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -933,11 +933,11 @@ void tick_nohz_idle_enter(void)
WARN_ON_ONCE(irqs_disabled());

/*
- * Update the idle state in the scheduler domain hierarchy
- * when tick_nohz_stop_sched_tick() is called from the idle loop.
- * State will be updated to busy during the first busy tick after
- * exiting idle.
- */
+ * Update the idle state in the scheduler domain hierarchy
+ * when tick_nohz_stop_sched_tick() is called from the idle loop.
+ * State will be updated to busy during the first busy tick after
+ * exiting idle.
+ */
set_cpu_sd_state_idle();

local_irq_disable();

Next message: Vineet Gupta: "kmap_atomic and preemption"
Previous message: Borislav Petkov: "Re: [BUG] x86/efi: MMRs no longer properly mapped after switch to isolated page table"
In reply to: Chris Mason: "Re: sched: tweak select_idle_sibling to look for idle threads"
Next in thread: Peter Zijlstra: "Re: sched: tweak select_idle_sibling to look for idle threads"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]