Re: [PATCH] sched/fair: Skip wake_affine() for core siblings

From: Mike Galbraith
Date: Sat Sep 26 2015 - 11:25:22 EST


On Fri, 2015-09-25 at 20:54 +0300, Kirill Tkhai wrote:
> We are not interested in the actual target if both the prev
> and curr CPUs share a CPU cache. select_idle_sibling()
> searches in top-down order; the top level is the same
> for both of them, so the result will be the same.
> So we can save a few CPU cycles and cache misses
> by skipping the wake_affine() calculations.

But, whereas previously wake_affine() could NAK a migration if it would
create an imbalance, we'll now just go ahead and stack tasks if
select_idle_sibling() can't find an idle home to override the blanket
approval. It doesn't look like a good idea to me to bounce tasks around
only to then perhaps stack them, as if we do stack waker/wakee, we
certainly lose concurrency. (microbenchmarks like pipe-test love that,
but not all that many real applications play ping-pong for a living;)
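
To make the stacking concern concrete, here is a toy model of the placement
decision described above. It is a sketch under stated assumptions, not the
kernel code or the patch: place_wakee(), pick_idle_or_target(), the 4-CPU
machine and the skip_check flag are all hypothetical stand-ins, and the real
select_idle_sibling() is more involved than this.

```c
#include <stdbool.h>

#define NR_CPUS 4	/* hypothetical toy machine, all CPUs share one cache */

/*
 * Hypothetical stand-in for select_idle_sibling(): use the target if it
 * is idle, else the first idle CPU, else fall back to the busy target.
 */
int pick_idle_or_target(const bool idle[NR_CPUS], int target)
{
	if (idle[target])
		return target;
	for (int cpu = 0; cpu < NR_CPUS; cpu++)
		if (idle[cpu])
			return cpu;
	return target;	/* nothing idle: the wakee is stacked on target */
}

/*
 * affine_ok models wake_affine()'s verdict. skip_check models the blanket
 * approval: the verdict is never consulted and the waker's CPU is used.
 */
int place_wakee(const bool idle[NR_CPUS], int prev_cpu, int waker_cpu,
		bool affine_ok, bool skip_check)
{
	int target = prev_cpu;

	if (waker_cpu != prev_cpu && (skip_check || affine_ok))
		target = waker_cpu;
	return pick_idle_or_target(idle, target);
}
```

With every CPU busy and a wake_affine() NAK, place_wakee() leaves the wakee
on prev_cpu when skip_check is false, and puts it on the waker's CPU when
skip_check is true. That second case is the stacking I'm talking about.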

I spent most of the day piddling with your little patch, so I'll post
some condensed mixed load notes.

concurrent tbench 4 + pgbench, 30 seconds per client count (i4790+smt)
                          master                       master+
pgbench             1      2      3    avg        1      2      3    avg    comp
clients 1 tps = 18768  18591  18264  18541    18351  17257  17245  17617    .950
clients 2 tps = 30779  30661  31016  30818    29112  28026  29026  28721    .931
clients 4 tps = 54195  55100  54048  54447    53290  52336  52930  52852    .970
clients 8 tps = 60332  67052  64699  64027    38491  35746  37746  37327    .582!!
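
For anyone checking the numbers, the avg and comp columns can be reproduced
with the arithmetic below. This is only a sketch: avg3() and comp_milli() are
made-up helper names, and truncation (rather than rounding) is inferred from
the posted figures, not something stated anywhere.

```c
/* Integer average of three runs, truncated like the "avg" column. */
long avg3(long run1, long run2, long run3)
{
	return (run1 + run2 + run3) / 3;
}

/*
 * Ratio of patched to baseline average in thousandths, truncated,
 * so a return value of 582 corresponds to a "comp" entry of .582.
 */
long comp_milli(long baseline_avg, long patched_avg)
{
	return patched_avg * 1000 / baseline_avg;
}
```

For the 8 client row, avg3(60332, 67052, 64699) and avg3(38491, 35746, 37746)
give the two posted averages, and comp_milli() of those gives the .582.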

Do the opposite, wake_affine() always NAKs.
                          master                       master++
pgbench             1      2      3    avg        1      2      3    avg    comp
clients 1 tps = 18768  18591  18264  18541    16874  16865  16665  16801    .906
clients 2 tps = 30779  30661  31016  30818    33562  33546  33681  33596   1.090
clients 4 tps = 54195  55100  54048  54447    61544  61482  61117  61381   1.127
clients 8 tps = 60332  67052  64699  64027    75171  75524  75318  75337   1.176

...

Unpatched master vs your patch again, 2 _minutes_ per client count this time, as I
noticed much variance at 8 clients, where wake_wide() is supposed to kick in to keep
N:M load spread out.

                          master                       master+
pgbench             1      2      3    avg        1      2      3    avg    comp
clients 1 tps = 18548  18673  18390  18537    17879  17652  17621  17717    .955
clients 2 tps = 31083  31110  30859  31017    30274  30003  29796  30024    .967
clients 4 tps = 53107  53156  53601  53288    52658  53024  53449  53043    .995
clients 8 tps = 34213  34310  28844  32455    31360  31416  30732  31169    .960

30 seconds per run isn't enough, and wake_wide() is not doing a wonderful job for 1:N pgbench.

hrmph, twiddle...

waker/wakee coupling strengthened
postgres@homer:~> pgbench.sh
clients 1 tps = 18035
clients 2 tps = 32525
clients 4 tps = 53246
clients 8 tps = 37278

better, but not enough.. now also set sd_llc_size = #cores rather than #threads
postgres@homer:~> pgbench.sh
clients 1 tps = 18482
clients 2 tps = 32366
clients 4 tps = 54557
clients 8 tps = 69643
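
What "sd_llc_size = #cores vs #threads" amounts to can be sketched in
userspace. Assumptions, loudly: struct group is a hypothetical stand-in for
the kernel's sched_group, nr_cpus is an invented field, and the idea that
each group of the cache-sharing domain covers one core is taken from the
comment in the twiddle patch. Counting groups then yields cores, while
summing CPUs yields threads.

```c
#include <stddef.h>

/* Hypothetical stand-in for the kernel's struct sched_group. */
struct group {
	struct group *next;	/* circular list: last group points to first */
	int nr_cpus;		/* invented field: SMT threads in this core */
};

/* Walk the ring once and count groups; 'first' must not be NULL. */
int count_groups(const struct group *first)
{
	const struct group *g = first;
	int size = 1;

	/* Same loop shape as the patch: advance, then test for wraparound. */
	while (g = g->next, g != first)
		size++;
	return size;
}

/* Sum the CPUs of every group, i.e. the thread count of the whole span. */
int count_cpus(const struct group *first)
{
	const struct group *g = first;
	int cpus = 0;

	do {
		cpus += g->nr_cpus;
		g = g->next;
	} while (g != first);
	return cpus;
}
```

On a ring of 4 groups with 2 threads each (what I'd assume for a 4 core SMT
box), count_groups() returns 4 where count_cpus() returns 8.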

Ok, that's what I want to see, full repeat.
master = twiddle
master+ = twiddle+patch

concurrent tbench 4 + pgbench, 2 minutes per client count (i4790+smt)
                          master                       master+
pgbench             1      2      3    avg        1      2      3    avg    comp
clients 1 tps = 18599  18627  18532  18586    17480  17682  17606  17589    .946
clients 2 tps = 32344  32313  32408  32355    25167  26140  23730  25012    .773
clients 4 tps = 52593  51390  51095  51692    22983  23046  22427  22818    .441
clients 8 tps = 70354  69583  70107  70014    66924  66672  69310  67635    .966

Hrm... turn the tables, measure tbench while pgbench 4 client load runs endlessly.

                          master                       master+
tbench              1      2      3    avg        1      2      3    avg    comp
pairs 1 MB/s =    430    426    436    430      481    481    494    485   1.127
pairs 2 MB/s =   1083   1085   1072   1080     1086   1090   1083   1086   1.005
pairs 4 MB/s =   1725   1697   1729   1717     2023   2002   2006   2010   1.170
pairs 8 MB/s =   2740   2631   2700   2690     3016   2977   3071   3021   1.123

tbench without competition
                master  master+   comp
pairs 1 MB/s =     694      692   .997
pairs 2 MB/s =    1268     1259   .992
pairs 4 MB/s =    2210     2165   .979
pairs 8 MB/s =    3586     3526   .983  (yawn, all within routine variance)

twiddle:

--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6048,14 +6048,18 @@ static void update_top_cache_domain(int
 {
 	struct sched_domain *sd;
 	struct sched_domain *busy_sd = NULL;
+	struct sched_group *group;
 	int id = cpu;
 	int size = 1;

 	sd = highest_flag_domain(cpu, SD_SHARE_PKG_RESOURCES);
 	if (sd) {
 		id = cpumask_first(sched_domain_span(sd));
-		size = cpumask_weight(sched_domain_span(sd));
 		busy_sd = sd->parent; /* sd_busy */
+		group = sd->groups;
+		/* Set size to the number of cores, not threads */
+		while (group = group->next, group != sd->groups)
+			size++;
 	}
 	rcu_assign_pointer(per_cpu(sd_busy, cpu), busy_sd);

--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4421,19 +4421,26 @@ static unsigned long cpu_avg_load_per_ta

 static void record_wakee(struct task_struct *p)
 {
+	unsigned long now = jiffies;
+
 	/*
 	 * Rough decay (wiping) for cost saving, don't worry
 	 * about the boundary, really active task won't care
 	 * about the loss.
 	 */
-	if (time_after(jiffies, current->wakee_flip_decay_ts + HZ)) {
+	if (time_after(now, current->wakee_flip_decay_ts + HZ)) {
 		current->wakee_flips >>= 1;
-		current->wakee_flip_decay_ts = jiffies;
+		current->wakee_flip_decay_ts = now;
+	}
+	if (time_after(now, p->wakee_flip_decay_ts + HZ)) {
+		p->wakee_flips >>= 1;
+		p->wakee_flip_decay_ts = now;
 	}

 	if (current->last_wakee != p) {
 		current->last_wakee = p;
 		current->wakee_flips++;
+		p->wakee_flips++;
 	}
 }
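
The record_wakee() change above can be poked at outside the kernel with a
small model. Everything here is illustration under assumptions: struct task,
MODEL_HZ and record_wakee_model() are invented names, jiffies is passed in as
'now', and time_after_ul() mirrors the kernel's time_after() comparison. What
it shows is that waker and wakee both decay and both count a flip.

```c
#include <stddef.h>

#define MODEL_HZ 250UL	/* assumption: ticks per second in this model */

/* Hypothetical stand-in for the wakee-tracking fields of task_struct. */
struct task {
	unsigned long wakee_flips;
	unsigned long wakee_flip_decay_ts;
	struct task *last_wakee;
};

/* Wraparound-safe "a is later than b", same comparison as time_after(). */
static int time_after_ul(unsigned long a, unsigned long b)
{
	return (long)(b - a) < 0;
}

/* Halve the flip count if more than MODEL_HZ ticks passed since last decay. */
static void decay_flips(struct task *t, unsigned long now)
{
	if (time_after_ul(now, t->wakee_flip_decay_ts + MODEL_HZ)) {
		t->wakee_flips >>= 1;
		t->wakee_flip_decay_ts = now;
	}
}

/* Model of the patched logic: decay both tasks, count the flip on both. */
void record_wakee_model(struct task *waker, struct task *wakee,
			unsigned long now)
{
	decay_flips(waker, now);
	decay_flips(wakee, now);

	if (waker->last_wakee != wakee) {
		waker->last_wakee = wakee;
		waker->wakee_flips++;
		wakee->wakee_flips++;
	}
}
```

A waker alternating between two wakees racks up flips on itself and on each
wakee; waking the same task twice in a row adds nothing, and a gap longer
than MODEL_HZ halves the counts before the next flip is recorded.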


